Yeah, it is for the read side only. I think for the write side, implementations can provide options to let users set the partitioning/ordering, or the data source has a natural partitioning/ordering that doesn't require any interface.
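Since there is no proposed write-side API yet, here is a minimal, self-contained sketch of what such requirement interfaces could look like, following the `RequiresClustering`/`RequiresSort` shape Ryan suggests in the quoted thread below. The `Expression` and `SortOrder` classes are toy stand-ins for Spark's catalyst types, and `ExampleWriter` is hypothetical; none of this is Spark's actual API.

```java
import java.util.Arrays;
import java.util.List;

// Toy stand-ins for Spark's catalyst Expression and SortOrder types,
// used only to keep this sketch self-contained.
class Expression {
    final String name;
    Expression(String name) { this.name = name; }
}

class SortOrder {
    final Expression child;
    final boolean ascending;
    SortOrder(Expression child, boolean ascending) {
        this.child = child;
        this.ascending = ascending;
    }
}

// The proposed write-side mix-ins: a data source declares the clustering
// and ordering it needs from the input data before the write runs.
interface RequiresClustering {
    List<Expression> requiredClustering();
}

interface RequiresSort {
    List<SortOrder> requiredOrdering();
}

// Hypothetical writer for a table partitioned by day/category, asking for
// the sort from Ryan's example: day ASC, category DESC, name ASC.
class ExampleWriter implements RequiresClustering, RequiresSort {
    public List<Expression> requiredClustering() {
        return Arrays.asList(new Expression("day"), new Expression("category"));
    }

    public List<SortOrder> requiredOrdering() {
        return Arrays.asList(
            new SortOrder(new Expression("day"), true),
            new SortOrder(new Expression("category"), false),
            new SortOrder(new Expression("name"), true));
    }
}

class Main {
    public static void main(String[] args) {
        ExampleWriter w = new ExampleWriter();
        System.out.println(w.requiredClustering().size() + " clustering expressions, "
            + w.requiredOrdering().size() + " sort orders");
    }
}
```

Because the required sort here starts with the clustering expressions, Spark could in principle satisfy both with one global sort; otherwise it would cluster first and sort within each partition, as discussed below.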
On Mon, Mar 26, 2018 at 7:59 PM, Patrick Woody <patrick.woo...@gmail.com> wrote:

> Hey Ryan, Ted, Wenchen
>
> Thanks for the quick replies.
>
> @Ryan - the sorting portion makes sense, but I think we'd have to ensure
> something similar to requiredChildDistribution in SparkPlan where we have
> the number of partitions as well if we'd want to further report to
> SupportsReportPartitioning, yeah?
>
> Specifying an explicit global sort can also be useful for filtering
> purposes on Parquet row group stats if we have a time based/high
> cardinality ID field. If my datasource or catalog knows about previous
> queries on a table, it could be really useful to recommend more appropriate
> formatting for consumers on the next materialization. The same would be
> true of clustering on commonly joined fields.
>
> Thanks again
> Pat
>
> On Mon, Mar 26, 2018 at 10:05 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Hmm. Ryan seems to be right.
>>
>> Looking at sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsReportPartitioning.java :
>>
>>   import org.apache.spark.sql.sources.v2.reader.partitioning.Partitioning;
>>   ...
>>   Partitioning outputPartitioning();
>>
>> On Mon, Mar 26, 2018 at 6:58 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> Actually clustering is already supported, please take a look at
>>> SupportsReportPartitioning
>>>
>>> Ordering is not proposed yet, might be similar to what Ryan proposed.
>>>
>>> On Mon, Mar 26, 2018 at 6:11 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> Interesting.
>>>>
>>>> Should requiredClustering return a Set of Expression's ?
>>>> This way, we can determine the order of Expression's by looking at what
>>>> requiredOrdering() returns.
>>>>
>>>> On Mon, Mar 26, 2018 at 5:45 PM, Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>
>>>>> Hi Pat,
>>>>>
>>>>> Thanks for starting the discussion on this, we’re really interested in
>>>>> it as well.
>>>>> I don’t think there is a proposed API yet, but I was thinking
>>>>> something like this:
>>>>>
>>>>>   interface RequiresClustering {
>>>>>     List<Expression> requiredClustering();
>>>>>   }
>>>>>
>>>>>   interface RequiresSort {
>>>>>     List<SortOrder> requiredOrdering();
>>>>>   }
>>>>>
>>>>> The reason why RequiresClustering should provide Expression is that
>>>>> it needs to be able to customize the implementation. For example, writing
>>>>> to HTable would require building a key (or the data for a key) and that
>>>>> might use a hash function that differs from Spark’s built-ins.
>>>>> RequiresSort is fairly straightforward, but the interaction between
>>>>> the two requirements deserves some consideration. To make the two
>>>>> compatible, I think that RequiresSort must be interpreted as a sort
>>>>> within each partition of the clustering, but could possibly be used for a
>>>>> global sort when the two overlap.
>>>>>
>>>>> For example, if I have a table partitioned by “day” and “category”
>>>>> then the RequiredClustering would be by day, category. A required
>>>>> sort might be day ASC, category DESC, name ASC. Because that sort
>>>>> satisfies the required clustering, it could be used for a global ordering.
>>>>> But, is that useful? How would the global ordering matter beyond a sort
>>>>> within each partition, i.e., how would the partition’s place in the global
>>>>> ordering be passed?
>>>>>
>>>>> To your other questions, you might want to have a look at the recent
>>>>> SPIP I’m working on to consolidate and clean up logical plans
>>>>> <https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5a987801#heading=h.m45webtwxf2d>.
>>>>> That proposes more specific uses for the DataSourceV2 API that should help
>>>>> clarify what validation needs to take place. As for custom catalyst rules,
>>>>> I’d like to hear about the use cases to see if we can build it into these
>>>>> improvements.
>>>>>
>>>>> rb
>>>>>
>>>>> On Mon, Mar 26, 2018 at 8:40 AM, Patrick Woody <patrick.woo...@gmail.com> wrote:
>>>>>
>>>>>> Hey all,
>>>>>>
>>>>>> I saw in some of the discussions around DataSourceV2 writes that we
>>>>>> might have the data source inform Spark of requirements for the input
>>>>>> data's ordering and partitioning. Has there been a proposed API for that
>>>>>> yet?
>>>>>>
>>>>>> Even one level up it would be helpful to understand how I should be
>>>>>> thinking about the responsibility of the data source writer, when I should
>>>>>> be inserting a custom catalyst rule, and how I should handle
>>>>>> validation/assumptions of the table before attempting the write.
>>>>>>
>>>>>> Thanks!
>>>>>> Pat
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
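To round out the read side that Ted and Wenchen point at above: SupportsReportPartitioning already lets a reader declare how its output is partitioned. Below is a minimal sketch of how a source might implement it. The `Partitioning` interface here is a simplified stand-in for the real one in org.apache.spark.sql.sources.v2.reader.partitioning (whose `satisfy` method takes a `Distribution`, not column names), and `ClusteredPartitioning`/`ExampleReader` are hypothetical.

```java
import java.util.Arrays;

// Simplified stand-in for Spark's Partitioning; the method set here is
// illustrative, not Spark's exact interface.
interface Partitioning {
    int numPartitions();
    boolean satisfy(String[] clusteredColumns);
}

class ClusteredPartitioning implements Partitioning {
    private final String[] columns;
    private final int numPartitions;

    ClusteredPartitioning(String[] columns, int numPartitions) {
        this.columns = columns;
        this.numPartitions = numPartitions;
    }

    public int numPartitions() { return numPartitions; }

    // The reader's output satisfies a required clustering only when it is
    // clustered on exactly the same columns.
    public boolean satisfy(String[] clusteredColumns) {
        return Arrays.equals(columns, clusteredColumns);
    }
}

// The read-side mix-in quoted in the thread: a reader reports how its
// output is already partitioned so Spark can avoid an extra shuffle.
interface SupportsReportPartitioning {
    Partitioning outputPartitioning();
}

class ExampleReader implements SupportsReportPartitioning {
    public Partitioning outputPartitioning() {
        return new ClusteredPartitioning(new String[] {"day", "category"}, 8);
    }
}

class Main {
    public static void main(String[] args) {
        Partitioning p = new ExampleReader().outputPartitioning();
        System.out.println(p.numPartitions() + " partitions, clustered on day/category: "
            + p.satisfy(new String[] {"day", "category"}));
    }
}
```

Note that `numPartitions()` is exposed here, which speaks to Pat's point about needing partition counts (as in requiredChildDistribution) if the write side is ever to line up with what SupportsReportPartitioning reports.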