Hmm. Ryan seems to be right.

Looking at
sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsReportPartitioning.java:

import org.apache.spark.sql.sources.v2.reader.partitioning.Partitioning;
...
  Partitioning outputPartitioning();
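
To illustrate how that reader-side contract is meant to be used, here is a rough
sketch of my own (not code from Spark; the class and column names are made up): a
reader that mixes in SupportsReportPartitioning returns a Partitioning, and Spark
asks it whether it satisfies a required Distribution.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.spark.sql.sources.v2.reader.partitioning.ClusteredDistribution;
import org.apache.spark.sql.sources.v2.reader.partitioning.Distribution;
import org.apache.spark.sql.sources.v2.reader.partitioning.Partitioning;

// Hypothetical Partitioning for a source whose data is already bucketed by (day, category).
// The reader would return an instance of this from outputPartitioning().
class DayCategoryPartitioning implements Partitioning {
  private final Set<String> partitionColumns =
      new HashSet<>(Arrays.asList("day", "category"));
  private final int numPartitions;

  DayCategoryPartitioning(int numPartitions) {
    this.numPartitions = numPartitions;
  }

  @Override
  public int numPartitions() {
    return numPartitions;
  }

  @Override
  public boolean satisfy(Distribution distribution) {
    // A clustered distribution is satisfied if every column we partition by appears in
    // the requested clustering: rows sharing the requested keys then share a partition.
    if (distribution instanceof ClusteredDistribution) {
      Set<String> requested = new HashSet<>(
          Arrays.asList(((ClusteredDistribution) distribution).clusteredColumns));
      return requested.containsAll(partitionColumns);
    }
    return false;
  }
}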

On Mon, Mar 26, 2018 at 6:58 PM, Wenchen Fan <cloud0...@gmail.com> wrote:

> Actually, clustering is already supported; please take a look at
> SupportsReportPartitioning.
>
> Ordering is not proposed yet; it might be similar to what Ryan proposed.
>
> On Mon, Mar 26, 2018 at 6:11 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Interesting.
>>
>> Should requiredClustering return a Set of Expressions?
>> That way, we can determine the order of the Expressions by looking at what
>> requiredOrdering() returns.
>>
>> On Mon, Mar 26, 2018 at 5:45 PM, Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> Hi Pat,
>>>
>>> Thanks for starting the discussion on this, we’re really interested in
>>> it as well. I don’t think there is a proposed API yet, but I was thinking
>>> something like this:
>>>
>>> interface RequiresClustering {
>>>   List<Expression> requiredClustering();
>>> }
>>>
>>> interface RequiresSort {
>>>   List<SortOrder> requiredOrdering();
>>> }
>>>
>>> The reason RequiresClustering should provide Expression is that the data source
>>> needs to be able to customize the implementation. For example, writing to
>>> HTable would require building a key (or the data for a key) and that might
>>> use a hash function that differs from Spark’s built-ins. RequiresSort
>>> is fairly straightforward, but the interaction between the two requirements
>>> deserves some consideration. To make the two compatible, I think that
>>> RequiresSort must be interpreted as a sort within each partition of the
>>> clustering, but could possibly be used for a global sort when the two
>>> overlap.
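>>>
>>> To make this concrete, a hypothetical sink for the day/category table below
>>> might implement the two proposed interfaces roughly like this (the class name
>>> and the Column.expr()/cast trick are just mine, for illustration):
>>>
>>> import java.util.Arrays;
>>> import java.util.List;
>>> import org.apache.spark.sql.catalyst.expressions.Expression;
>>> import org.apache.spark.sql.catalyst.expressions.SortOrder;
>>> import static org.apache.spark.sql.functions.col;
>>>
>>> // Hypothetical sink partitioned by (day, category) that keeps rows sorted within
>>> // each partition; the actual write-path methods are omitted.
>>> class PartitionedTableWriter implements RequiresClustering, RequiresSort {
>>>   @Override
>>>   public List<Expression> requiredClustering() {
>>>     // Cluster the input so each task sees whole (day, category) partitions.
>>>     return Arrays.asList(col("day").expr(), col("category").expr());
>>>   }
>>>
>>>   @Override
>>>   public List<SortOrder> requiredOrdering() {
>>>     // Sort within each cluster: day ASC, category DESC, name ASC.
>>>     // Column.asc()/desc() wrap a catalyst SortOrder, so the casts are safe.
>>>     return Arrays.asList(
>>>         (SortOrder) col("day").asc().expr(),
>>>         (SortOrder) col("category").desc().expr(),
>>>         (SortOrder) col("name").asc().expr());
>>>   }
>>> }
>>>
>>> Spark would then presumably add whatever shuffle and per-partition sort are
>>> needed to satisfy these before invoking the writer.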
>>>
>>> For example, if I have a table partitioned by “day” and “category” then
>>> the required clustering would be by day, category. A required sort might
>>> be day ASC, category DESC, name ASC. Because that sort satisfies the
>>> required clustering, it could be used for a global ordering. But, is that
>>> useful? How would the global ordering matter beyond a sort within each
>>> partition, i.e., how would the partition’s place in the global ordering be
>>> passed?
>>>
>>> To your other questions, you might want to have a look at the recent
>>> SPIP I’m working on to consolidate and clean up logical plans
>>> <https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5a987801#heading=h.m45webtwxf2d>.
>>> That proposes more specific uses for the DataSourceV2 API that should help
>>> clarify what validation needs to take place. As for custom catalyst rules,
>>> I’d like to hear about the use cases to see if we can build it into these
>>> improvements.
>>>
>>> rb
>>> ​
>>>
>>> On Mon, Mar 26, 2018 at 8:40 AM, Patrick Woody <patrick.woo...@gmail.com
>>> > wrote:
>>>
>>>> Hey all,
>>>>
>>>> I saw in some of the discussions around DataSourceV2 writes that we
>>>> might have the data source inform Spark of requirements for the input
>>>> data's ordering and partitioning. Has there been a proposed API for that
>>>> yet?
>>>>
>>>> Even one level up it would be helpful to understand how I should be
>>>> thinking about the responsibility of the data source writer, when I should
>>>> be inserting a custom catalyst rule, and how I should handle
>>>> validation/assumptions of the table before attempting the write.
>>>>
>>>> Thanks!
>>>> Pat
>>>>
>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>>
>
