Interesting. Should requiredClustering return a Set of Expressions? That way, the order of the Expressions could be determined by looking at what requiredOrdering() returns.
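
A minimal sketch of that alternative, for concreteness (a hypothetical signature, not a committed API; Expression is the Catalyst type from the proposal quoted below):

    interface RequiresClustering {
      // Unordered: declares only *which* expressions the data must be
      // clustered by; any ordering among them would have to come from
      // requiredOrdering() instead.
      Set<Expression> requiredClustering();
    }
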
On Mon, Mar 26, 2018 at 5:45 PM, Ryan Blue <rb...@netflix.com.invalid> wrote:

> Hi Pat,
>
> Thanks for starting the discussion on this, we’re really interested in it
> as well. I don’t think there is a proposed API yet, but I was thinking
> something like this:
>
>     interface RequiresClustering {
>       List<Expression> requiredClustering();
>     }
>
>     interface RequiresSort {
>       List<SortOrder> requiredOrdering();
>     }
>
> The reason why RequiresClustering should provide Expression is that it
> needs to be able to customize the implementation. For example, writing to
> HTable would require building a key (or the data for a key), and that
> might use a hash function that differs from Spark’s built-ins.
> RequiresSort is fairly straightforward, but the interaction between the
> two requirements deserves some consideration. To make the two compatible,
> I think that RequiresSort must be interpreted as a sort within each
> partition of the clustering, but could possibly be used for a global sort
> when the two overlap.
>
> For example, if I have a table partitioned by “day” and “category”, then
> the required clustering would be by day, category. A required sort might
> be day ASC, category DESC, name ASC. Because that sort satisfies the
> required clustering, it could be used for a global ordering. But is that
> useful? How would the global ordering matter beyond a sort within each
> partition, i.e., how would the partition’s place in the global ordering
> be passed?
>
> To your other questions, you might want to have a look at the recent SPIP
> I’m working on to consolidate and clean up logical plans
> <https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5a987801#heading=h.m45webtwxf2d>.
> That proposes more specific uses for the DataSourceV2 API that should
> help clarify what validation needs to take place. As for custom catalyst
> rules, I’d like to hear about the use cases to see if we can build them
> into these improvements.
>
> rb
>
>
> On Mon, Mar 26, 2018 at 8:40 AM, Patrick Woody <patrick.woo...@gmail.com>
> wrote:
>
>> Hey all,
>>
>> I saw in some of the discussions around DataSourceV2 writes that we
>> might have the data source inform Spark of requirements for the input
>> data's ordering and partitioning. Has there been a proposed API for
>> that yet?
>>
>> Even one level up, it would be helpful to understand how I should be
>> thinking about the responsibility of the data source writer, when I
>> should be inserting a custom catalyst rule, and how I should handle
>> validation/assumptions of the table before attempting the write.
>>
>> Thanks!
>> Pat
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
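
For concreteness, here is a self-contained sketch of how a sink for the day/category table described above might implement the two proposed mix-ins. Everything here is hypothetical: Expression, SortOrder, ColumnRef, Direction, and DayCategoryWriter are minimal stand-ins, not Catalyst's real (much richer) types.

    import java.util.Arrays;
    import java.util.List;

    // Minimal stand-ins for Catalyst's Expression and SortOrder, just to
    // keep the sketch compilable on its own.
    interface Expression {}

    final class ColumnRef implements Expression {
      final String name;
      ColumnRef(String name) { this.name = name; }
    }

    enum Direction { ASC, DESC }

    final class SortOrder {
      final Expression child;
      final Direction direction;
      SortOrder(Expression child, Direction direction) {
        this.child = child;
        this.direction = direction;
      }
    }

    // The proposed mix-in interfaces from the thread above.
    interface RequiresClustering {
      List<Expression> requiredClustering();
    }

    interface RequiresSort {
      List<SortOrder> requiredOrdering();
    }

    // A writer for a table partitioned by (day, category): cluster by the
    // partition columns, and sort by day ASC, category DESC, name ASC.
    final class DayCategoryWriter implements RequiresClustering, RequiresSort {
      @Override
      public List<Expression> requiredClustering() {
        return Arrays.asList(new ColumnRef("day"), new ColumnRef("category"));
      }

      @Override
      public List<SortOrder> requiredOrdering() {
        return Arrays.asList(
            new SortOrder(new ColumnRef("day"), Direction.ASC),
            new SortOrder(new ColumnRef("category"), Direction.DESC),
            new SortOrder(new ColumnRef("name"), Direction.ASC));
      }
    }

Because the sort's leading columns (day, category) match the required clustering, this is exactly the overlapping case discussed above, where the per-partition sort could in principle double as a global ordering.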