Wenchen, I thought SupportsReportPartitioning was for the read side. Does it work with the write side as well?
On Mon, Mar 26, 2018 at 6:58 PM, Wenchen Fan <cloud0...@gmail.com> wrote:

> Actually, clustering is already supported; please take a look at
> SupportsReportPartitioning.
>
> Ordering is not proposed yet; it might be similar to what Ryan proposed.
>
> On Mon, Mar 26, 2018 at 6:11 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Interesting.
>>
>> Should requiredClustering return a Set of Expressions?
>> That way, we can determine the order of the Expressions by looking at
>> what requiredOrdering() returns.
>>
>> On Mon, Mar 26, 2018 at 5:45 PM, Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> Hi Pat,
>>>
>>> Thanks for starting the discussion on this; we’re really interested in
>>> it as well. I don’t think there is a proposed API yet, but I was thinking
>>> something like this:
>>>
>>> interface RequiresClustering {
>>>   List<Expression> requiredClustering();
>>> }
>>>
>>> interface RequiresSort {
>>>   List<SortOrder> requiredOrdering();
>>> }
>>>
>>> The reason RequiresClustering should provide Expression is that it
>>> needs to be able to customize the implementation. For example, writing to
>>> an HTable would require building a key (or the data for a key), and that
>>> might use a hash function that differs from Spark’s built-ins.
>>> RequiresSort is fairly straightforward, but the interaction between the
>>> two requirements deserves some consideration. To make the two compatible,
>>> I think that RequiresSort must be interpreted as a sort within each
>>> partition of the clustering, but it could possibly be used for a global
>>> sort when the two overlap.
>>>
>>> For example, if I have a table partitioned by “day” and “category”,
>>> then the RequiresClustering would be by day, category. A required sort
>>> might be day ASC, category DESC, name ASC. Because that sort satisfies
>>> the required clustering, it could be used for a global ordering. But is
>>> that useful? How would the global ordering matter beyond a sort within
>>> each partition, i.e., how would the partition’s place in the global
>>> ordering be passed?
>>>
>>> To your other questions, you might want to have a look at the recent
>>> SPIP I’m working on to consolidate and clean up logical plans
>>> <https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5a987801#heading=h.m45webtwxf2d>.
>>> It proposes more specific uses for the DataSourceV2 API that should help
>>> clarify what validation needs to take place. As for custom catalyst
>>> rules, I’d like to hear about the use cases to see if we can build them
>>> into these improvements.
>>>
>>> rb
>>>
>>> On Mon, Mar 26, 2018 at 8:40 AM, Patrick Woody
>>> <patrick.woo...@gmail.com> wrote:
>>>
>>>> Hey all,
>>>>
>>>> I saw in some of the discussions around DataSourceV2 writes that we
>>>> might have the data source inform Spark of requirements for the input
>>>> data’s ordering and partitioning. Has there been a proposed API for
>>>> that yet?
>>>>
>>>> Even one level up, it would be helpful to understand how I should be
>>>> thinking about the responsibility of the data source writer, when I
>>>> should be inserting a custom catalyst rule, and how I should handle
>>>> validation/assumptions of the table before attempting the write.
>>>>
>>>> Thanks!
>>>> Pat
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix

--
Ryan Blue
Software Engineer
Netflix
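[Editor's note: for concreteness, here is a rough sketch of how a sink for the day/category table discussed above might implement the two proposed interfaces. Everything here is hypothetical: `RequiresClustering` and `RequiresSort` are only Ryan's proposal, not a committed DataSourceV2 API, and the `Expression`/`SortOrder` types below are trivial stand-ins for Spark's catalyst classes, used so the sketch is self-contained.]

```java
import java.util.Arrays;
import java.util.List;

public class RequirementsSketch {
    // Trivial stand-ins for catalyst's Expression and SortOrder, for illustration only.
    record Expression(String column) {}
    enum Direction { ASC, DESC }
    record SortOrder(String column, Direction direction) {}

    // The proposed (hypothetical) write-side interfaces from the thread.
    interface RequiresClustering {
        List<Expression> requiredClustering();
    }

    interface RequiresSort {
        List<SortOrder> requiredOrdering();
    }

    // A sink for a table partitioned by "day" and "category": it asks Spark to
    // cluster the input by the partition columns, and to sort by
    // day ASC, category DESC, name ASC. Because that sort's prefix matches the
    // clustering columns, it also satisfies the clustering requirement, which is
    // the overlap case Ryan raises above.
    static class PartitionedTableSink implements RequiresClustering, RequiresSort {
        @Override
        public List<Expression> requiredClustering() {
            return Arrays.asList(new Expression("day"), new Expression("category"));
        }

        @Override
        public List<SortOrder> requiredOrdering() {
            return Arrays.asList(
                new SortOrder("day", Direction.ASC),
                new SortOrder("category", Direction.DESC),
                new SortOrder("name", Direction.ASC));
        }
    }

    public static void main(String[] args) {
        PartitionedTableSink sink = new PartitionedTableSink();
        System.out.println("clustering: " + sink.requiredClustering());
        System.out.println("ordering:   " + sink.requiredOrdering());
    }
}
```

Under this reading, a planner would repartition by the clustering expressions first and then apply the required ordering within each partition; whether it may promote the overlapping sort to a global one is exactly the open question in the thread.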