Interesting. Should requiredClustering return a Set of Expressions? That way, the order of the Expressions could be determined by looking at what requiredOrdering() returns.
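
A minimal sketch of that alternative, for concreteness (a hypothetical signature, not a committed API; Expression is the Catalyst type from the proposal quoted below):

    interface RequiresClustering {
      // Unordered: declares only *which* expressions the data must be
      // clustered by; any ordering among them would have to come from
      // requiredOrdering() instead.
      Set<Expression> requiredClustering();
    }
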
On Mon, Mar 26, 2018 at 5:45 PM, Ryan Blue <rb...@netflix.com.invalid> wrote:

> Hi Pat,
>
> Thanks for starting the discussion on this, we’re really interested in it
> as well. I don’t think there is a proposed API yet, but I was thinking
> something like this:
>
>     interface RequiresClustering {
>       List<Expression> requiredClustering();
>     }
>
>     interface RequiresSort {
>       List<SortOrder> requiredOrdering();
>     }
>
> The reason why RequiresClustering should provide Expression is that it
> needs to be able to customize the implementation. For example, writing to
> HTable would require building a key (or the data for a key), and that
> might use a hash function that differs from Spark’s built-ins.
> RequiresSort is fairly straightforward, but the interaction between the
> two requirements deserves some consideration. To make the two compatible,
> I think that RequiresSort must be interpreted as a sort within each
> partition of the clustering, but could possibly be used for a global sort
> when the two overlap.
>
> For example, if I have a table partitioned by “day” and “category”, then
> the required clustering would be by day, category. A required sort might
> be day ASC, category DESC, name ASC. Because that sort satisfies the
> required clustering, it could be used for a global ordering. But is that
> useful? How would the global ordering matter beyond a sort within each
> partition, i.e., how would the partition’s place in the global ordering
> be passed?
>
> To your other questions, you might want to have a look at the recent SPIP
> I’m working on to consolidate and clean up logical plans
> <https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5a987801#heading=h.m45webtwxf2d>.
> That proposes more specific uses for the DataSourceV2 API that should
> help clarify what validation needs to take place. As for custom catalyst
> rules, I’d like to hear about the use cases to see if we can build them
> into these improvements.
>
> rb
>
>
> On Mon, Mar 26, 2018 at 8:40 AM, Patrick Woody <patrick.woo...@gmail.com>
> wrote:
>
>> Hey all,
>>
>> I saw in some of the discussions around DataSourceV2 writes that we
>> might have the data source inform Spark of requirements for the input
>> data's ordering and partitioning. Has there been a proposed API for
>> that yet?
>>
>> Even one level up, it would be helpful to understand how I should be
>> thinking about the responsibility of the data source writer, when I
>> should be inserting a custom catalyst rule, and how I should handle
>> validation/assumptions of the table before attempting the write.
>>
>> Thanks!
>> Pat
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
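
For concreteness, here is a self-contained sketch of how a sink for the day/category table described above might implement the two proposed mix-ins. Everything here is hypothetical: Expression, SortOrder, ColumnRef, Direction, and DayCategoryWriter are minimal stand-ins, not Catalyst's real (much richer) types.

    import java.util.Arrays;
    import java.util.List;

    // Minimal stand-ins for Catalyst's Expression and SortOrder, just to
    // keep the sketch compilable on its own.
    interface Expression {}

    final class ColumnRef implements Expression {
      final String name;
      ColumnRef(String name) { this.name = name; }
    }

    enum Direction { ASC, DESC }

    final class SortOrder {
      final Expression child;
      final Direction direction;
      SortOrder(Expression child, Direction direction) {
        this.child = child;
        this.direction = direction;
      }
    }

    // The proposed mix-in interfaces from the thread above.
    interface RequiresClustering {
      List<Expression> requiredClustering();
    }

    interface RequiresSort {
      List<SortOrder> requiredOrdering();
    }

    // A writer for a table partitioned by (day, category): cluster by the
    // partition columns, and sort by day ASC, category DESC, name ASC.
    final class DayCategoryWriter implements RequiresClustering, RequiresSort {
      @Override
      public List<Expression> requiredClustering() {
        return Arrays.asList(new ColumnRef("day"), new ColumnRef("category"));
      }

      @Override
      public List<SortOrder> requiredOrdering() {
        return Arrays.asList(
            new SortOrder(new ColumnRef("day"), Direction.ASC),
            new SortOrder(new ColumnRef("category"), Direction.DESC),
            new SortOrder(new ColumnRef("name"), Direction.ASC));
      }
    }

Because the sort's leading columns (day, category) match the required clustering, this is exactly the overlapping case discussed above, where the per-partition sort could in principle double as a global ordering.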