Hi Pat,

Thanks for starting the discussion on this; we're really interested in it as
well. I don't think there is a proposed API yet, but I was thinking something
like this:
interface RequiresClustering {
  List<Expression> requiredClustering();
}

interface RequiresSort {
  List<SortOrder> requiredOrdering();
}

The reason RequiresClustering should provide Expression is that the source
needs to be able to customize the implementation. For example, writing to
HTable would require building a key (or the data for a key), and that might
use a hash function that differs from Spark's built-ins.

RequiresSort is fairly straightforward, but the interaction between the two
requirements deserves some consideration. To make the two compatible, I think
RequiresSort must be interpreted as a sort within each partition of the
clustering, but it could possibly be used for a global sort when the two
overlap. For example, if I have a table partitioned by "day" and "category",
then the required clustering would be by day, category. A required sort might
be day ASC, category DESC, name ASC. Because that sort satisfies the required
clustering, it could be used for a global ordering. But is that useful? How
would the global ordering matter beyond a sort within each partition, i.e.,
how would the partition's place in the global ordering be passed?

To your other questions, you might want to have a look at the recent SPIP I'm
working on to consolidate and clean up logical plans:
<https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5a987801#heading=h.m45webtwxf2d>
That proposes more specific uses for the DataSourceV2 API that should help
clarify what validation needs to take place. As for custom catalyst rules,
I'd like to hear about the use cases to see if we can build them into these
improvements.

rb

On Mon, Mar 26, 2018 at 8:40 AM, Patrick Woody <patrick.woo...@gmail.com> wrote:
> Hey all,
>
> I saw in some of the discussions around DataSourceV2 writes that we might
> have the data source inform Spark of requirements for the input data's
> ordering and partitioning. Has there been a proposed API for that yet?
> Even one level up, it would be helpful to understand how I should be
> thinking about the responsibility of the data source writer, when I should
> be inserting a custom catalyst rule, and how I should handle
> validation/assumptions of the table before attempting the write.
>
> Thanks!
> Pat

--
Ryan Blue
Software Engineer
Netflix
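
[Editor's note: for concreteness, the two interfaces sketched in the reply
might be fleshed out as below. The Expression and SortOrder types here are
simplified stand-ins, not Spark's catalyst classes, and every name besides
RequiresClustering and RequiresSort is an assumption made for illustration.]

```java
import java.util.Arrays;
import java.util.List;

public class WriteRequirementsSketch {
    interface Expression {}                 // stand-in for catalyst Expression

    enum Direction { ASC, DESC }

    static class SortOrder {                // stand-in for catalyst SortOrder
        final String column;
        final Direction direction;
        SortOrder(String column, Direction direction) {
            this.column = column;
            this.direction = direction;
        }
    }

    // A source implements this to ask Spark to cluster incoming rows so that
    // rows with equal values for these expressions land in the same task.
    interface RequiresClustering {
        List<Expression> requiredClustering();
    }

    // A source implements this to request an ordering of each task's rows.
    interface RequiresSort {
        List<SortOrder> requiredOrdering();
    }

    // Returning an Expression (not just a column name) lets a sink such as
    // HTable cluster by its own key hash instead of Spark's built-in hash.
    // This class is a hypothetical example, not an HTable API.
    static class HTableKeyHash implements Expression {
        final String keyColumn;
        HTableKeyHash(String keyColumn) { this.keyColumn = keyColumn; }
    }

    // A writer for the example table partitioned by day and category.
    static class ExampleWriter implements RequiresClustering, RequiresSort {
        public List<Expression> requiredClustering() {
            return Arrays.asList(new HTableKeyHash("day"),
                                 new HTableKeyHash("category"));
        }
        public List<SortOrder> requiredOrdering() {
            return Arrays.asList(new SortOrder("day", Direction.ASC),
                                 new SortOrder("category", Direction.DESC),
                                 new SortOrder("name", Direction.ASC));
        }
    }

    public static void main(String[] args) {
        ExampleWriter w = new ExampleWriter();
        System.out.println(w.requiredClustering().size());  // 2
        System.out.println(w.requiredOrdering().size());    // 3
    }
}
```

Spark would inspect a writer for these interfaces and insert the needed
shuffle and sort before invoking it, which is why the requirements are
declared as data rather than enforced by the writer itself.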
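
[Editor's note: the prefix rule in the day/category example — a required sort
can double as a global sort when its leading columns cover the required
clustering — can be sketched as a small check. The method name and the
column-name model are illustrative assumptions, not part of any proposed API.]

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SortClusteringCheck {
    // True if the first clustering.size() sort columns are exactly the
    // clustering columns. Order within that prefix does not matter for
    // clustering, only membership, so (day, category) and (category, day)
    // prefixes both satisfy a {day, category} clustering.
    static boolean satisfiesClustering(List<String> sortColumns,
                                       Set<String> clusteringColumns) {
        if (sortColumns.size() < clusteringColumns.size()) {
            return false;
        }
        Set<String> prefix = new HashSet<>(
            sortColumns.subList(0, clusteringColumns.size()));
        return prefix.equals(clusteringColumns);
    }

    public static void main(String[] args) {
        Set<String> clustering = new HashSet<>(Arrays.asList("day", "category"));

        // day ASC, category DESC, name ASC: the leading columns cover the
        // clustering, so this sort could serve as a global ordering.
        System.out.println(satisfiesClustering(
            Arrays.asList("day", "category", "name"), clustering));  // true

        // A sort that does not lead with the clustering columns can only
        // order rows within each partition, not globally.
        System.out.println(satisfiesClustering(
            Arrays.asList("name", "day"), clustering));              // false
    }
}
```

This also makes the open question concrete: even when the check passes, some
mechanism would still have to communicate each partition's position in the
global ordering to the sink for the global sort to be useful.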