Hi Pat,

Thanks for starting the discussion on this; we’re really interested in it
as well. I don’t think there is a proposed API yet, but I was thinking of
something like this:

interface RequiresClustering {
  List<Expression> requiredClustering();
}

interface RequiresSort {
  List<SortOrder> requiredOrdering();
}

The reason RequiresClustering should provide Expression is that the data
source needs to be able to customize how the clustering is computed. For
example, writing to HTable would require building a key (or the data for a
key), and that might use a hash function that differs from Spark’s
built-ins. RequiresSort is fairly straightforward, but the interaction
between the two requirements deserves some consideration. To make the two
compatible, I think RequiresSort must be interpreted as a sort within each
partition of the clustering, but it could possibly be used for a global
sort when the two overlap.
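To make that concrete, here is a rough, purely hypothetical sketch of how
the HTable case might implement both interfaces. It is simplified to plain
strings in place of Spark’s Expression and SortOrder, and the
HTableWriter class and htableKeyHash name are made up for illustration:

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-ins for the proposed interfaces, using plain strings
// in place of Spark's Expression and SortOrder.
interface RequiresClustering {
  List<String> requiredClustering();
}

interface RequiresSort {
  List<String> requiredOrdering();
}

// Hypothetical HTable writer: clusters rows by a source-specific key
// expression and sorts within each resulting partition.
class HTableWriter implements RequiresClustering, RequiresSort {
  @Override
  public List<String> requiredClustering() {
    // Stand-in for a custom Expression that builds the HTable row key;
    // this hash need not match any of Spark's built-in functions.
    return Arrays.asList("htableKeyHash(day, category)");
  }

  @Override
  public List<String> requiredOrdering() {
    // Interpreted as the sort within each clustered partition.
    return Arrays.asList("day ASC", "category DESC", "name ASC");
  }
}
```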

For example, if I have a table partitioned by “day” and “category”, then
the required clustering would be by day, category. A required sort might be
day ASC, category DESC, name ASC. Because that sort satisfies the required
clustering, it could be used for a global ordering. But is that useful?
How would the global ordering matter beyond a sort within each partition,
i.e., how would each partition’s place in the global ordering be passed?
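One way Spark could decide whether a global sort also satisfies the
clustering is a simple prefix check: the leading sort columns must cover
the clustering columns. A rough sketch, again using plain column names
instead of Expression/SortOrder (the class and method names are
illustrative, not a real API):

```java
import java.util.Arrays;
import java.util.List;

class SortClusteringCompat {
  // Returns true if a global sort on `orderingColumns` (directions
  // stripped) also satisfies `clustering`, i.e. the leading sort
  // columns cover the clustering columns.
  static boolean globalSortSatisfiesClustering(
      List<String> clustering, List<String> orderingColumns) {
    if (orderingColumns.size() < clustering.size()) {
      return false;
    }
    return orderingColumns.subList(0, clustering.size())
        .containsAll(clustering);
  }

  public static void main(String[] args) {
    List<String> clustering = Arrays.asList("day", "category");
    // Sort was day ASC, category DESC, name ASC.
    List<String> ordering = Arrays.asList("day", "category", "name");
    System.out.println(
        globalSortSatisfiesClustering(clustering, ordering)); // prints "true"
  }
}
```

When the check fails, the sort would be planned within each partition of
the clustering instead.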

To your other questions, you might want to have a look at the recent SPIP
I’m working on to consolidate and clean up logical plans
<https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5a987801#heading=h.m45webtwxf2d>.
That proposes more specific uses for the DataSourceV2 API that should help
clarify what validation needs to take place. As for custom catalyst rules,
I’d like to hear about the use cases to see if we can build support for
them into these improvements.

rb

On Mon, Mar 26, 2018 at 8:40 AM, Patrick Woody <patrick.woo...@gmail.com>
wrote:

> Hey all,
>
> I saw in some of the discussions around DataSourceV2 writes that we might
> have the data source inform Spark of requirements for the input data's
> ordering and partitioning. Has there been a proposed API for that yet?
>
> Even one level up it would be helpful to understand how I should be
> thinking about the responsibility of the data source writer, when I should
> be inserting a custom catalyst rule, and how I should handle
> validation/assumptions of the table before attempting the write.
>
> Thanks!
> Pat
>



-- 
Ryan Blue
Software Engineer
Netflix
