Hey Ryan, Ted, Wenchen

Thanks for the quick replies.

@Ryan - the sorting portion makes sense, but I think we'd also need
something similar to requiredChildDistribution in SparkPlan, where we carry
the number of partitions as well, if we want to report that back through
SupportsReportPartitioning, yeah?
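
Roughly something like this, purely as a sketch (requiredNumPartitions is just
an illustrative name layered on Ryan's proposed interface below):

interface RequiresClustering {
  List<Expression> requiredClustering();

  // mirrors requiredChildDistribution plus numPartitions on SparkPlan, so the
  // same clustering and partition count could later be echoed back to readers
  // via SupportsReportPartitioning
  int requiredNumPartitions();
}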

Specifying an explicit global sort can also be useful for filtering on
Parquet row group stats if we have a time-based or high-cardinality ID field.
If my data source or catalog knows about previous queries on a table, it
could be really useful to recommend a more appropriate layout for consumers
on the next materialization. The same would be true of clustering on commonly
joined fields.
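
For example (the table, column, and path names here are made up), something
like:

// Illustrative only: a global sort on the timestamp column before the write
// means each Parquet row group covers a narrow time range, so the row group
// min/max stats let later range filters on event_time skip whole groups.
Dataset<Row> df = spark.table("events");
df.orderBy(functions.col("event_time"))
  .write()
  .parquet("/tmp/events_sorted");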

Thanks again
Pat



On Mon, Mar 26, 2018 at 10:05 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> Hmm. Ryan seems to be right.
>
> Looking at sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/
> SupportsReportPartitioning.java :
>
> import org.apache.spark.sql.sources.v2.reader.partitioning.Partitioning;
> ...
>   Partitioning outputPartitioning();
>
> On Mon, Mar 26, 2018 at 6:58 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> Actually clustering is already supported, please take a look at
>> SupportsReportPartitioning
>>
>> Ordering is not proposed yet, might be similar to what Ryan proposed.
>>
>> On Mon, Mar 26, 2018 at 6:11 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> Interesting.
>>>
>>> Should requiredClustering return a Set of Expressions?
>>> That way, we could determine the order of the Expressions by looking at
>>> what requiredOrdering() returns.
>>>
>>> On Mon, Mar 26, 2018 at 5:45 PM, Ryan Blue <rb...@netflix.com.invalid>
>>> wrote:
>>>
>>>> Hi Pat,
>>>>
>>>> Thanks for starting the discussion on this, we’re really interested in
>>>> it as well. I don’t think there is a proposed API yet, but I was thinking
>>>> something like this:
>>>>
>>>> interface RequiresClustering {
>>>>   List<Expression> requiredClustering();
>>>> }
>>>>
>>>> interface RequiresSort {
>>>>   List<SortOrder> requiredOrdering();
>>>> }
>>>>
>>>> The reason why RequiresClustering should provide Expression is that the
>>>> source needs to be able to customize the implementation. For example, writing to
>>>> HTable would require building a key (or the data for a key) and that might
>>>> use a hash function that differs from Spark’s built-ins. RequiresSort
>>>> is fairly straightforward, but the interaction between the two requirements
>>>> deserves some consideration. To make the two compatible, I think that
>>>> RequiresSort must be interpreted as a sort within each partition of
>>>> the clustering, but could possibly be used for a global sort when the two
>>>> overlap.
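>>>>
>>>> Purely as an illustrative sketch (the HTable class and field names here
>>>> are made up), a sink might then implement both like:
>>>>
>>>> class HTableWriter implements RequiresClustering, RequiresSort {
>>>>   private final Expression htableKey;     // custom key-building expression
>>>>   private final List<SortOrder> keyOrder; // ascending sort on that key
>>>>
>>>>   HTableWriter(Expression htableKey, List<SortOrder> keyOrder) {
>>>>     this.htableKey = htableKey;
>>>>     this.keyOrder = keyOrder;
>>>>   }
>>>>
>>>>   // cluster rows by the sink's own key expression rather than Spark's
>>>>   // built-in hash partitioning
>>>>   public List<Expression> requiredClustering() {
>>>>     return Collections.singletonList(htableKey);
>>>>   }
>>>>
>>>>   // and sort within each clustered partition by that same key
>>>>   public List<SortOrder> requiredOrdering() {
>>>>     return keyOrder;
>>>>   }
>>>> }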
>>>>
>>>> For example, if I have a table partitioned by “day” and “category”, then
>>>> the required clustering would be by day, category. A required sort
>>>> might be day ASC, category DESC, name ASC. Because that sort satisfies
>>>> the required clustering, it could be used for a global ordering. But, is
>>>> that useful? How would the global ordering matter beyond a sort within each
>>>> partition, i.e., how would the partition’s place in the global ordering be
>>>> passed?
>>>>
>>>> To your other questions, you might want to have a look at the recent
>>>> SPIP I’m working on to consolidate and clean up logical plans
>>>> <https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5a987801#heading=h.m45webtwxf2d>.
>>>> That proposes more specific uses for the DataSourceV2 API that should help
>>>> clarify what validation needs to take place. As for custom catalyst rules,
>>>> I’d like to hear about the use cases to see if we can build it into these
>>>> improvements.
>>>>
>>>> rb
>>>>
>>>> On Mon, Mar 26, 2018 at 8:40 AM, Patrick Woody <
>>>> patrick.woo...@gmail.com> wrote:
>>>>
>>>>> Hey all,
>>>>>
>>>>> I saw in some of the discussions around DataSourceV2 writes that we
>>>>> might have the data source inform Spark of requirements for the input
>>>>> data's ordering and partitioning. Has there been a proposed API for that
>>>>> yet?
>>>>>
>>>>> Even one level up, it would be helpful to understand how I should be
>>>>> thinking about the responsibility of the data source writer, when I should
>>>>> be inserting a custom catalyst rule, and how I should handle
>>>>> validation/assumptions of the table before attempting the write.
>>>>>
>>>>> Thanks!
>>>>> Pat
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>>
>>
>
