Re: DataSourceV2 write input requirements

2018-04-06 Thread Ryan Blue
> You're right. A global sort would change the clustering...

Re: DataSourceV2 write input requirements

2018-04-01 Thread Patrick Woody
> You're right. A global sort would change the clustering...

Re: DataSourceV2 write input requirements

2018-03-30 Thread Ted Yu
> You're right. A global sort would change the clustering if it had more fields than the clustering. Then what about this: if there is no RequiredClustering, then the sort is a global sort...

Re: DataSourceV2 write input requirements

2018-03-30 Thread Ryan Blue
You're right. A global sort would change the clustering if it had more fields than the clustering. Then what about this: if there is no RequiredClustering, then the sort is a global sort. If RequiredClustering is present, then the clustering is applied and the sort is a partition-level sort.
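
A compact sketch of that rule against today's public DataFrame API (the WritePrep class and the use of plain column names in place of catalyst Expressions are illustrative assumptions, not the proposed interface):

    import java.util.List;
    import org.apache.spark.sql.Column;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import static org.apache.spark.sql.functions.col;

    class WritePrep {
      // No required clustering: satisfy the required sort globally.
      // Required clustering present: shuffle by it, then sort each
      // partition on the required ordering.
      static Dataset<Row> prepare(Dataset<Row> df, List<String> clustering, List<String> ordering) {
        Column[] sortCols = ordering.stream().map(c -> col(c)).toArray(Column[]::new);
        if (clustering.isEmpty()) {
          return df.orderBy(sortCols);  // global, range-partitioned sort
        }
        Column[] clusterCols = clustering.stream().map(c -> col(c)).toArray(Column[]::new);
        return df.repartition(clusterCols).sortWithinPartitions(sortCols);
      }
    }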

Re: DataSourceV2 write input requirements

2018-03-30 Thread Patrick Woody
Does that methodology work in this specific case? I thought the ordering must be a subset of the clustering to guarantee rows end up in the same partition when doing a global sort. Though I get the gist that if the sort does satisfy the clustering, there is no reason not to choose the global sort.

Re: DataSourceV2 write input requirements

2018-03-30 Thread Ryan Blue
> Can you expand on how the ordering containing the clustering expressions would ensure the global sort? The idea was to basically assume that if the clustering can be satisfied by a global sort, then do the global sort. For example, if the clustering is Set("b", "a") and the sort is Seq("a", "b", ...)
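
The check being proposed here can be sketched as a prefix test, with plain strings standing in for Expressions (the ClusteringCheck class and its method name are made up for illustration; the replies above refine this idea for sorts that have more fields than the clustering):

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    class ClusteringCheck {
      // True when the first clustering.size() sort columns equal the
      // clustering as a set, e.g. clustering {"b","a"} with sort
      // ("a","b","c") passes because the leading columns are {"a","b"}.
      static boolean satisfiedByGlobalSort(Set<String> clustering, List<String> sort) {
        return sort.size() >= clustering.size()
            && new HashSet<>(sort.subList(0, clustering.size())).equals(clustering);
      }
    }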

Re: DataSourceV2 write input requirements

2018-03-30 Thread Patrick Woody
> Right, you could use this to store a global ordering if there is only one write (e.g., CTAS). I don't think anything needs to change in that case; you would still have a clustering and an ordering, but the ordering would need to include all fields of the clustering. A way to pass in the...

Re: DataSourceV2 write input requirements

2018-03-29 Thread Russell Spitzer
@RyanBlue I'm hoping that through the CBO effort we will continue to get more detailed statistics. For example, on read we could use sketch data structures to estimate unique values and density for each column. You may be right that the real way for this to be handled would be giving a "cost"...

Re: DataSourceV2 write input requirements

2018-03-29 Thread Ryan Blue
Cassandra can insert records with the same partition key faster if they arrive in the same payload. But this is only beneficial if the incoming dataset has multiple entries for the same partition key. Thanks for the example; the recommended partitioning use case makes more sense now. I think we...
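
A rough sketch of the batching benefit described here: group incoming rows by partition key so each key can be written as one payload (the BatchingWriter class is illustrative; this is not the Cassandra connector's API):

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    class BatchingWriter {
      // Collect rows that share a partition key into a single payload;
      // this only pays off when the input actually has multiple rows
      // per key, which is what clustering the input guarantees.
      static Map<String, List<String>> batchByKey(List<String[]> rows) {
        Map<String, List<String>> batches = new LinkedHashMap<>();
        for (String[] row : rows) {  // row[0] = partition key, row[1] = value
          batches.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        return batches;
      }
    }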

Re: DataSourceV2 write input requirements

2018-03-28 Thread Russell Spitzer
Ah yeah, sorry, I got a bit mixed up.

Re: DataSourceV2 write input requirements

2018-03-28 Thread Ted Yu
bq. this shuffle could outweigh the benefits of the organized data if the cardinality is lower. I wonder if you meant higher in place of the last word above. Cheers

Re: DataSourceV2 write input requirements

2018-03-28 Thread Russell Spitzer
For added color, one thing that I may want to consider as a data source implementer is the cost/benefit of applying a particular clustering. For example, a dataset with low cardinality in the clustering key could benefit greatly from clustering on that key before writing to Cassandra, since...

Re: DataSourceV2 write input requirements

2018-03-28 Thread Patrick Woody
> Spark would always apply the required clustering and sort order because they are required by the data source. It is reasonable for a source to reject data that isn't properly prepared. For example, data must be written to HTable files with keys in order or else the files are invalid.

Re: DataSourceV2 write input requirements

2018-03-28 Thread Ryan Blue
> How would Spark determine whether or not to apply a recommendation - a cost threshold? Spark would always apply the required clustering and sort order because they are required by the data source. It is reasonable for a source to reject data that isn't properly prepared. For example, data must be...
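
The HTable example can be made concrete as a writer that rejects out-of-order keys, which is the sense in which a source may reject input that was not prepared as required (the SortedKeyWriter class is illustrative, not HBase's API):

    class SortedKeyWriter {
      private String lastKey;

      // The file format requires ascending keys, so out-of-order input
      // is rejected instead of producing an invalid file.
      void write(String key, String value) {
        if (lastKey != null && lastKey.compareTo(key) > 0) {
          throw new IllegalStateException(
              "keys must arrive in ascending order: " + key + " after " + lastKey);
        }
        lastKey = key;
        // ... append the record to the file
      }
    }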

Re: DataSourceV2 write input requirements

2018-03-28 Thread Patrick Woody
How would Spark determine whether or not to apply a recommendation - a cost threshold? And yes, it would be good to flesh out what information we get from Spark in the datasource when providing these recommendations/requirements - I could see statistics and the existing outputPartitioning/Ordering

Re: DataSourceV2 write input requirements

2018-03-27 Thread Russell Spitzer
Thanks for the clarification; definitely would want to require Sort but only recommend partitioning... I think that would be useful to request based on details about the incoming dataset.

Re: DataSourceV2 write input requirements

2018-03-27 Thread Ryan Blue
A required clustering would not, but a required sort would. Clustering is asking for the input dataframe's partitioning, and sorting would be how each partition is sorted.

Re: DataSourceV2 write input requirements

2018-03-27 Thread Russell Spitzer
I forgot since it's been a while, but does Clustering support allow requesting that partitions contain elements in order as well? That would be a useful trick for me, i.e. Request/Require(SortedOn(Col1)): Partition 1 -> ((A,1), (A,2), (B,1), (B,2), (C,1), (C,2))
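
One way to get the layout sketched here with the existing DataFrame API (Request/Require(SortedOn(Col1)) is shorthand from the message, not an existing class):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import static org.apache.spark.sql.functions.col;

    // assuming df is a Dataset<Row> with columns Col1 and Col2: rows
    // with equal Col1 share a partition and each partition is sorted,
    // giving partitions like ((A,1), (A,2), (B,1), (B,2), (C,1), (C,2))
    Dataset<Row> prepared = df.repartition(col("Col1"))
                              .sortWithinPartitions(col("Col1"), col("Col2"));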

Re: DataSourceV2 write input requirements

2018-03-27 Thread Ryan Blue
Thanks, it makes sense that the existing interface is for aggregation and not joins. Why are there requirements for the number of partitions that are returned, then? Does it make sense to design the write-side `Requirement` classes and the read-side reporting separately?

Re: DataSourceV2 write input requirements

2018-03-27 Thread Wenchen Fan
Hi Ryan, yes, you are right that SupportsReportPartitioning doesn't expose the hash function, so Join can't benefit from this interface: Join doesn't require a general ClusteredDistribution but a more specific one called HashClusteredDistribution. So currently only Aggregate can benefit from...
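
A toy illustration of why the distinction matters for joins (the hash below is made up for the example, not Spark's internal one):

    class JoinCoPartitioning {
      // A general ClusteredDistribution only promises that equal keys
      // share a partition under SOME partitioning function; a join needs
      // both sides placed by the SAME function, or matching keys from
      // the two sides land in different partitions.
      static int partitionOf(Object key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
      }
    }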

Re: DataSourceV2 write input requirements

2018-03-27 Thread Ryan Blue
I just took a look at SupportsReportPartitioning and I'm not sure that it will work for real use cases. It doesn't specify, as far as I can tell, a hash function for combining clusters into tasks or a way to provide Spark a hash function for the other side of a join. It seems unlikely to me that

Re: DataSourceV2 write input requirements

2018-03-26 Thread Wenchen Fan
Yes, it is for the read side only. I think for the write side, implementations can provide some options to allow users to set partitioning/ordering, or the data source has a natural partitioning/ordering that doesn't require any interface.

Re: DataSourceV2 write input requirements

2018-03-26 Thread Patrick Woody
Hey Ryan, Ted, Wenchen, thanks for the quick replies. @Ryan: the sorting portion makes sense, but I think we'd have to ensure something similar to requiredChildDistribution in SparkPlan, where we also have the number of partitions, if we'd want to further report to SupportsReportPartitioning...

Re: DataSourceV2 write input requirements

2018-03-26 Thread Ted Yu
Hmm, Ryan seems to be right. Looking at sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsReportPartitioning.java: import org.apache.spark.sql.sources.v2.reader.partitioning.Partitioning; ... Partitioning outputPartitioning();

Re: DataSourceV2 write input requirements

2018-03-26 Thread Ryan Blue
Wenchen, I thought SupportsReportPartitioning was for the read side. Does it work with the write side as well?

Re: DataSourceV2 write input requirements

2018-03-26 Thread Wenchen Fan
Actually, clustering is already supported; please take a look at SupportsReportPartitioning. Ordering is not proposed yet; it might be similar to what Ryan proposed.

Re: DataSourceV2 write input requirements

2018-03-26 Thread Ted Yu
Interesting. Should requiredClustering return a Set of Expressions? That way, we can determine the order of the Expressions by looking at what requiredOrdering() returns.

Re: DataSourceV2 write input requirements

2018-03-26 Thread Ryan Blue
Hi Pat, thanks for starting the discussion on this; we're really interested in it as well. I don't think there is a proposed API yet, but I was thinking something like this:

    interface RequiresClustering {
      List<Expression> requiredClustering();
    }

    interface RequiresSort {
      List<SortOrder> requiredOrdering();
    }

The...
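
A minimal sketch of a sink adopting the two proposed interfaces; the LogSink class and the column/asc factory calls are assumptions for illustration, with Expression and SortOrder as in the sketch above:

    import java.util.Arrays;
    import java.util.List;

    class LogSink implements RequiresClustering, RequiresSort {
      // Ask Spark to bring rows with the same "key" together ...
      @Override
      public List<Expression> requiredClustering() {
        return Arrays.asList(column("key"));         // column(...) is a stand-in factory
      }

      // ... and to sort each partition by ("key", "ts").
      @Override
      public List<SortOrder> requiredOrdering() {
        return Arrays.asList(asc("key"), asc("ts")); // asc(...) is a stand-in factory
      }
    }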

DataSourceV2 write input requirements

2018-03-26 Thread Patrick Woody
Hey all, I saw in some of the discussions around DataSourceV2 writes that we might have the data source inform Spark of requirements for the input data's ordering and partitioning. Has there been a proposed API for that yet? Even one level up, it would be helpful to understand how I should be...