Re: DataSourceV2 write input requirements
You're right. A global sort would change the clustering if it had more
fields than the clustering.
Then what about this: if there is no RequiredClustering, then the sort is a
global sort. If RequiredClustering is present, then the clustering is
applied and the sort is a partition-level sort.
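To make that rule concrete, here is a minimal sketch in Scala (RequiredClustering, RequiredSort, and the planWrite helper are hypothetical stand-ins, not the actual DataSourceV2 API):

// Hypothetical sketch of the rule above: a global sort is only safe
// when the source does not also pin down a clustering.
case class RequiredClustering(columns: Set[String])
case class RequiredSort(columns: Seq[String])

def planWrite(clustering: Option[RequiredClustering], sort: Option[RequiredSort]): String =
  (clustering, sort) match {
    // No clustering requirement: the sort can be a global (total) sort.
    case (None, Some(s))    => s"globalSort(${s.columns.mkString(", ")})"
    // Clustering present: shuffle by the clustering, then sort within partitions.
    case (Some(c), Some(s)) => s"cluster(${c.columns.mkString(", ")}) + sortWithinPartitions(${s.columns.mkString(", ")})"
    case (Some(c), None)    => s"cluster(${c.columns.mkString(", ")})"
    case (None, None)       => "no exchange needed"
  }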
Does that methodology work in this specific case? I thought the ordering must be a subset of the clustering to guarantee the rows end up in the same partition when doing a global sort. Though I get the gist that if the clustering is satisfied, then there is no reason not to choose the global sort.
> Can you expand on how the ordering containing the clustering expressions
> would ensure the global sort?
The idea was to basically assume that if the clustering can be satisfied by a global sort, then do the global sort. For example, if the clustering is Set("b", "a") and the sort is Seq("a", …
>
> Right, you could use this to store a global ordering if there is only one
> write (e.g., CTAS). I don’t think anything needs to change in that case,
> you would still have a clustering and an ordering, but the ordering would
> need to include all fields of the clustering. A way to pass in the …
@RyanBlue I'm hoping that through the CBO effort we will continue to get more detailed statistics. For example, on the read side we could be using sketch data structures to get estimates of unique values and density for each column. You may be right that the real way for this to be handled would be giving a "cost" …
Cassandra can insert records with the same partition-key faster if they
arrive in the same payload. But this is only beneficial if the incoming
dataset has multiple entries for the same partition key.
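For illustration, the kind of preparation described here can be done today with the public Dataset API (the input path and the partition_key column name are made up):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("cluster-before-write").getOrCreate()
val df = spark.read.parquet("/tmp/events")  // hypothetical input path

// Shuffle so all rows sharing a partition key land in the same Spark task,
// letting the Cassandra connector batch them into a single payload.
// Only worth the shuffle when many rows share each key.
val clustered = df
  .repartition(col("partition_key"))
  .sortWithinPartitions(col("partition_key"))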
Thanks for the example, the recommended partitioning use case makes more sense now. I think we …
Ah yeah sorry I got a bit mixed up.
On Wed, Mar 28, 2018 at 7:54 PM Ted Yu wrote:
bq. this shuffle could outweigh the benefits of the organized data if the
cardinality is lower.
I wonder if you meant higher in place of the last word above.
Cheers
On Wed, Mar 28, 2018 at 7:50 PM, Russell Spitzer wrote:
For added color, one thing that I may want to consider as a data source
implementer is the cost / benefit of applying a particular clustering. For
example, a dataset with low cardinality in the clustering key could benefit
greatly from clustering on that key before writing to Cassandra, since …
>
> Spark would always apply the required clustering and sort order because
> they are required by the data source. It is reasonable for a source to
> reject data that isn’t properly prepared. For example, data must be written
> to HTable files with keys in order or else the files are invalid.
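To illustrate "reject data that isn't properly prepared", a writer-side guard might look like this (a self-contained toy check, not the actual HBase/HFile writer):

// Toy writer-side guard: HFile-style formats require keys in sorted order,
// so an out-of-order key is a hard error rather than something to repair.
class SortedKeyWriter {
  private var lastKey: Option[String] = None

  def write(key: String, value: Array[Byte]): Unit = {
    lastKey.foreach { prev =>
      require(key >= prev, s"Rows must arrive sorted by key: saw '$key' after '$prev'")
    }
    lastKey = Some(key)
    // ... append (key, value) to the output file ...
  }
}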
How would Spark determine whether or not to apply a recommendation - a cost threshold? And yes, it would be good to flesh out what information we get from Spark in the data source when providing these recommendations/requirements - I could see statistics and the existing outputPartitioning/Ordering …
Thanks for the clarification; I'd definitely want to require Sort but only recommend partitioning ... I think that would be useful to request based on details about the incoming dataset.
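One hypothetical shape for that hand-off, purely as a sketch (the RecommendsClustering interface and this Statistics trait are invented, loosely modeled on the read-side stats reporting):

import java.util.OptionalLong

// Invented sketch: Spark hands the sink basic statistics about the incoming
// data, and the sink answers with a recommendation Spark is free to decline.
trait Statistics {
  def numRows: OptionalLong
  def sizeInBytes: OptionalLong
}

trait RecommendsClustering {
  // Return Some(columns) when the expected benefit outweighs the shuffle.
  def recommendClustering(stats: Statistics): Option[Seq[String]]
}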
On Tue, Mar 27, 2018 at 4:55 PM Ryan Blue wrote:
A required clustering would not, but a required sort would. Clustering is
asking for the input dataframe's partitioning, and sorting would be how
each partition is sorted.
On Tue, Mar 27, 2018 at 4:53 PM, Russell Spitzer wrote:
I forgot since it's been a while, but does Clustering support allow requesting that partitions contain elements in order as well? That would be a useful trick for me. I.e.
Request/Require(SortedOn(Col1))
Partition 1 -> ((A,1), (A,2), (B,1), (B,2), (C,1), (C,2))
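For what it's worth, that layout can be produced with the existing public API (assuming an active SparkSession named spark; the column names are made up):

import spark.implicits._  // assumes an active SparkSession named `spark`

val df = Seq(("A",1), ("B",1), ("C",2), ("A",2), ("C",1), ("B",2)).toDF("col1", "col2")
val prepared = df.repartition($"col1").sortWithinPartitions($"col1", $"col2")

// Inspect per-partition layout; rows with equal col1 are contiguous and sorted.
prepared.rdd.glom().collect().zipWithIndex.foreach { case (rows, i) =>
  println(s"Partition $i -> ${rows.mkString(", ")}")
}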
Thanks, it makes sense that the existing interface is for aggregation and not joins. Why are there requirements for the number of partitions that are returned, then?
Does it make sense to design the write-side `Requirement` classes and the read-side reporting separately?
Hi Ryan, yea you are right that SupportsReportPartitioning doesn't expose a hash function, so Join can't benefit from this interface, as Join doesn't require a general ClusteredDistribution but a more specific one called HashClusteredDistribution.
So currently only Aggregate can benefit from …
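A self-contained illustration of why the hash function matters here (toy hash functions, not Spark's internal ones): both sides of a join can be clustered on the join key yet disagree about which partition a given key lands in.

// Two sources, both clustered on the join key, but with different hash
// functions: equal keys may land in different partition indexes, so Spark
// cannot line the partitions up for a shuffle-free join.
val numPartitions = 4
def sourceAHash(key: String): Int = math.abs(key.hashCode) % numPartitions
def sourceBHash(key: String): Int = math.abs(key.map(_.toInt).sum) % numPartitions

val key = "user42"
println(sourceAHash(key) == sourceBHash(key))  // usually false: co-location broken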
I just took a look at SupportsReportPartitioning and I'm not sure that it
will work for real use cases. It doesn't specify, as far as I can tell, a
hash function for combining clusters into tasks or a way to provide Spark a
hash function for the other side of a join. It seems unlikely to me that …
Yea it is for read-side only. I think for the write-side, implementations
can provide some options to allow users to set partitioning/ordering, or
the data source has a natural partitioning/ordering which doesn't require
any interface.
On Mon, Mar 26, 2018 at 7:59 PM, Patrick Woody wrote:
Hey Ryan, Ted, Wenchen,
Thanks for the quick replies.
@Ryan - the sorting portion makes sense, but I think we'd have to ensure something similar to requiredChildDistribution in SparkPlan, where we have the number of partitions as well, if we'd want to further report to SupportsReportPartitioning …
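For reference, the internal hooks being alluded to have roughly this shape (a compilable paraphrase with stub types; the real Distribution and SortOrder live in Spark's catalyst package, so check the source for exact signatures):

// Stub types standing in for Spark's catalyst classes, just to make the
// shape of the hooks concrete.
trait Distribution
trait SortOrder

// Paraphrase of SparkPlan's planning hooks: an operator states what
// distribution and per-partition ordering it needs from each child, and the
// planner inserts shuffles/sorts to satisfy them.
abstract class PhysicalOp {
  def children: Seq[PhysicalOp]
  def requiredChildDistribution: Seq[Distribution]   // e.g. clustered by keys
  def requiredChildOrdering: Seq[Seq[SortOrder]]     // one ordering per child
}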
Hmm. Ryan seems to be right.
Looking at sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsReportPartitioning.java:
import org.apache.spark.sql.sources.v2.reader.partitioning.Partitioning;
...
Partitioning outputPartitioning();
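A reader that reports a clustered layout through that interface might look roughly like this (a sketch against the 2.3-era API shown above; the userId column is made up, and the exact signatures of numPartitions/satisfy should be checked against your Spark version):

import org.apache.spark.sql.sources.v2.reader.partitioning.{ClusteredDistribution, Distribution, Partitioning}

// Sketch: report that the scan's output is already clustered by `userId`
// across `numParts` partitions, so Spark may skip a shuffle for
// aggregations clustered on the same column.
class MyPartitioning(numParts: Int) extends Partitioning {
  override def numPartitions(): Int = numParts

  override def satisfy(distribution: Distribution): Boolean = distribution match {
    case c: ClusteredDistribution => c.clusteredColumns.sameElements(Array("userId"))
    case _ => false
  }
}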
Wenchen, I thought SupportsReportPartitioning was for the read side. Does it work with the write side as well?
On Mon, Mar 26, 2018 at 6:58 PM, Wenchen Fan wrote:
Actually clustering is already supported; please take a look at SupportsReportPartitioning.
Ordering is not proposed yet; it might be similar to what Ryan proposed.
On Mon, Mar 26, 2018 at 6:11 PM, Ted Yu wrote:
Interesting.
Should requiredClustering return a Set of Expressions?
That way, we can determine the order of the Expressions by looking at what requiredOrdering() returns.
On Mon, Mar 26, 2018 at 5:45 PM, Ryan Blue wrote:
Hi Pat,
Thanks for starting the discussion on this, we’re really interested in it as well. I don’t think there is a proposed API yet, but I was thinking something like this:
interface RequiresClustering {
  List<Expression> requiredClustering();
}
interface RequiresSort {
  List<SortOrder> requiredOrdering();
}
The …
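A sink combining the two proposed interfaces might look like this (a sketch using plain column-name strings in place of Spark's Expression and SortOrder; the EventSink table is made up):

// Sketch of a sink implementing the proposed interfaces, with stand-in
// string-based expressions instead of Spark's Expression/SortOrder.
trait RequiresClustering { def requiredClustering: Seq[String] }
trait RequiresSort       { def requiredOrdering: Seq[String] }

// A table that wants rows grouped by date, and sorted by (date, eventTime)
// within each task, before it will accept the write.
class EventSink extends RequiresClustering with RequiresSort {
  override def requiredClustering: Seq[String] = Seq("date")
  override def requiredOrdering: Seq[String]   = Seq("date", "eventTime")
}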
Hey all,
I saw in some of the discussions around DataSourceV2 writes that we might
have the data source inform Spark of requirements for the input data's
ordering and partitioning. Has there been a proposed API for that yet?
Even one level up, it would be helpful to understand how I should be …