Does that methodology work in this specific case? I thought the ordering used
for the range partitioning had to be a subset of the clustering to guarantee
that rows with the same clustering values land in the same partition when
doing a global sort. Though I get the gist: if the global sort does satisfy
the clustering, then there is no reason not to choose it.

On Fri, Mar 30, 2018 at 1:31 PM, Ryan Blue <rb...@netflix.com> wrote:

> > Can you expand on how the ordering containing the clustering expressions
> would ensure the global sort?
>
> The idea was to basically assume that if the clustering can be satisfied
> by a global sort, then do the global sort. For example, if the clustering
> is Set("b", "a") and the sort is Seq("a", "b", "c") then do a global sort
> by columns a, b, and c.
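>
> Roughly, the check I have in mind (just a sketch, not planner code; the
> helper name is made up):
>
> def chooseWritePlan(clustering: Set[String], sort: Seq[String]): String = {
>   // If every clustering column is covered by the requested sort order,
>   // a single range-partitioned (global) sort satisfies both at once.
>   if (clustering.subsetOf(sort.toSet)) {
>     s"global sort by ${sort.mkString(", ")}"
>   } else {
>     s"hash partition by ${clustering.mkString(", ")}, then sort within each partition"
>   }
> }
>
> // chooseWritePlan(Set("b", "a"), Seq("a", "b", "c"))  => "global sort by a, b, c"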
>
> Technically, you could do this with a hash partitioner instead of a range
> partitioner and sort within each partition, but that doesn't make much
> sense because the partitioning would ensure that each partition has just
> one combination of the required clustering columns. Using a hash
> partitioner would mean that the in-partition sort effectively ignores the
> leading sort columns, so the intent must have been a global sort.
>
> On Fri, Mar 30, 2018 at 6:51 AM, Patrick Woody <patrick.woo...@gmail.com>
> wrote:
>
>> Right, you could use this to store a global ordering if there is only one
>>> write (e.g., CTAS). I don’t think anything needs to change in that case,
>>> you would still have a clustering and an ordering, but the ordering would
>>> need to include all fields of the clustering. A way to pass in the
>>> partition ordinal for the source to store would be required.
>>
>>
>> Can you expand on how the ordering containing the clustering expressions
>> would ensure the global sort? Having a RangePartitioning would certainly
>> satisfy it, but it isn't required - is the suggestion that if Spark sees this
>> overlap, then it plans a global sort?
>>
>> On Thu, Mar 29, 2018 at 12:16 PM, Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>
>>> @RyanBlue I'm hoping that through the CBO effort we will continue to get
>>> more detailed statistics. For example, on read we could use sketch data
>>> structures to get estimates of unique values and density for each column.
>>> You may be right that the real way to handle this would be giving a
>>> "cost" back to a higher-level optimizer, which can decide which method to
>>> use rather than having the data source itself do it. This is probably for a
>>> far-future version of the API.
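>>>
>>> As a concrete example of the kind of estimate I mean, Spark's existing
>>> HyperLogLog++-based aggregate already gives a cheap distinct-value count
>>> (just an illustration; df and the column names are placeholders):
>>>
>>> import org.apache.spark.sql.functions.approx_count_distinct
>>>
>>> // Approximate number of distinct values per column, usable as input to a
>>> // cost-based decision about whether a shuffle is worth it.
>>> val ndv = df.agg(
>>>   approx_count_distinct("id").as("id_ndv"),
>>>   approx_count_distinct("category").as("category_ndv"))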
>>>
>>> On Thu, Mar 29, 2018 at 9:10 AM Ryan Blue <rb...@netflix.com> wrote:
>>>
>>>> Cassandra can insert records with the same partition-key faster if they
>>>> arrive in the same payload. But this is only beneficial if the incoming
>>>> dataset has multiple entries for the same partition key.
>>>>
>>>> Thanks for the example, the recommended partitioning use case makes
>>>> more sense now. I think we could have two interfaces, a
>>>> RequiresClustering and a RecommendsClustering if we want to support
>>>> this. But I’m skeptical it will be useful for two reasons:
>>>>
>>>>    - Do we want to optimize the low-cardinality case? Shuffles are
>>>>    usually much cheaper at smaller sizes, so I’m not sure it is necessary
>>>>    to optimize this away.
>>>>    - How do we know there aren’t just a few partition keys for all the
>>>>    records? It may look like a shuffle wouldn’t help, but we don’t know the
>>>>    partition keys until it is too late.
>>>>
>>>> Then there’s also the logic for avoiding the shuffle and how to
>>>> calculate the cost, which sounds like something that needs some details
>>>> from CBO.
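>>>>
>>>> If we did support it, the pair could be as small as this (a sketch only,
>>>> mirroring the earlier interface idea; RecommendsClustering is hypothetical):
>>>>
>>>> import org.apache.spark.sql.catalyst.expressions.Expression
>>>>
>>>> trait RequiresClustering {
>>>>   def requiredClustering(): Seq[Expression]
>>>> }
>>>>
>>>> // Same shape, but Spark would be free to ignore the recommendation when
>>>> // the shuffle isn't worth the cost.
>>>> trait RecommendsClustering {
>>>>   def recommendedClustering(): Seq[Expression]
>>>> }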
>>>>
>>>> I would assume that given the estimated data size from Spark and
>>>> options passed in from the user, the data source could make a more
>>>> intelligent requirement on the write format than Spark independently.
>>>>
>>>> This is a good point.
>>>>
>>>> What would an implementation actually do here and how would information
>>>> be passed? For my use cases, the store would produce the number of tasks
>>>> based on the estimated incoming rows, because the source has the best idea
>>>> of how the rows will compress. But, that’s just applying a multiplier most
>>>> of the time. To be very useful, this would have to handle skew in the rows
>>>> (think of rows with a type field where the total size depends on the type), and that’s a bit
>>>> harder. I think maybe an interface that can provide relative cost estimates
>>>> based on partition keys would be helpful, but then keep the planning logic
>>>> in Spark.
>>>>
>>>> This is probably something that we could add later as we find use cases
>>>> that require it?
>>>>
>>>> I wouldn’t assume that a data source requiring a certain write format
>>>> would give any guarantees around reading the same data? In the cases where
>>>> it is a complete overwrite it would, but for independent writes it could
>>>> still be useful for statistics or compression.
>>>>
>>>> Right, you could use this to store a global ordering if there is only
>>>> one write (e.g., CTAS). I don’t think anything needs to change in that
>>>> case, you would still have a clustering and an ordering, but the ordering
>>>> would need to include all fields of the clustering. A way to pass in the
>>>> partition ordinal for the source to store would be required.
>>>>
>>>> For the second point that ordering is useful for statistics and
>>>> compression, I completely agree. Our best practices doc tells users to
>>>> always add a global sort when writing because you get the benefit of a
>>>> range partitioner to handle skew, plus the stats and compression you’re
>>>> talking about to optimize for reads. I think the proposed API can request a
>>>> global ordering from Spark already. My only point is that there isn’t much
>>>> the source can do to guarantee ordering for reads when there is more than
>>>> one write.
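>>>>
>>>> Concretely, the pattern we recommend is roughly this (a sketch; df, the
>>>> columns, and the path are placeholders):
>>>>
>>>> import org.apache.spark.sql.functions.col
>>>>
>>>> // A global sort before the write: Spark samples the sort keys and range
>>>> // partitions the rows, which spreads skewed keys across tasks and leaves
>>>> // each output file with tight min/max stats for read-side filtering.
>>>> df.sort(col("day"), col("category"), col("name"))
>>>>   .write
>>>>   .format("parquet")
>>>>   .save("/path/to/table")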
>>>>
>>>> On Wed, Mar 28, 2018 at 7:14 PM, Patrick Woody <
>>>> patrick.woo...@gmail.com> wrote:
>>>>
>>>>> Spark would always apply the required clustering and sort order
>>>>>> because they are required by the data source. It is reasonable for a 
>>>>>> source
>>>>>> to reject data that isn’t properly prepared. For example, data must be
>>>>>> written to HTable files with keys in order or else the files are invalid.
>>>>>> Sorting should not be implemented in the sources themselves because Spark
>>>>>> handles concerns like spilling to disk. Spark must prepare data 
>>>>>> correctly,
>>>>>> which is why the interfaces start with “Requires”.
>>>>>
>>>>>
>>>>> This was in reference to Russell's suggestion that the data source
>>>>> could have a required sort, but only a recommended partitioning. No
>>>>> immediate use case for a recommendation comes to mind, though. I'm
>>>>> definitely in sync that the data source itself shouldn't do work outside
>>>>> of the writes themselves.
>>>>>
>>>>> Considering the second use case you mentioned first, I don’t think it
>>>>>> is a good idea for a table to put requirements on the number of tasks 
>>>>>> used
>>>>>> for a write. The parallelism should be set appropriately for the data
>>>>>> volume, which is for Spark or the user to determine. A minimum or maximum
>>>>>> number of tasks could cause bad behavior.
>>>>>
>>>>>
>>>>> For your first use case, an explicit global ordering, the problem is
>>>>>> that there can’t be an explicit global ordering for a table when it is
>>>>>> populated by a series of independent writes. Each write could have a 
>>>>>> global
>>>>>> order, but once those files are written, you have to deal with multiple
>>>>>> sorted data sets. I think it makes sense to focus on order within data
>>>>>> files, not order between data files.
>>>>>
>>>>>
>>>>> This is where I'm interested in learning about the separation of
>>>>> responsibilities for the data source and how "smart" it is supposed to be.
>>>>>
>>>>> For the first part, I would assume that given the estimated data size
>>>>> from Spark and options passed in from the user, the data source could make
>>>>> a more intelligent requirement on the write format than Spark
>>>>> independently. Somewhat analogous to how the current FileSource does bin
>>>>> packing of small files on the read side, restricting parallelism for the
>>>>> sake of overhead.
>>>>>
>>>>> For the second, I wouldn't assume that a data source requiring a
>>>>> certain write format would give any guarantees around reading the same
>>>>> data? In the cases where it is a complete overwrite it would, but for
>>>>> independent writes it could still be useful for statistics or compression.
>>>>>
>>>>> Thanks
>>>>> Pat
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Mar 28, 2018 at 8:28 PM, Ryan Blue <rb...@netflix.com> wrote:
>>>>>
>>>>>> How would Spark determine whether or not to apply a recommendation -
>>>>>> a cost threshold?
>>>>>>
>>>>>> Spark would always apply the required clustering and sort order
>>>>>> because they are required by the data source. It is reasonable for a 
>>>>>> source
>>>>>> to reject data that isn’t properly prepared. For example, data must be
>>>>>> written to HTable files with keys in order or else the files are invalid.
>>>>>> Sorting should not be implemented in the sources themselves because Spark
>>>>>> handles concerns like spilling to disk. Spark must prepare data 
>>>>>> correctly,
>>>>>> which is why the interfaces start with “Requires”.
>>>>>>
>>>>>> I’m not sure what the second half of your question means. What does
>>>>>> Spark need to pass into the data source?
>>>>>>
>>>>>> Should a datasource be able to provide a Distribution proper rather
>>>>>> than just the clustering expressions? Two use cases would be for explicit
>>>>>> global sorting of the dataset and attempting to ensure a minimum write 
>>>>>> task
>>>>>> size/number of write tasks.
>>>>>>
>>>>>> Considering the second use case you mentioned first, I don’t think it
>>>>>> is a good idea for a table to put requirements on the number of tasks 
>>>>>> used
>>>>>> for a write. The parallelism should be set appropriately for the data
>>>>>> volume, which is for Spark or the user to determine. A minimum or maximum
>>>>>> number of tasks could cause bad behavior.
>>>>>>
>>>>>> That said, I think there is a related use case for sharding. But
>>>>>> that’s really just a clustering by an expression with the shard
>>>>>> calculation, e.g., hash(id_col, 64). The shards should be handled as
>>>>>> a cluster, but it doesn’t matter how many tasks are used for it.
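>>>>>>
>>>>>> In DataFrame terms that clustering is roughly the following (illustrative
>>>>>> only; df and id_col are placeholders):
>>>>>>
>>>>>> import org.apache.spark.sql.functions.{col, hash, lit, pmod}
>>>>>>
>>>>>> // Cluster rows by a 64-way shard of id_col: all rows for a shard are
>>>>>> // co-located, but how many tasks carry them is up to Spark.
>>>>>> val sharded = df.repartition(pmod(hash(col("id_col")), lit(64)))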
>>>>>>
>>>>>> For your first use case, an explicit global ordering, the problem is
>>>>>> that there can’t be an explicit global ordering for a table when it is
>>>>>> populated by a series of independent writes. Each write could have a 
>>>>>> global
>>>>>> order, but once those files are written, you have to deal with multiple
>>>>>> sorted data sets. I think it makes sense to focus on order within data
>>>>>> files, not order between data files.
>>>>>>
>>>>>> On Wed, Mar 28, 2018 at 7:26 AM, Patrick Woody <
>>>>>> patrick.woo...@gmail.com> wrote:
>>>>>>
>>>>>>> How would Spark determine whether or not to apply a recommendation -
>>>>>>> a cost threshold? And yes, it would be good to flesh out what 
>>>>>>> information
>>>>>>> we get from Spark in the datasource when providing these
>>>>>>> recommendations/requirements - I could see statistics and the existing
>>>>>>> outputPartitioning/Ordering of the child plan being used for providing 
>>>>>>> the
>>>>>>> requirement.
>>>>>>>
>>>>>>> Should a datasource be able to provide a Distribution proper rather
>>>>>>> than just the clustering expressions? Two use cases would be for 
>>>>>>> explicit
>>>>>>> global sorting of the dataset and attempting to ensure a minimum write 
>>>>>>> task
>>>>>>> size/number of write tasks.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Mar 27, 2018 at 7:59 PM, Russell Spitzer <
>>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks for the clarification. I would definitely want to require Sort
>>>>>>>> but only recommend partitioning ... I think it would be useful to make
>>>>>>>> that request based on details about the incoming dataset.
>>>>>>>>
>>>>>>>> On Tue, Mar 27, 2018 at 4:55 PM Ryan Blue <rb...@netflix.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> A required clustering would not, but a required sort would.
>>>>>>>>> Clustering is asking for the input dataframe's partitioning, and 
>>>>>>>>> sorting
>>>>>>>>> would be how each partition is sorted.
>>>>>>>>>
>>>>>>>>> On Tue, Mar 27, 2018 at 4:53 PM, Russell Spitzer <
>>>>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I forgot since it's been a while, but does Clustering support
>>>>>>>>>> allow requesting that partitions contain elements in order as well? 
>>>>>>>>>> That
>>>>>>>>>> would be a useful trick for me, i.e.,
>>>>>>>>>> Request/Require(SortedOn(Col1))
>>>>>>>>>> Partition 1 -> ((A,1), (A,2), (B,1), (B,2), (C,1), (C,2))
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 27, 2018 at 4:38 PM Ryan Blue
>>>>>>>>>> <rb...@netflix.com.invalid> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks, it makes sense that the existing interface is for
>>>>>>>>>>> aggregation and not joins. Why are there requirements for the 
>>>>>>>>>>> number of
>>>>>>>>>>> partitions that are returned then?
>>>>>>>>>>>
>>>>>>>>>>> Does it make sense to design the write-side `Requirement`
>>>>>>>>>>> classes and the read-side reporting separately?
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Mar 27, 2018 at 3:56 PM, Wenchen Fan <
>>>>>>>>>>> cloud0...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Ryan, yea you are right that SupportsReportPartitioning
>>>>>>>>>>>> doesn't expose hash function, so Join can't benefit from this 
>>>>>>>>>>>> interface, as
>>>>>>>>>>>> Join doesn't require a general ClusteredDistribution, but a more 
>>>>>>>>>>>> specific
>>>>>>>>>>>> one called HashClusteredDistribution.
>>>>>>>>>>>>
>>>>>>>>>>>> So currently only Aggregate can benefit from
>>>>>>>>>>>> SupportsReportPartitioning and save shuffle. We can add a new 
>>>>>>>>>>>> interface to
>>>>>>>>>>>> expose the hash function to make it work for Join.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Mar 27, 2018 at 9:33 AM, Ryan Blue <rb...@netflix.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I just took a look at SupportsReportPartitioning and I'm not
>>>>>>>>>>>>> sure that it will work for real use cases. It doesn't specify, as 
>>>>>>>>>>>>> far as I
>>>>>>>>>>>>> can tell, a hash function for combining clusters into tasks or a 
>>>>>>>>>>>>> way to
>>>>>>>>>>>>> provide Spark a hash function for the other side of a join. It 
>>>>>>>>>>>>> seems
>>>>>>>>>>>>> unlikely to me that many data sources would have partitioning 
>>>>>>>>>>>>> that happens
>>>>>>>>>>>>> to match the other side of a join. And, it looks like task order 
>>>>>>>>>>>>> matters?
>>>>>>>>>>>>> Maybe I'm missing something?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think that we should design the write side independently
>>>>>>>>>>>>> based on what data stores actually need, and take a look at the 
>>>>>>>>>>>>> read side
>>>>>>>>>>>>> based on what data stores can actually provide. Wenchen, was 
>>>>>>>>>>>>> there a design
>>>>>>>>>>>>> doc for partitioning on the read path?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I completely agree with your point about a global sort. We
>>>>>>>>>>>>> recommend to all of our data engineers to add a sort to most 
>>>>>>>>>>>>> tables because
>>>>>>>>>>>>> it introduces the range partitioner and does a skew calculation, 
>>>>>>>>>>>>> in
>>>>>>>>>>>>> addition to making data filtering much better when it is read. 
>>>>>>>>>>>>> It's really
>>>>>>>>>>>>> common for tables to be skewed by partition values.
>>>>>>>>>>>>>
>>>>>>>>>>>>> rb
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Mar 26, 2018 at 7:59 PM, Patrick Woody <
>>>>>>>>>>>>> patrick.woo...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hey Ryan, Ted, Wenchen
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for the quick replies.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> @Ryan - the sorting portion makes sense, but if we want to further
>>>>>>>>>>>>>> report to SupportsReportPartitioning, I think we'd need something
>>>>>>>>>>>>>> similar to requiredChildDistribution in SparkPlan, where we also
>>>>>>>>>>>>>> have the number of partitions, yeah?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Specifying an explicit global sort can also be useful for
>>>>>>>>>>>>>> filtering purposes on Parquet row group stats if we have a
>>>>>>>>>>>>>> time-based/high-cardinality ID field. If my datasource or catalog
>>>>>>>>>>>>>> knows about previous queries on a table, it could be really useful
>>>>>>>>>>>>>> to recommend more appropriate formatting for consumers on the next
>>>>>>>>>>>>>> materialization. The same would be true of clustering on commonly
>>>>>>>>>>>>>> joined fields.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks again
>>>>>>>>>>>>>> Pat
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Mar 26, 2018 at 10:05 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hmm. Ryan seems to be right.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Looking at sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsReportPartitioning.java:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> import org.apache.spark.sql.sources.v2.reader.partitioning.Partitioning;
>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>   Partitioning outputPartitioning();
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Mar 26, 2018 at 6:58 PM, Wenchen Fan <
>>>>>>>>>>>>>>> cloud0...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Actually clustering is already supported, please take a
>>>>>>>>>>>>>>>> look at SupportsReportPartitioning
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Ordering is not proposed yet, might be similar to what Ryan
>>>>>>>>>>>>>>>> proposed.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Mar 26, 2018 at 6:11 PM, Ted Yu <
>>>>>>>>>>>>>>>> yuzhih...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Interesting.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Should requiredClustering return a Set of Expressions?
>>>>>>>>>>>>>>>>> This way, we can determine the order of Expressions by
>>>>>>>>>>>>>>>>> looking at what requiredOrdering() returns.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, Mar 26, 2018 at 5:45 PM, Ryan Blue <
>>>>>>>>>>>>>>>>> rb...@netflix.com.invalid> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Pat,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks for starting the discussion on this, we’re really
>>>>>>>>>>>>>>>>>> interested in it as well. I don’t think there is a proposed 
>>>>>>>>>>>>>>>>>> API yet, but I
>>>>>>>>>>>>>>>>>> was thinking something like this:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> interface RequiresClustering {
>>>>>>>>>>>>>>>>>>   List<Expression> requiredClustering();
>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> interface RequiresSort {
>>>>>>>>>>>>>>>>>>   List<SortOrder> requiredOrdering();
>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The reason why RequiresClustering should provide
>>>>>>>>>>>>>>>>>> Expression is that it needs to be able to customize the
>>>>>>>>>>>>>>>>>> implementation. For example, writing to HTable would require 
>>>>>>>>>>>>>>>>>> building a key
>>>>>>>>>>>>>>>>>> (or the data for a key) and that might use a hash function 
>>>>>>>>>>>>>>>>>> that differs
>>>>>>>>>>>>>>>>>> from Spark’s built-ins. RequiresSort is fairly
>>>>>>>>>>>>>>>>>> straightforward, but the interaction between the two 
>>>>>>>>>>>>>>>>>> requirements deserves
>>>>>>>>>>>>>>>>>> some consideration. To make the two compatible, I think that
>>>>>>>>>>>>>>>>>> RequiresSort must be interpreted as a sort within each
>>>>>>>>>>>>>>>>>> partition of the clustering, but could possibly be used for 
>>>>>>>>>>>>>>>>>> a global sort
>>>>>>>>>>>>>>>>>> when the two overlap.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> For example, if I have a table partitioned by “day” and
>>>>>>>>>>>>>>>>>> “category” then the RequiredClustering would be by day,
>>>>>>>>>>>>>>>>>> category. A required sort might be day ASC, category
>>>>>>>>>>>>>>>>>> DESC, name ASC. Because that sort satisfies the required
>>>>>>>>>>>>>>>>>> clustering, it could be used for a global ordering. But, is 
>>>>>>>>>>>>>>>>>> that useful?
>>>>>>>>>>>>>>>>>> How would the global ordering matter beyond a sort within 
>>>>>>>>>>>>>>>>>> each partition,
>>>>>>>>>>>>>>>>>> i.e., how would the partition’s place in the global ordering 
>>>>>>>>>>>>>>>>>> be passed?
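>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> At the DataFrame level, the two preparations would look roughly like
>>>>>>>>>>>>>>>>>> this (illustrative only; df stands for the input):
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> import org.apache.spark.sql.functions.col
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> // Clustering plus an in-partition sort: co-locate each (day, category)
>>>>>>>>>>>>>>>>>> // group, then sort inside every partition.
>>>>>>>>>>>>>>>>>> val clustered = df
>>>>>>>>>>>>>>>>>>   .repartition(col("day"), col("category"))
>>>>>>>>>>>>>>>>>>   .sortWithinPartitions(col("day").asc, col("category").desc, col("name").asc)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> // Or a single global sort, which also yields the in-partition ordering
>>>>>>>>>>>>>>>>>> // and additionally orders the partitions themselves.
>>>>>>>>>>>>>>>>>> val globallySorted = df.sort(col("day").asc, col("category").desc, col("name").asc)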
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> To your other questions, you might want to have a look at
>>>>>>>>>>>>>>>>>> the recent SPIP I’m working on to consolidate and clean
>>>>>>>>>>>>>>>>>> up logical plans
>>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5a987801#heading=h.m45webtwxf2d>.
>>>>>>>>>>>>>>>>>> That proposes more specific uses for the DataSourceV2 API 
>>>>>>>>>>>>>>>>>> that should help
>>>>>>>>>>>>>>>>>> clarify what validation needs to take place. As for custom 
>>>>>>>>>>>>>>>>>> catalyst rules,
>>>>>>>>>>>>>>>>>> I’d like to hear about the use cases to see if we can build 
>>>>>>>>>>>>>>>>>> it into these
>>>>>>>>>>>>>>>>>> improvements.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> rb
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Mon, Mar 26, 2018 at 8:40 AM, Patrick Woody <
>>>>>>>>>>>>>>>>>> patrick.woo...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hey all,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I saw in some of the discussions around DataSourceV2
>>>>>>>>>>>>>>>>>>> writes that we might have the data source inform Spark of 
>>>>>>>>>>>>>>>>>>> requirements for
>>>>>>>>>>>>>>>>>>> the input data's ordering and partitioning. Has there been 
>>>>>>>>>>>>>>>>>>> a proposed API
>>>>>>>>>>>>>>>>>>> for that yet?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Even one level up it would be helpful to understand how
>>>>>>>>>>>>>>>>>>> I should be thinking about the responsibility of the data 
>>>>>>>>>>>>>>>>>>> source writer,
>>>>>>>>>>>>>>>>>>> when I should be inserting a custom catalyst rule, and how 
>>>>>>>>>>>>>>>>>>> I should handle
>>>>>>>>>>>>>>>>>>> validation/assumptions of the table before attempting the 
>>>>>>>>>>>>>>>>>>> write.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>>>>> Pat
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Ryan Blue
>>>>>>>>>>> Software Engineer
>>>>>>>>>>> Netflix
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Software Engineer
>>>>>>>>> Netflix
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
