> input data requirement

Clustering and sorting within partitions are a good start. We can always add more later when they are needed.
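For concreteness, here is a minimal sketch of what those two requests might look like as a mix-in for the writer. The trait and method names below are made up for illustration; they are not the prototype's API.

// Illustrative only: a hypothetical mix-in a writer could implement to ask
// Spark to cluster the input by some columns and sort it within partitions.
// Names are invented for this sketch, not taken from the prototype PR.
trait SupportsWriteRequirements {
  // Columns Spark should cluster (shuffle) the input by before writing,
  // e.g. the partition columns of a partitioned table format.
  def requiredClustering: Seq[String]

  // Sort order to apply within each partition of the clustered input.
  def requiredOrderingWithinPartitions: Seq[String]
}

// A hypothetical partitioned-table write config: cluster by the partition
// columns, and sort by a timestamp column inside each partition.
class PartitionedTableWriteConfig(partitionColumns: Seq[String])
    extends SupportsWriteRequirements {
  override def requiredClustering: Seq[String] = partitionColumns
  override def requiredOrderingWithinPartitions: Seq[String] = Seq("event_ts")
}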
The primary use case I'm thinking of for this is partitioning and bucketing. If I'm implementing a partitioned table format, I need to tell Spark to cluster by my partition columns. Should there also be a way to pass those columns separately, since they may not be stored the same way that partitions are in the current format?

On Wed, Sep 20, 2017 at 3:10 AM, Wenchen Fan <cloud0...@gmail.com> wrote:

> Hi all,
>
> I want to have some discussion about the Data Source V2 write path before starting a vote.
>
> The Data Source V1 write path asks implementations to write a DataFrame directly, which is painful:
> 1. Exposing an upper-level API like DataFrame to the Data Source API is not good for maintenance.
> 2. Data sources may need to preprocess the input data before writing, e.g. cluster/sort the input by some columns. It's better to do the preprocessing in Spark instead of in the data source.
> 3. Data sources need to take care of transactions themselves, which is hard. And different data sources may come up with very similar approaches to transactions, which leads to a lot of duplicated code.
>
> To solve these pain points, I'm proposing a data source writing framework which is very similar to the reading framework, i.e., WriteSupport -> DataSourceV2Writer -> WriteTask -> DataWriter. You can take a look at my prototype to see what it looks like: https://github.com/apache/spark/pull/19269
>
> There are some other details that need further discussion:
>
> 1. *partitioning/bucketing*
> Currently only the built-in file-based data sources support them, but there is nothing stopping us from exposing them to all data sources. One question is: shall we make them mix-in interfaces for the data source v2 reader/writer, or just encode them into the data source options (a string-to-string map)? Ideally they are more like options: Spark just passes this user-given information to the data source and doesn't do anything with it.
>
> 2. *input data requirement*
> Data sources should be able to ask Spark to preprocess the input data, and this can be a mix-in interface for DataSourceV2Writer. I think we need to add a clustering request and a sorting-within-partitions request; anything more?
>
> 3. *transaction*
> I think we can just follow `FileCommitProtocol`, which is the internal framework Spark uses to guarantee transactions for the built-in file-based data sources. Generally speaking, we need task-level and job-level commit/abort. Again, you can see more details in my prototype: https://github.com/apache/spark/pull/19269
>
> 4. *data source table*
> This is the trickiest one. In Spark you can create a table which points to a data source, so you can read/write this data source easily by referencing the table name. Ideally a data source table is just a pointer to a data source with a list of predefined options, to save users from typing these options again and again for each query.
> If that's all, then everything is good and we don't need to add more interfaces to Data Source V2. However, data source tables provide special operators like ALTER TABLE SCHEMA, ADD PARTITION, etc., which require data sources to have some extra abilities.
> Currently these special operators only work for the built-in file-based data sources, and I don't think we will extend them in the near future, so I propose to mark them as out of scope.
>
> Any comments are welcome!
> Thanks,
> Wenchen

--
Ryan Blue
Software Engineer
Netflix
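For reference, below is a simplified sketch of the four-layer write path and the task-level/job-level commit hooks described in the quoted proposal. Trait names match the proposal, but the method names and signatures are approximate, not copied from the prototype PR.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

// Simplified sketch of the proposed write framework (approximate signatures):
// WriteSupport -> DataSourceV2Writer -> WriteTask -> DataWriter.

// Mixed into a data source that supports writes; used on the driver.
trait WriteSupport {
  def createWriter(schema: StructType, options: Map[String, String]): DataSourceV2Writer
}

// Driver-side writer: creates the per-task write factory and performs the
// job-level commit/abort once all tasks have reported back.
trait DataSourceV2Writer {
  def createWriteTask(): WriteTask
  def commit(messages: Seq[WriterCommitMessage]): Unit  // job-level commit
  def abort(): Unit                                      // job-level abort
}

// Serialized and sent to executors; creates one DataWriter per task attempt.
trait WriteTask extends Serializable {
  def createDataWriter(partitionId: Int, attemptNumber: Int): DataWriter
}

// Executor-side writer for a single task attempt.
trait DataWriter {
  def write(row: Row): Unit
  def commit(): WriterCommitMessage                      // task-level commit
  def abort(): Unit                                      // task-level abort
}

// Opaque message a committed task sends back to the driver; the driver
// passes all of them to DataSourceV2Writer.commit to finish the job.
trait WriterCommitMessage extends Serializable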