Ah yes, I agree. I was just saying it should be options (rather than specific constructs). Having them at creation time makes a lot of sense. One tricky thing is what happens if they need to change later, but we can probably just special-case that.
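For example, the partitioning/bucketing intent could ride along as plain string options that Spark forwards untouched to the data source. A rough sketch of that idea, where the format name and option keys are made up purely for illustration (the same keys could just as well be supplied once at table creation time):

import org.apache.spark.sql.SparkSession

object OptionsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("options-sketch").getOrCreate()
    val df = spark.range(100).selectExpr("id", "id % 10 AS bucketKey")

    // Partitioning/bucketing carried as ordinary string options; Spark just
    // hands them to the data source and does nothing with them itself.
    df.write
      .format("com.example.v2source")          // hypothetical V2 data source
      .option("partitionColumns", "bucketKey") // illustrative option key
      .option("numBuckets", "8")               // illustrative option key
      .save()

    spark.stop()
  }
}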
On Thu, Sep 21, 2017 at 6:28 PM Ryan Blue <rb...@netflix.com> wrote:

> I’d just pass them [partitioning/bucketing] as options, until there are clear (and strong) use cases to do them otherwise.
>
> I don’t think it makes sense to pass partitioning and bucketing information *into* this API. The writer should already know the table structure and should pass relevant information back out to Spark so it can sort and group data for storage.
>
> I think the idea of passing the table structure into the writer comes from the current implementation, where the table may not exist before a data frame is written. But that isn’t something that should be carried forward. I think the writer should be responsible for writing into an already-configured table. That’s the normal case we should design for. Creating a table at the same time (CTAS) is a convenience, but it should be implemented by creating an empty table and then running the same writer that would have been used for an insert into an existing table.
>
> Otherwise, there’s confusion about how to handle the options. What should the writer do when the partitioning passed in doesn’t match the table’s partitioning? We already have this situation in the DataFrameWriter API, where calling partitionBy and then insertInto throws an exception. I’d like to keep that case out of this API by setting the expectation that the tables this writes to already exist.
>
> rb
>
> On Wed, Sep 20, 2017 at 9:52 AM, Reynold Xin <r...@databricks.com> wrote:
>
>> On Wed, Sep 20, 2017 at 3:10 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I want to have some discussion about the Data Source V2 write path before starting a vote.
>>>
>>> The Data Source V1 write path asks implementations to write a DataFrame directly, which is painful:
>>> 1. Exposing an upper-level API like DataFrame to the Data Source API is not good for maintenance.
>>> 2. Data sources may need to preprocess the input data before writing, e.g. clustering/sorting the input by some columns. It's better to do the preprocessing in Spark instead of in the data source.
>>> 3. Data sources need to take care of transactions themselves, which is hard. And different data sources may come up with very similar approaches for transactions, which leads to a lot of duplicated code.
>>>
>>> To solve these pain points, I'm proposing a data source writing framework which is very similar to the reading framework, i.e., WriteSupport -> DataSourceV2Writer -> WriteTask -> DataWriter. You can take a look at my prototype to see what it looks like: https://github.com/apache/spark/pull/19269
>>>
>>> There are some other details that need further discussion:
>>>
>>> 1. *partitioning/bucketing*
>>> Currently only the built-in file-based data sources support them, but there is nothing stopping us from exposing them to all data sources. One question is: shall we make them mix-in interfaces for the data source v2 reader/writer, or just encode them into the data source options (a string-to-string map)? Ideally they are more like options: Spark just passes this user-given information to the data source and doesn't do anything with it.
>>
>> I'd just pass them as options, until there are clear (and strong) use cases to do them otherwise.
>>
>> +1 on the rest.
>>
>>> 2. *input data requirement*
>>> Data sources should be able to ask Spark to preprocess the input data, and this can be a mix-in interface for DataSourceV2Writer.
>>> I think we need to add a clustering request and a sorting-within-partitions request; anything more?
>>>
>>> 3. *transaction*
>>> I think we can just follow `FileCommitProtocol`, which is the internal framework Spark uses to guarantee transactions for the built-in file-based data sources. Generally speaking, we need task-level and job-level commit/abort. Again, you can see more details about it in my prototype: https://github.com/apache/spark/pull/19269
>>>
>>> 4. *data source table*
>>> This is the trickiest one. In Spark you can create a table which points to a data source, so you can read/write this data source easily by referencing the table name. Ideally a data source table is just a pointer to a data source with a list of predefined options, to save users from typing these options again and again for each query.
>>> If that's all, then everything is good and we don't need to add more interfaces to Data Source V2. However, data source tables provide special operators like ALTER TABLE SCHEMA, ADD PARTITION, etc., which require data sources to have some extra abilities.
>>> Currently these special operators only work for the built-in file-based data sources, and I don't think we will extend them in the near future, so I propose to mark them as out of scope.
>>>
>>> Any comments are welcome!
>>> Thanks,
>>> Wenchen

> --
> Ryan Blue
> Software Engineer
> Netflix
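For reference, a rough sketch of how the interfaces discussed above might hang together, including task-level and job-level commit/abort and a possible mix-in for the input data requirement. The trait names follow the prototype loosely and the signatures are simplified for illustration; they are not the actual API:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

// Mixed into a data source that supports writing.
trait WriteSupport {
  def createWriter(jobId: String, schema: StructType): DataSourceV2Writer
}

// Driver side: produces write tasks and performs job-level commit/abort.
trait DataSourceV2Writer {
  def createWriteTask(): WriteTask
  def commit(messages: Seq[WriterCommitMessage]): Unit   // job-level commit
  def abort(messages: Seq[WriterCommitMessage]): Unit    // job-level abort
}

// Serialized to executors; creates one DataWriter per partition attempt.
trait WriteTask extends Serializable {
  def createDataWriter(partitionId: Int, attemptNumber: Int): DataWriter
}

// Executor side: writes rows and performs task-level commit/abort.
trait DataWriter {
  def write(row: Row): Unit
  def commit(): WriterCommitMessage   // task-level commit
  def abort(): Unit                   // task-level abort
}

// Opaque message carried from a committed task back to the driver.
trait WriterCommitMessage extends Serializable

// Possible mix-in for the "input data requirement": the writer asks Spark to
// cluster and/or sort the input before it reaches the DataWriters.
trait SupportsInputRequirement { self: DataSourceV2Writer =>
  def requiredClustering: Seq[String]   // columns to cluster by
  def requiredOrdering: Seq[String]     // sort columns within each partition
}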