Ah yes I agree. I was just saying it should be options (rather than
specific constructs). Having them at creation time makes a lot of sense.
One tricky thing, though, is what happens if they need to change, but we can
probably just special-case that.

On Thu, Sep 21, 2017 at 6:28 PM Ryan Blue <rb...@netflix.com> wrote:

> I’d just pass them [partitioning/bucketing] as options, until there are
> clear (and strong) use cases to do them otherwise.
>
> I don’t think it makes sense to pass partitioning and bucketing
> information *into* this API. The writer should already know the table
> structure and should pass relevant information back out to Spark so it can
> sort and group data for storage.
>
> I think the idea of passing the table structure into the writer comes from
> the current implementation, where the table may not exist before a data
> frame is written. But that isn’t something that should be carried forward.
> I think the writer should be responsible for writing into an
> already-configured table. That’s the normal case we should design for.
> Creating a table at the same time (CTAS) is a convenience, but should be
> implemented by creating an empty table and then running the same writer
> that would have been used for an insert into an existing table.
>
> Otherwise, there’s confusion about how to handle the options. What should
> the writer do when partitioning passed in doesn’t match the table’s
> partitioning? We already have this situation in the DataFrameWriter API,
> where calling partitionBy and then insertInto throws an exception. I’d
> like to keep that case out of this API by setting the expectation that
> tables this writes to already exist.
>
> rb
>
> On Wed, Sep 20, 2017 at 9:52 AM, Reynold Xin <r...@databricks.com> wrote:
>
>>
>>
>> On Wed, Sep 20, 2017 at 3:10 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I want to have some discussion about Data Source V2 write path before
>>> starting a voting.
>>>
>>> The Data Source V1 write path asks implementations to write a DataFrame
>>> directly, which is painful:
>>> 1. Exposing an upper-level API like DataFrame to the Data Source API is
>>> not good for maintenance.
>>> 2. Data sources may need to preprocess the input data before writing,
>>> e.g. cluster/sort the input by some columns. It's better to do this
>>> preprocessing in Spark instead of in the data source.
>>> 3. Data sources need to take care of transactions themselves, which is
>>> hard. Different data sources may come up with very similar approaches to
>>> transactions, which leads to a lot of duplicated code.
>>>
>>>
>>> To solve these pain points, I'm proposing a data source writing
>>> framework which is very similar to the reading framework, i.e.,
>>> WriteSupport -> DataSourceV2Writer -> WriteTask -> DataWriter. You can take
>>> a look at my prototype to see what it looks like:
>>> https://github.com/apache/spark/pull/19269
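>>>
>>> To make the shape concrete, here is a minimal Scala sketch of the chain
>>> (illustrative only; the prototype in the PR above has the real
>>> signatures):
>>>
>>>   import org.apache.spark.sql.Row
>>>   import org.apache.spark.sql.types.StructType
>>>
>>>   // Driver side: entry point mixed into the data source.
>>>   trait WriteSupport {
>>>     def createWriter(schema: StructType, options: Map[String, String]): DataSourceV2Writer
>>>   }
>>>
>>>   // Driver side: one instance per logical write job.
>>>   trait DataSourceV2Writer {
>>>     def createWriteTask(): WriteTask                        // serialized and sent to executors
>>>     def commit(messages: Array[WriterCommitMessage]): Unit  // job-level commit
>>>     def abort(messages: Array[WriterCommitMessage]): Unit   // job-level rollback
>>>   }
>>>
>>>   trait WriteTask extends Serializable {
>>>     def createDataWriter(partitionId: Int, attemptNumber: Int): DataWriter
>>>   }
>>>
>>>   // Executor side: writes the rows of one partition.
>>>   trait DataWriter {
>>>     def write(row: Row): Unit
>>>     def commit(): WriterCommitMessage   // task-level commit
>>>     def abort(): Unit                   // task-level rollback
>>>   }
>>>
>>>   trait WriterCommitMessage extends Serializable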
>>>
>>> There are some other details that need further discussion:
>>> 1. *partitioning/bucketing*
>>> Currently only the built-in file-based data sources support them, but
>>> there is nothing stopping us from exposing them to all data sources. One
>>> question is: shall we make them mix-in interfaces for the data source v2
>>> reader/writer, or just encode them into data source options (a
>>> string-to-string map)? Ideally they are more like options: Spark just
>>> passes this user-given information through to the data source and doesn't
>>> act on it.
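>>>
>>> As a concrete example of the options route (the key names below are made
>>> up, just to show the shape):
>>>
>>>   // Given an existing DataFrame `df`: partitioning/bucketing become plain
>>>   // entries in the string-to-string map; the data source interprets them
>>>   // itself and Spark does not act on them.
>>>   df.write
>>>     .format("com.example.datasource")          // hypothetical source
>>>     .option("partitionColumns", "region,day")
>>>     .option("bucketColumns", "user_id")
>>>     .option("numBuckets", "16")
>>>     .save()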
>>>
>>
>>
>> I'd just pass them as options, until there are clear (and strong) use
>> cases to do them otherwise.
>>
>>
>> +1 on the rest.
>>
>>
>>
>>>
>>> 2. *input data requirement*
>>> Data sources should be able to ask Spark to preprocess the input data;
>>> this can be a mix-in interface for DataSourceV2Writer. I think we need to
>>> add a clustering requirement and a sort-within-partitions requirement.
>>> Anything else?
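>>>
>>> For example, the mix-in could look something like this (names are
>>> hypothetical, just to illustrate the idea):
>>>
>>>   // The writer reports its preprocessing requirements back to Spark,
>>>   // which clusters/sorts the input before handing it to the DataWriters.
>>>   trait SupportsInputRequirements { self: DataSourceV2Writer =>
>>>     def requiredClustering: Seq[String] = Nil                 // cluster input by these columns
>>>     def requiredOrderingWithinPartitions: Seq[String] = Nil   // sort within each partition
>>>   }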
>>>
>>> 3. *transaction*
>>> I think we can just follow `FileCommitProtocol`, the internal framework
>>> Spark uses to guarantee transactional writes for the built-in file-based
>>> data sources. Generally speaking, we need task-level and job-level
>>> commit/abort. Again, you can see more details in my prototype:
>>> https://github.com/apache/spark/pull/19269
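>>>
>>> Using the interfaces sketched above, the two-level flow would roughly be
>>> (again a sketch, not the exact prototype API):
>>>
>>>   // Executor side, per task:
>>>   def runTask(task: WriteTask, rows: Iterator[Row],
>>>               partitionId: Int, attempt: Int): WriterCommitMessage = {
>>>     val writer = task.createDataWriter(partitionId, attempt)
>>>     try {
>>>       rows.foreach(writer.write)
>>>       writer.commit()              // task-level commit; message goes back to the driver
>>>     } catch {
>>>       case t: Throwable =>
>>>         writer.abort()             // task-level rollback
>>>         throw t
>>>     }
>>>   }
>>>
>>>   // Driver side, after all tasks have run:
>>>   //   all tasks committed -> dataSourceV2Writer.commit(taskMessages)  (job-level commit)
>>>   //   any task failed     -> dataSourceV2Writer.abort(taskMessages)   (job-level rollback)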
>>>
>>> 4. *data source table*
>>> This is the trickiest one. In Spark you can create a table that points
>>> to a data source, so you can read/write that data source easily by
>>> referencing the table name. Ideally a data source table is just a pointer
>>> to a data source with a list of predefined options, saving users from
>>> typing these options again and again for each query.
>>> If that were all, everything would be fine and we wouldn't need to add
>>> more interfaces to Data Source V2. However, data source tables also
>>> support special operators like ALTER TABLE SCHEMA, ADD PARTITION, etc.,
>>> which require extra abilities from the data source.
>>> Currently these special operators only work for the built-in file-based
>>> data sources, and I don't think we will extend them in the near future,
>>> so I propose to mark them as out of scope.
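>>>
>>> To illustrate the "pointer plus predefined options" idea (the data source
>>> name and option keys here are made up):
>>>
>>>   // Given a SparkSession `spark`: bind a name to a source and its options once...
>>>   spark.sql(
>>>     """CREATE TABLE events
>>>       |USING com.example.datasource
>>>       |OPTIONS (path '/data/events', mode 'strict')""".stripMargin)
>>>
>>>   // ...then read and write by table name, without re-typing the options.
>>>   val df = spark.table("events")
>>>   df.write.insertInto("events")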
>>>
>>>
>>> Any comments are welcome!
>>> Thanks,
>>> Wenchen
>>>
>>
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
