We still need to support low-level data sources like plain Parquet files, which do not have a metastore.
BTW, I think we should leave metadata management to the catalog API after catalog federation. The data source API should only care about data.

On Mon, Sep 25, 2017 at 11:14 AM, Reynold Xin <r...@databricks.com> wrote:

Can there be an explicit create function?

On Sun, Sep 24, 2017 at 7:17 PM, Wenchen Fan <cloud0...@gmail.com> wrote:

I agree it would be a clean approach if the data source were only responsible for writing into an already-configured table. However, without catalog federation, Spark doesn't have an API to ask an external system (like Cassandra) to create a table; currently that is all done by the data source write API, and implementations are responsible for creating a table or inserting into it according to the save mode.

As a workaround, I think it's acceptable to pass partitioning/bucketing information via data source options. Data sources should then decide whether to take this information and create the table, or throw an exception if it doesn't match the already-configured table.

On Fri, Sep 22, 2017 at 9:35 AM, Ryan Blue <rb...@netflix.com> wrote:

> input data requirement

Clustering and sorting within partitions are a good start. We can always add more later when they are needed.

The primary use case I'm thinking of for this is partitioning and bucketing. If I'm implementing a partitioned table format, I need to tell Spark to cluster by my partition columns. Should there also be a way to pass those columns separately, since they may not be stored the same way partitions are in the current format?

--
Ryan Blue
Software Engineer
Netflix

On Wed, Sep 20, 2017 at 3:10 AM, Wenchen Fan <cloud0...@gmail.com> wrote:

Hi all,

I want to have some discussion about the Data Source V2 write path before starting a vote.

The Data Source V1 write path asks implementations to write a DataFrame directly, which is painful:
1. Exposing an upper-level API like DataFrame to the data source API is bad for maintenance.
2. Data sources may need to preprocess the input data before writing, e.g., clustering or sorting the input by some columns. It's better to do that preprocessing in Spark than in each data source.
3. Data sources need to handle transactions themselves, which is hard, and different data sources tend to arrive at very similar transaction logic, which leads to a lot of duplicated code.

To solve these pain points, I'm proposing a data source writing framework that mirrors the reading framework, i.e., WriteSupport -> DataSourceV2Writer -> WriteTask -> DataWriter. You can take a look at my prototype to see what it looks like: https://github.com/apache/spark/pull/19269
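To make the shape of the chain concrete, here is a minimal sketch. All names and signatures below are illustrative assumptions based on the description above, not the actual interfaces in the prototype PR:

```scala
// Illustrative sketch only -- signatures are assumptions, not prototype code.

// Mix-in for a data source that supports the v2 write path.
trait WriteSupport {
  def createWriter(jobId: String, options: Map[String, String]): DataSourceV2Writer
}

// Driver-side handle for a single write job.
trait DataSourceV2Writer {
  def createWriteTask(): WriteTask                      // produced on the driver
  def commit(messages: Seq[WriterCommitMessage]): Unit  // job-level commit
  def abort(): Unit                                     // job-level abort
}

// Serializable factory that is shipped to the executors.
trait WriteTask extends Serializable {
  def createDataWriter(partitionId: Int, attemptNumber: Int): DataWriter
}

// Executor-side writer for one partition attempt.
trait DataWriter {
  def write(row: Row): Unit          // called once per input row
  def commit(): WriterCommitMessage  // task-level commit
  def abort(): Unit                  // task-level abort
}

// Stand-ins for Spark's row and commit-message types.
trait Row
trait WriterCommitMessage extends Serializable
```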
There are some other details that need further discussion:

1. *partitioning/bucketing*
Currently only the built-in file-based data sources support them, but there is nothing stopping us from exposing them to all data sources. One question is: should we make them mix-in interfaces for the data source v2 reader/writer, or just encode them into the data source options (a string-to-string map)? Ideally they behave more like options: Spark just passes this user-provided information through to the data source and doesn't act on it. (See the first sketch at the end of this message.)

2. *input data requirement*
Data sources should be able to ask Spark to preprocess the input data, and this can be a mix-in interface for DataSourceV2Writer. I think we need to add a clustering requirement and a sort-within-partitions requirement; anything else? (See the second sketch at the end of this message.)

3. *transaction*
I think we can just follow `FileCommitProtocol`, the internal framework Spark uses to guarantee transactional writes for the built-in file-based data sources. Generally speaking, we need task-level and job-level commit/abort. Again, you can see more details in my prototype: https://github.com/apache/spark/pull/19269 (The third sketch at the end of this message shows the general shape.)

4. *data source table*
This is the trickiest one. In Spark you can create a table which points to a data source, so you can read/write that data source easily by referencing the table name. Ideally a data source table is just a pointer to a data source with a list of predefined options, to save users from typing these options again and again for each query.
If that were all, everything would be fine and we wouldn't need to add more interfaces to Data Source V2. However, data source tables also provide special operators like ALTER TABLE SCHEMA, ADD PARTITION, etc., which require extra capabilities from data sources. Currently these special operators only work for the built-in file-based data sources, and since I don't think we will extend them in the near future, I propose to mark them as out of scope.

Any comments are welcome!
Thanks,
Wenchen
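First, a rough sketch of the options-based encoding discussed in item 1, following the workaround described earlier in the thread. The option keys here are hypothetical, not real Spark option names:

```scala
object PartitioningViaOptions {
  def main(args: Array[String]): Unit = {
    // Hypothetical keys -- a real encoding would need to be standardized.
    val options = Map(
      "path"             -> "/data/events",
      "partitionColumns" -> "date,country",
      "numBuckets"       -> "256"
    )

    val requested: Seq[String] =
      options.get("partitionColumns").map(_.split(",").toSeq).getOrElse(Nil)
    val existing: Seq[String] = Seq("date", "country") // from the already-configured table

    // Per the workaround: use the information to create the table, or fail
    // loudly if it conflicts with the existing table's layout.
    if (requested.nonEmpty && requested != existing) {
      throw new IllegalArgumentException(
        s"Requested partitioning $requested does not match existing layout $existing")
    }
  }
}
```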
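Second, for item 2, the mix-in could look something like the following. `ClusteredDistribution` and `SortOrder` here are illustrative stand-ins defined for this sketch, not existing Spark classes:

```scala
// Illustrative requirement types; Spark would translate a clustering
// requirement into a shuffle and an ordering requirement into a local sort.
sealed trait Distribution
final case class ClusteredDistribution(columns: Seq[String]) extends Distribution

final case class SortOrder(column: String, ascending: Boolean = true)

// The proposed mix-in: a writer declares what preprocessing it needs, and
// Spark performs it before any rows reach the writer.
trait SupportsInputRequirements {
  def requiredDistribution: Option[Distribution] = None
  def requiredOrdering: Seq[SortOrder] = Nil
}

// Example: a partitioned, bucketed sink clusters by its partition columns
// (Ryan's use case) and sorts within each partition by event time.
class PartitionedSink extends SupportsInputRequirements {
  override def requiredDistribution: Option[Distribution] =
    Some(ClusteredDistribution(Seq("date", "country")))
  override def requiredOrdering: Seq[SortOrder] = Seq(SortOrder("event_time"))
}
```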
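Third, for item 3, the general shape of a `FileCommitProtocol`-style two-level commit, again with made-up names:

```scala
// Task-level results (e.g., paths of staged files) flow back to the driver,
// which commits the job only if every task committed.
trait TaskCommitMessage extends Serializable

trait TransactionalWrite {
  def writePartition(partitionId: Int): TaskCommitMessage // executor side
  def abortPartition(partitionId: Int): Unit              // executor side
  def commitJob(messages: Seq[TaskCommitMessage]): Unit   // driver side
  def abortJob(): Unit                                    // driver side
}

object CommitProtocolSketch {
  // Simplified driver-side flow: no partial output becomes visible unless
  // the job-level commit runs, and any failure aborts the whole job.
  def runJob(writer: TransactionalWrite, numPartitions: Int): Unit = {
    try {
      val messages = (0 until numPartitions).map { p =>
        try writer.writePartition(p)
        catch { case e: Throwable => writer.abortPartition(p); throw e }
      }
      writer.commitJob(messages)
    } catch {
      case e: Throwable => writer.abortJob(); throw e
    }
  }
}
```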