This vote passes with 3 binding +1 votes, 5 non-binding +1 votes, and no -1 votes.
Thanks all!

+1 votes (binding):
Wenchen Fan
Reynold Xin
Cheng Lian

+1 votes (non-binding):
Xiao Li
Weichen Xu
Vaquar Khan
Liwei Lin
Dongjoon Hyun

On Tue, Oct 17, 2017 at 12:30 AM, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

> +1
>
> On Sun, Oct 15, 2017 at 11:43 PM, Cheng Lian <lian.cs....@gmail.com> wrote:
>
>> +1
>>
>> On 10/12/17 20:10, Liwei Lin wrote:
>>
>> +1 !
>>
>> Cheers,
>> Liwei
>>
>> On Thu, Oct 12, 2017 at 7:11 PM, vaquar khan <vaquar.k...@gmail.com> wrote:
>>
>>> +1
>>>
>>> Regards,
>>> Vaquar khan
>>>
>>> On Oct 11, 2017 10:14 PM, "Weichen Xu" <weichen...@databricks.com> wrote:
>>>
>>> +1
>>>
>>> On Thu, Oct 12, 2017 at 10:36 AM, Xiao Li <gatorsm...@gmail.com> wrote:
>>>
>>>> +1
>>>>
>>>> Xiao
>>>>
>>>> On Mon, 9 Oct 2017 at 7:31 PM Reynold Xin <r...@databricks.com> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> One thing with MetadataSupport: it's a bad idea to call it that unless
>>>>> adding new functions to that trait wouldn't break source/binary
>>>>> compatibility in the future.
>>>>>
>>>>> On Mon, Oct 9, 2017 at 6:07 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>
>>>>>> I'm adding my own +1 (binding).
>>>>>>
>>>>>> On Tue, Oct 10, 2017 at 9:07 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>
>>>>>>> I'm going to update the proposal: for the last point, although the
>>>>>>> user-facing API (`df.write.format(...).option(...).mode(...).save()`)
>>>>>>> mixes data and metadata operations, we are still able to separate them
>>>>>>> in the data source write API. We can have a mix-in trait
>>>>>>> `MetadataSupport` with a method `create(options)`, so that data sources
>>>>>>> can mix in this trait and provide metadata creation support. Spark will
>>>>>>> call this `create` method inside `DataFrameWriter.save` if the
>>>>>>> specified data source has it.
>>>>>>>
>>>>>>> Note that file-format data sources can ignore this new trait and still
>>>>>>> write data without metadata (they don't have metadata anyway).
>>>>>>>
>>>>>>> With this updated proposal, I'm calling a new vote for the data source
>>>>>>> v2 write path.
>>>>>>>
>>>>>>> The vote will be up for the next 72 hours. Please reply with your vote:
>>>>>>>
>>>>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>>>>> +0: Don't really care.
>>>>>>> -1: I don't think this is a good idea because of the following
>>>>>>> technical reasons.
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> On Tue, Oct 3, 2017 at 12:03 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> Now that we have merged the infrastructure of the data source v2 read
>>>>>>>> path and had some discussion about the write path, I'm sending this
>>>>>>>> email to call a vote for the Data Source v2 write path.
>>>>>>>>
>>>>>>>> The full document of the Data Source API V2 is:
>>>>>>>> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit
>>>>>>>>
>>>>>>>> The ready-for-review PR that implements the basic infrastructure for
>>>>>>>> the write path:
>>>>>>>> https://github.com/apache/spark/pull/19269
>>>>>>>>
>>>>>>>> The Data Source V1 write path asks implementations to write a
>>>>>>>> DataFrame directly, which is painful:
>>>>>>>> 1. Exposing an upper-level API like DataFrame to the Data Source API
>>>>>>>> is bad for maintenance.
>>>>>>>> 2. Data sources may need to preprocess the input data before writing,
>>>>>>>> e.g., cluster/sort the input by some columns. It's better to do the
>>>>>>>> preprocessing in Spark than in the data source.
>>>>>>>> 3. Data sources need to take care of transactions themselves, which
>>>>>>>> is hard. And different data sources may come up with very similar
>>>>>>>> approaches to transactions, which leads to a lot of duplicated code.
>>>>>>>>
>>>>>>>> To solve these pain points, I'm proposing the data source v2 writing
>>>>>>>> framework, which is very similar to the reading framework, i.e.,
>>>>>>>> WriteSupport -> DataSourceV2Writer -> DataWriterFactory -> DataWriter.
>>>>>>>>
>>>>>>>> The Data Source V2 write path follows the existing FileCommitProtocol
>>>>>>>> and has task/job-level commit/abort, so that data sources can
>>>>>>>> implement transactions more easily.
>>>>>>>>
>>>>>>>> We can create a mix-in trait for DataSourceV2Writer to specify
>>>>>>>> requirements on the input data, like clustering and ordering.
>>>>>>>>
>>>>>>>> Spark provides a very simple protocol for users to connect to data
>>>>>>>> sources. A common way to write a dataframe to a data source is
>>>>>>>> `df.write.format(...).option(...).mode(...).save()`. Spark passes the
>>>>>>>> options and save mode to the data source and schedules the write job
>>>>>>>> on the input data. The data source should take care of the metadata,
>>>>>>>> e.g., the JDBC data source can create the table if it doesn't exist,
>>>>>>>> or fail the job and ask users to create the table in the
>>>>>>>> corresponding database first. Data sources can define options for
>>>>>>>> users to carry metadata information like partitioning/bucketing.
>>>>>>>>
>>>>>>>> The vote will be up for the next 72 hours. Please reply with your
>>>>>>>> vote:
>>>>>>>>
>>>>>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>>>>>> +0: Don't really care.
>>>>>>>> -1: I don't think this is a good idea because of the following
>>>>>>>> technical reasons.
>>>>>>>>
>>>>>>>> Thanks!
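
For readers skimming the thread, a minimal Scala sketch of the `MetadataSupport` mix-in Wenchen describes above. Only the trait name and the `create(options)` method come from the proposal; the marker trait and all signatures are assumptions made for illustration, not the code in the PR.

    // Hypothetical sketch: only the name MetadataSupport and the method
    // create(options) are from the proposal; everything else is assumed.

    trait DataSourceV2 // stand-in marker trait for a v2 data source

    trait MetadataSupport { self: DataSourceV2 =>
      // Spark would call this inside DataFrameWriter.save, before writing
      // any data; e.g. a JDBC source could run CREATE TABLE IF NOT EXISTS.
      def create(options: Map[String, String]): Unit
    }

    // A file-format source simply doesn't mix in MetadataSupport and keeps
    // writing data without any metadata step.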
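Likewise, a rough sketch of the proposed write-path chain, including the job/task-level commit/abort and the input-requirement mix-in mentioned in the original proposal. The four names WriteSupport, DataSourceV2Writer, DataWriterFactory, and DataWriter are from the email; every signature below is an illustrative assumption, not the actual API in PR #19269.

    // Illustrative sketch under assumed signatures.

    trait WriterCommitMessage extends Serializable

    trait WriteSupport {
      // Entry point on the driver; returns a writer for this job's options.
      def createWriter(options: Map[String, String]): DataSourceV2Writer
    }

    trait DataSourceV2Writer {
      // The factory is serialized and sent to executors.
      def createWriterFactory(): DataWriterFactory

      // Job-level commit/abort, called on the driver once all tasks finish,
      // mirroring the task/job split of FileCommitProtocol.
      def commit(messages: Seq[WriterCommitMessage]): Unit
      def abort(messages: Seq[WriterCommitMessage]): Unit
    }

    trait DataWriterFactory extends Serializable {
      def createWriter(partitionId: Int, attemptNumber: Int): DataWriter
    }

    trait DataWriter {
      def write(record: Seq[Any]): Unit // Seq[Any] stands in for Spark's row type
      def commit(): WriterCommitMessage // task-level commit
      def abort(): Unit                 // task-level rollback
    }

    // One possible shape for the "requirement for input data" mix-in: Spark
    // could read these and cluster/sort the input before invoking writers.
    trait SupportsInputRequirements { self: DataSourceV2Writer =>
      def requiredClustering: Seq[String] = Nil
      def requiredOrdering: Seq[String] = Nil
    }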
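Finally, how the user-facing protocol quoted above would drive all of this, as a runnable snippet. The format class name and option key are placeholders, not a real data source; everything else is the stock DataFrame writer API.

    import org.apache.spark.sql.SparkSession

    object V2WriteExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("datasource-v2-write-demo")
          .master("local[*]")
          .getOrCreate()

        val df = spark.range(100).toDF("id")

        // Spark passes the options and save mode to the data source and
        // schedules the write job; a source mixing in MetadataSupport could
        // create the target table inside save() before any data is written.
        df.write
          .format("com.example.MyV2Source") // placeholder class name
          .option("table", "demo_table")    // placeholder option
          .mode("append")
          .save()

        spark.stop()
      }
    }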