Re: [VOTE][SPIP] SPARK-22026 data source v2 write path

vaquar khan Thu, 12 Oct 2017 04:12:34 -0700

+1

Regards,
Vaquar khan


On Oct 11, 2017 10:14 PM, "Weichen Xu" <weichen...@databricks.com> wrote:

+1

On Thu, Oct 12, 2017 at 10:36 AM, Xiao Li <gatorsm...@gmail.com> wrote:

> +1
>
> Xiao
>
> On Mon, 9 Oct 2017 at 7:31 PM Reynold Xin <r...@databricks.com> wrote:
>
>> +1
>>
>> One thing with MetadataSupport - It's a bad idea to call it that unless
>> adding new functions in that trait wouldn't break source/binary
>> compatibility in the future.
>>
>>
>> On Mon, Oct 9, 2017 at 6:07 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> I'm adding my own +1 (binding).
>>>
>>> On Tue, Oct 10, 2017 at 9:07 AM, Wenchen Fan <cloud0...@gmail.com>
>>> wrote:
>>>
>>>> I'm going to update the proposal: for the last point, although the
>>>> user-facing API (`df.write.format(...).option(...).mode(...).save()`)
>>>> mixes data and metadata operations, we are still able to separate them in
>>>> the data source write API. We can have a mix-in trait `MetadataSupport`
>>>> which has a method `create(options)`, so that data sources can mix in this
>>>> trait and provide metadata creation support. Spark will call this `create`
>>>> method inside `DataFrameWriter.save` if the specified data source has it.
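
The optional-trait dispatch described above can be modeled in a few lines. This is only a rough sketch in plain Python, not Spark code: `MetadataSupport` and `create(options)` are the names from the proposal, while `FileSource`, `TableSource`, and the standalone `save` function are hypothetical stand-ins invented for illustration.

```python
from typing import Dict, List


class MetadataSupport:
    """Mix-in from the proposal: sources that manage metadata implement create()."""

    def create(self, options: Dict[str, str]) -> None:
        raise NotImplementedError


class FileSource:
    """Hypothetical file-format style source: no metadata, so no mix-in."""

    def __init__(self) -> None:
        self.rows: List[tuple] = []

    def write(self, rows: List[tuple]) -> None:
        self.rows.extend(rows)


class TableSource(FileSource, MetadataSupport):
    """Hypothetical table-style source that creates its table before writing."""

    def __init__(self) -> None:
        super().__init__()
        self.created_with: Dict[str, str] = {}
        self.events: List[str] = []

    def create(self, options: Dict[str, str]) -> None:
        self.created_with = dict(options)
        self.events.append("create")

    def write(self, rows: List[tuple]) -> None:
        self.events.append("write")
        super().write(rows)


def save(source, options: Dict[str, str], rows: List[tuple]) -> None:
    """Stand-in for DataFrameWriter.save (assumed shape, not the real method)."""
    # Metadata creation happens only when the source opted in to the mix-in.
    if isinstance(source, MetadataSupport):
        source.create(options)
    source.write(rows)
```

With this shape, `save(TableSource(), ...)` runs `create` before `write`, while `save(FileSource(), ...)` goes straight to `write`, which matches the note that file format sources can ignore the trait.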
>>>>
>>>> Note that file format data sources can ignore this new trait and still
>>>> write data without metadata (they don't have metadata anyway).
>>>>
>>>> With this updated proposal, I'm calling a new vote for the data source
>>>> v2 write path.
>>>>
>>>> The vote will be up for the next 72 hours. Please reply with your vote:
>>>>
>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>> +0: Don't really care.
>>>> -1: I don't think this is a good idea because of the following
>>>> technical reasons.
>>>>
>>>> Thanks!
>>>>
>>>> On Tue, Oct 3, 2017 at 12:03 AM, Wenchen Fan <cloud0...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Now that we have merged the infrastructure for the data source v2 read
>>>>> path and had some discussion of the write path, I'm sending this email
>>>>> to call a vote on the Data Source v2 write path.
>>>>>
>>>>> The full document of the Data Source API V2 is:
>>>>> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit
>>>>>
>>>>> The ready-for-review PR that implements the basic infrastructure for
>>>>> the write path:
>>>>> https://github.com/apache/spark/pull/19269
>>>>>
>>>>>
>>>>> The Data Source V1 write path asks implementations to write a
>>>>> DataFrame directly, which is painful:
>>>>> 1. Exposing an upper-level API like DataFrame to the Data Source API is
>>>>> not good for maintenance.
>>>>> 2. Data sources may need to preprocess the input data before writing,
>>>>> e.g., cluster/sort the input by some columns. It's better to do the
>>>>> preprocessing in Spark instead of in the data source.
>>>>> 3. Data sources need to take care of transactions themselves, which is
>>>>> hard. Different data sources may also come up with very similar
>>>>> approaches to transactions, which leads to a lot of duplicated code.
>>>>>
>>>>> To solve these pain points, I'm proposing the data source v2 writing
>>>>> framework which is very similar to the reading framework, i.e.,
>>>>> WriteSupport -> DataSourceV2Writer -> DataWriterFactory -> DataWriter.
>>>>>
>>>>> The Data Source V2 write path follows the existing FileCommitProtocol
>>>>> and has task/job level commit/abort, so that data sources can implement
>>>>> transactions more easily.
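
To make the chain and the two commit levels concrete, here is a minimal sketch in plain Python, without Spark. The four class names come from the proposal. The snake_case method names, the dict used as a task commit message, and the `run_write_job` driver loop are assumptions made for illustration, not the actual API.

```python
from typing import Dict, List, Optional


class DataWriter:
    """Per-task writer: buffers rows, then commits or aborts the task."""

    def __init__(self, partition_id: int) -> None:
        self.partition_id = partition_id
        self.buffer: List[int] = []

    def write(self, row: int) -> None:
        if row < 0:
            raise ValueError(f"bad row {row}")  # simulated task failure
        self.buffer.append(row)

    def commit(self) -> Dict[str, object]:
        # The returned message is what the job-level commit gets to see.
        return {"partition": self.partition_id, "rows": list(self.buffer)}

    def abort(self) -> None:
        self.buffer.clear()


class DataWriterFactory:
    """Handed to tasks; creates one DataWriter per partition."""

    def create_writer(self, partition_id: int) -> DataWriter:
        return DataWriter(partition_id)


class DataSourceV2Writer:
    """Job-level writer: provides the factory and finalizes the job."""

    def __init__(self) -> None:
        self.committed: List[int] = []
        self.aborted = False

    def create_writer_factory(self) -> DataWriterFactory:
        return DataWriterFactory()

    def commit(self, messages: List[Dict[str, object]]) -> None:
        # Output becomes visible only here, after every task has committed.
        for message in messages:
            self.committed.extend(message["rows"])

    def abort(self) -> None:
        self.aborted = True


class WriteSupport:
    """Entry point a data source mixes in to become writable."""

    def create_writer(self, options: Dict[str, str]) -> DataSourceV2Writer:
        return DataSourceV2Writer()


def run_write_job(source: WriteSupport, partitions: List[List[int]],
                  options: Optional[Dict[str, str]] = None) -> DataSourceV2Writer:
    """Hypothetical single-process stand-in for the engine's write job."""
    writer = source.create_writer(options or {})
    factory = writer.create_writer_factory()
    messages = []
    try:
        for pid, rows in enumerate(partitions):
            task = factory.create_writer(pid)
            try:
                for row in rows:
                    task.write(row)
                messages.append(task.commit())
            except Exception:
                task.abort()  # task-level abort
                raise
    except Exception:
        writer.abort()  # job-level abort: nothing is published
        return writer
    writer.commit(messages)  # job-level commit
    return writer
```

In this model a failure in any one task aborts that task and then the whole job, so `committed` stays empty unless every partition succeeded.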
>>>>>
>>>>> We can create a mix-in trait for DataSourceV2Writer to specify the
>>>>> requirement for input data, like clustering and ordering.
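
One possible shape for such a requirement mix-in, again as a plain-Python sketch rather than Spark code. The proposal does not name the trait or its methods, so `RequiresClusteringAndOrdering`, `required_clustering`, `required_ordering`, `BucketedWriter`, and `prepare_input` are all hypothetical. The point is only that the engine, not the data source, does the preprocessing.

```python
from typing import Dict, List, Sequence


class RequiresClusteringAndOrdering:
    """Hypothetical mix-in: declares how the input rows must be arranged."""

    def required_clustering(self) -> Sequence[str]:
        return ()

    def required_ordering(self) -> Sequence[str]:
        return ()


class BucketedWriter(RequiresClusteringAndOrdering):
    """Hypothetical writer that wants rows grouped by country, sorted by ts."""

    def required_clustering(self) -> Sequence[str]:
        return ("country",)

    def required_ordering(self) -> Sequence[str]:
        return ("ts",)


def prepare_input(writer, rows: List[Dict[str, object]],
                  num_partitions: int) -> List[List[Dict[str, object]]]:
    """Engine-side preprocessing: cluster rows by key, then sort each partition."""
    if not isinstance(writer, RequiresClusteringAndOrdering):
        return [list(rows)]  # no requirements declared, pass rows through
    cluster_cols = writer.required_clustering()
    order_cols = writer.required_ordering()
    partitions: List[List[Dict[str, object]]] = [[] for _ in range(num_partitions)]
    key_to_partition: Dict[tuple, int] = {}
    for row in rows:
        key = tuple(row[c] for c in cluster_cols)
        # Keys are assigned round-robin in first-seen order, a deterministic
        # stand-in for hash partitioning.
        pid = key_to_partition.setdefault(key, len(key_to_partition) % num_partitions)
        partitions[pid].append(row)
    for part in partitions:
        part.sort(key=lambda r: tuple(r[c] for c in order_cols))
    return partitions
```

Rows sharing a clustering key always land in the same partition, and each partition reaches the writer already sorted.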
>>>>>
>>>>> Spark provides a very simple protocol for users to connect to data
>>>>> sources. A common way to write a DataFrame to a data source is:
>>>>> `df.write.format(...).option(...).mode(...).save()`.
>>>>> Spark passes the options and save mode to the data source, and schedules
>>>>> the write job on the input data. The data source should take care of
>>>>> the metadata, e.g., the JDBC data source can create the table if it
>>>>> doesn't exist, or fail the job and ask users to create the table in the
>>>>> corresponding database first. Data sources can define some options for
>>>>> users to carry metadata information like partitioning/bucketing.
>>>>>
>>>>>
>>>>> The vote will be up for the next 72 hours. Please reply with your vote:
>>>>>
>>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>>> +0: Don't really care.
>>>>> -1: I don't think this is a good idea because of the following
>>>>> technical reasons.
>>>>>
>>>>> Thanks!
>>>>>
>>>>
>>>>
>>>
>>
