This vote passes with 3 binding +1 votes, 5 non-binding +1 votes, and no -1 votes.
Thanks all!

+1 votes (binding):
Wenchen Fan
Reynold Xin
Cheng Lian

+1 votes (non-binding):
Xiao Li
Weichen Xu
Vaquar Khan
Liwei Lin
Dongjoon Hyun

On Tue, Oct 17, 2017 at 12:30 AM, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

> +1
>
> On Sun, Oct 15, 2017 at 11:43 PM, Cheng Lian <lian.cs....@gmail.com> wrote:
>
>> +1
>>
>> On 10/12/17 20:10, Liwei Lin wrote:
>>
>> +1 !
>>
>> Cheers,
>> Liwei
>>
>> On Thu, Oct 12, 2017 at 7:11 PM, vaquar khan <vaquar.k...@gmail.com> wrote:
>>
>>> +1
>>>
>>> Regards,
>>> Vaquar khan
>>>
>>> On Oct 11, 2017 10:14 PM, "Weichen Xu" <weichen...@databricks.com> wrote:
>>>
>>> +1
>>>
>>> On Thu, Oct 12, 2017 at 10:36 AM, Xiao Li <gatorsm...@gmail.com> wrote:
>>>
>>>> +1
>>>>
>>>> Xiao
>>>>
>>>> On Mon, 9 Oct 2017 at 7:31 PM Reynold Xin <r...@databricks.com> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> One thing with MetadataSupport: it's a bad idea to call it that unless
>>>>> adding new functions to that trait wouldn't break source/binary
>>>>> compatibility in the future.
>>>>>
>>>>> On Mon, Oct 9, 2017 at 6:07 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>
>>>>>> I'm adding my own +1 (binding).
>>>>>>
>>>>>> On Tue, Oct 10, 2017 at 9:07 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>
>>>>>>> I'm going to update the proposal: for the last point, although the
>>>>>>> user-facing API (`df.write.format(...).option(...).mode(...).save()`)
>>>>>>> mixes data and metadata operations, we are still able to separate them
>>>>>>> in the data source write API. We can have a mix-in trait
>>>>>>> `MetadataSupport` with a method `create(options)`, so that data sources
>>>>>>> can mix in this trait and provide metadata creation support. Spark will
>>>>>>> call this `create` method inside `DataFrameWriter.save` if the
>>>>>>> specified data source has it.
>>>>>>>
>>>>>>> Note that file-format data sources can ignore this new trait and still
>>>>>>> write data without metadata (they don't have metadata anyway).
>>>>>>>
>>>>>>> With this updated proposal, I'm calling a new vote for the data source
>>>>>>> v2 write path.
>>>>>>>
>>>>>>> The vote will be up for the next 72 hours. Please reply with your vote:
>>>>>>>
>>>>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>>>>> +0: Don't really care.
>>>>>>> -1: I don't think this is a good idea because of the following
>>>>>>> technical reasons.
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> On Tue, Oct 3, 2017 at 12:03 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> Now that we have merged the infrastructure of the data source v2 read
>>>>>>>> path and had some discussion about the write path, I'm sending this
>>>>>>>> email to call a vote for the Data Source v2 write path.
>>>>>>>>
>>>>>>>> The full document of the Data Source API V2 is:
>>>>>>>> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit
>>>>>>>>
>>>>>>>> The ready-for-review PR that implements the basic infrastructure for
>>>>>>>> the write path:
>>>>>>>> https://github.com/apache/spark/pull/19269
>>>>>>>>
>>>>>>>> The Data Source V1 write path asks implementations to write a
>>>>>>>> DataFrame directly, which is painful:
>>>>>>>> 1. Exposing an upper-level API like DataFrame to the Data Source API
>>>>>>>> is bad for maintenance.
>>>>>>>> 2. Data sources may need to preprocess the input data before writing,
>>>>>>>> e.g., cluster/sort the input by some columns. It's better to do the
>>>>>>>> preprocessing in Spark than in the data source.
>>>>>>>> 3. Data sources need to take care of transactions themselves, which
>>>>>>>> is hard. And different data sources may come up with very similar
>>>>>>>> approaches to transactions, which leads to a lot of duplicated code.
>>>>>>>>
>>>>>>>> To solve these pain points, I'm proposing the data source v2 writing
>>>>>>>> framework, which is very similar to the reading framework, i.e.,
>>>>>>>> WriteSupport -> DataSourceV2Writer -> DataWriterFactory -> DataWriter.
>>>>>>>>
>>>>>>>> The Data Source V2 write path follows the existing FileCommitProtocol
>>>>>>>> and has task/job-level commit/abort, so that data sources can
>>>>>>>> implement transactions more easily.
>>>>>>>>
>>>>>>>> We can create a mix-in trait for DataSourceV2Writer to specify
>>>>>>>> requirements on the input data, like clustering and ordering.
>>>>>>>>
>>>>>>>> Spark provides a very simple protocol for users to connect to data
>>>>>>>> sources. A common way to write a dataframe to a data source is
>>>>>>>> `df.write.format(...).option(...).mode(...).save()`. Spark passes the
>>>>>>>> options and save mode to the data source and schedules the write job
>>>>>>>> on the input data. The data source should take care of the metadata,
>>>>>>>> e.g., the JDBC data source can create the table if it doesn't exist,
>>>>>>>> or fail the job and ask users to create the table in the
>>>>>>>> corresponding database first. Data sources can define options for
>>>>>>>> users to carry metadata information like partitioning/bucketing.
>>>>>>>>
>>>>>>>> The vote will be up for the next 72 hours. Please reply with your
>>>>>>>> vote:
>>>>>>>>
>>>>>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>>>>>> +0: Don't really care.
>>>>>>>> -1: I don't think this is a good idea because of the following
>>>>>>>> technical reasons.
>>>>>>>>
>>>>>>>> Thanks!
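
For readers skimming the thread, a minimal Scala sketch of the `MetadataSupport` mix-in Wenchen describes above. Only the trait name and the `create(options)` method come from the proposal; the marker trait and all signatures are assumptions made for illustration, not the code in the PR.

    // Hypothetical sketch: only the name MetadataSupport and the method
    // create(options) are from the proposal; everything else is assumed.

    trait DataSourceV2 // stand-in marker trait for a v2 data source

    trait MetadataSupport { self: DataSourceV2 =>
      // Spark would call this inside DataFrameWriter.save, before writing
      // any data; e.g. a JDBC source could run CREATE TABLE IF NOT EXISTS.
      def create(options: Map[String, String]): Unit
    }

    // A file-format source simply doesn't mix in MetadataSupport and keeps
    // writing data without any metadata step.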
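Likewise, a rough sketch of the proposed write-path chain, including the job/task-level commit/abort and the input-requirement mix-in mentioned in the original proposal. The four names WriteSupport, DataSourceV2Writer, DataWriterFactory, and DataWriter are from the email; every signature below is an illustrative assumption, not the actual API in PR #19269.

    // Illustrative sketch under assumed signatures.

    trait WriterCommitMessage extends Serializable

    trait WriteSupport {
      // Entry point on the driver; returns a writer for this job's options.
      def createWriter(options: Map[String, String]): DataSourceV2Writer
    }

    trait DataSourceV2Writer {
      // The factory is serialized and sent to executors.
      def createWriterFactory(): DataWriterFactory

      // Job-level commit/abort, called on the driver once all tasks finish,
      // mirroring the task/job split of FileCommitProtocol.
      def commit(messages: Seq[WriterCommitMessage]): Unit
      def abort(messages: Seq[WriterCommitMessage]): Unit
    }

    trait DataWriterFactory extends Serializable {
      def createWriter(partitionId: Int, attemptNumber: Int): DataWriter
    }

    trait DataWriter {
      def write(record: Seq[Any]): Unit // Seq[Any] stands in for Spark's row type
      def commit(): WriterCommitMessage // task-level commit
      def abort(): Unit                 // task-level rollback
    }

    // One possible shape for the "requirement for input data" mix-in: Spark
    // could read these and cluster/sort the input before invoking writers.
    trait SupportsInputRequirements { self: DataSourceV2Writer =>
      def requiredClustering: Seq[String] = Nil
      def requiredOrdering: Seq[String] = Nil
    }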
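Finally, how the user-facing protocol quoted above would drive all of this, as a runnable snippet. The format class name and option key are placeholders, not a real data source; everything else is the stock DataFrame writer API.

    import org.apache.spark.sql.SparkSession

    object V2WriteExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("datasource-v2-write-demo")
          .master("local[*]")
          .getOrCreate()

        val df = spark.range(100).toDF("id")

        // Spark passes the options and save mode to the data source and
        // schedules the write job; a source mixing in MetadataSupport could
        // create the target table inside save() before any data is written.
        df.write
          .format("com.example.MyV2Source") // placeholder class name
          .option("table", "demo_table")    // placeholder option
          .mode("append")
          .save()

        spark.stop()
      }
    }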