+1

Xiao

On Mon, 9 Oct 2017 at 7:31 PM Reynold Xin <r...@databricks.com> wrote:

> +1
>
> One thing with MetadataSupport - It's a bad idea to call it that unless
> adding new functions in that trait wouldn't break source/binary
> compatibility in the future.
>
>
> On Mon, Oct 9, 2017 at 6:07 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> I'm adding my own +1 (binding).
>>
>> On Tue, Oct 10, 2017 at 9:07 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> I'm going to update the proposal: for the last point, although the
>>> user-facing API (`df.write.format(...).option(...).mode(...).save()`)
>>> mixes data and metadata operations, we are still able to separate them
>>> in the data source write API. We can have a mix-in trait
>>> `MetadataSupport` which has a method `create(options)`, so that data
>>> sources can mix in this trait and provide metadata creation support.
>>> Spark will call this `create` method inside `DataFrameWriter.save` if
>>> the specified data source has it.
>>>
>>> Note that file format data sources can ignore this new trait and still
>>> write data without metadata (they don't have metadata anyway).
>>>
>>> With this updated proposal, I'm calling a new vote for the data source
>>> v2 write path.
>>>
>>> The vote will be up for the next 72 hours. Please reply with your vote:
>>>
>>> +1: Yeah, let's go forward and implement the SPIP.
>>> +0: Don't really care.
>>> -1: I don't think this is a good idea because of the following technical
>>> reasons.
>>>
>>> Thanks!
>>>
>>> On Tue, Oct 3, 2017 at 12:03 AM, Wenchen Fan <cloud0...@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Now that we have merged the infrastructure for the data source v2 read
>>>> path and have had some discussion of the write path, I'm sending this
>>>> email to call a vote for the Data Source v2 write path.
>>>>
>>>> The full document of the Data Source API V2 is:
>>>> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit
>>>>
>>>> The ready-for-review PR that implements the basic infrastructure for
>>>> the write path:
>>>> https://github.com/apache/spark/pull/19269
>>>>
>>>>
>>>> The Data Source V1 write path asks implementations to write a DataFrame
>>>> directly, which is painful:
>>>> 1. Exposing an upper-level API like DataFrame to the Data Source API is
>>>> not good for maintenance.
>>>> 2. Data sources may need to preprocess the input data before writing,
>>>> such as clustering/sorting the input by some columns. It's better to do
>>>> the preprocessing in Spark instead of in the data source.
>>>> 3. Data sources need to take care of transactions themselves, which is
>>>> hard. And different data sources may come up with very similar
>>>> approaches to transactions, which leads to a lot of duplicated code.
>>>>
>>>> To solve these pain points, I'm proposing the data source v2 writing
>>>> framework, which is very similar to the reading framework, i.e.,
>>>> WriteSupport -> DataSourceV2Writer -> DataWriterFactory -> DataWriter.
>>>>
>>>> The Data Source V2 write path follows the existing FileCommitProtocol
>>>> and has task/job-level commit/abort, so that data sources can implement
>>>> transactions more easily.
>>>>
>>>> We can create a mix-in trait for DataSourceV2Writer to specify the
>>>> requirements for input data, like clustering and ordering.
>>>>
>>>> Spark provides a very simple protocol for users to connect to data
>>>> sources. A common way to write a dataframe to data sources:
>>>> `df.write.format(...).option(...).mode(...).save()`.
>>>> Spark passes the options and save mode to data sources, and schedules
>>>> the write job on the input data. And the data source should take care
>>>> of the metadata, e.g., the JDBC data source can create the table if it
>>>> doesn't exist, or fail the job and ask users to create the table in the
>>>> corresponding database first. Data sources can define some options for
>>>> users to carry some metadata information like partitioning/bucketing.
>>>>
>>>>
>>>> The vote will be up for the next 72 hours. Please reply with your vote:
>>>>
>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>> +0: Don't really care.
>>>> -1: I don't think this is a good idea because of the following
>>>> technical reasons.
>>>>
>>>> Thanks!
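
For anyone reading the thread without the design doc open, the flow described above (WriteSupport -> DataSourceV2Writer -> DataWriterFactory -> DataWriter, task/job-level commit/abort, and the optional `MetadataSupport.create(options)` hook) could be sketched roughly as below. This is only an illustration under stated assumptions, not the interfaces from the PR: it is plain Python for brevity, and `InMemorySource`, `save`, the snake_case method names, and the use of row lists as commit messages are all hypothetical.

```python
from typing import Dict, List, Optional


class MetadataSupport:
    """Optional mix-in: a source that can create its own metadata."""

    def create(self, options: Dict[str, str]) -> None:
        raise NotImplementedError


class DataWriter:
    """Writes one partition; runs inside a single write task."""

    def __init__(self, partition_id: int):
        self.partition_id = partition_id
        self.rows: List[str] = []

    def write(self, row: Optional[str]) -> None:
        if row is None:  # stand-in for any task-side failure
            raise ValueError("cannot write a null row")
        self.rows.append(row)

    def commit(self) -> List[str]:
        # Task-level commit: hand the staged rows back as a commit message.
        return self.rows

    def abort(self) -> None:
        # Task-level abort: discard anything this task staged.
        self.rows = []


class DataWriterFactory:
    """Sent to each task to create that task's DataWriter."""

    def create_data_writer(self, partition_id: int) -> DataWriter:
        return DataWriter(partition_id)


class DataSourceV2Writer:
    """Job-level handle: creates the factory, then commits or aborts the job."""

    def __init__(self, table: List[str]):
        self.table = table

    def create_writer_factory(self) -> DataWriterFactory:
        return DataWriterFactory()

    def commit(self, messages: List[List[str]]) -> None:
        # Job-level commit: publish every task's output at once.
        for rows in messages:
            self.table.extend(rows)

    def abort(self, messages: List[List[str]]) -> None:
        # Job-level abort: nothing was published, so nothing to undo here.
        pass


class InMemorySource(MetadataSupport):
    """Hypothetical source playing the WriteSupport role, plus metadata."""

    def __init__(self) -> None:
        self.tables: Dict[str, List[str]] = {}

    def create(self, options: Dict[str, str]) -> None:
        # Metadata step: create the table if it does not exist yet.
        self.tables.setdefault(options["table"], [])

    def create_writer(self, options: Dict[str, str]) -> DataSourceV2Writer:
        return DataSourceV2Writer(self.tables[options["table"]])


def save(source, options, partitions) -> None:
    """Stand-in for what Spark would do inside DataFrameWriter.save."""
    if isinstance(source, MetadataSupport):
        source.create(options)  # metadata step, only if the source opts in
    writer = source.create_writer(options)
    factory = writer.create_writer_factory()
    messages = []
    try:
        for partition_id, rows in enumerate(partitions):
            task = factory.create_data_writer(partition_id)
            try:
                for row in rows:
                    task.write(row)
                messages.append(task.commit())
            except Exception:
                task.abort()
                raise
        writer.commit(messages)
    except Exception:
        writer.abort(messages)
        raise
```

In this sketch, calling `save(InMemorySource(), {"table": "t"}, partitions)` first creates the table through the mix-in and then publishes rows only when every task has committed. If any task fails, the job is aborted and the table stays empty, which is the kind of transactional behaviour the task/job-level commit/abort is meant to make easier for data sources.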