I'm going to update the proposal: for the last point, although the
user-facing API (`df.write.format(...).option(...).mode(...).save()`) mixes
data and metadata operations, we can still separate them in the data source
write API. We can have a mix-in trait `MetadataSupport` with a method
`create(options)`, so that data sources can mix in this trait to provide
metadata creation support. Spark will call this `create` method inside
`DataFrameWriter.save` if the specified data source implements it.

Note that file format data sources can ignore this new trait and still
write data without metadata (they don't have metadata anyway).
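
To make this concrete, here is a rough sketch of what the trait could look
like (the exact parameter type is still open, so a plain Map stands in for
the options here):

trait MetadataSupport {
  // Create the metadata for this data source (e.g. the target table in a
  // database) using the user-specified options. Spark would call this
  // inside DataFrameWriter.save, before scheduling the write job, if the
  // data source mixes in this trait.
  def create(options: Map[String, String]): Unit
}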

With this updated proposal, I'm calling a new vote for the data source v2
write path.

The vote will be up for the next 72 hours. Please reply with your vote:

+1: Yeah, let's go forward and implement the SPIP.
+0: Don't really care.
-1: I don't think this is a good idea because of the following technical
reasons.

Thanks!

On Tue, Oct 3, 2017 at 12:03 AM, Wenchen Fan <cloud0...@gmail.com> wrote:

> Hi all,
>
> Now that we have merged the infrastructure of the data source v2 read
> path and had some discussion about the write path, I'm sending this email
> to call a vote for the Data Source V2 write path.
>
> The full document of the Data Source API V2 is:
> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit
>
> The ready-for-review PR that implements the basic infrastructure for the
> write path:
> https://github.com/apache/spark/pull/19269
>
>
> The Data Source V1 write path asks implementations to write a DataFrame
> directly, which is painful:
> 1. Exposing an upper-level API like DataFrame to the Data Source API is
> bad for maintenance.
> 2. Data sources may need to preprocess the input data before writing,
> e.g., clustering/sorting the input by some columns. It's better to do this
> preprocessing in Spark than in each data source.
> 3. Data sources need to handle transactions themselves, which is hard, and
> different data sources may come up with very similar transaction handling,
> which leads to a lot of duplicated code.
>
> To solve these pain points, I'm proposing the data source v2 write
> framework, which is very similar to the read framework, i.e.,
> WriteSupport -> DataSourceV2Writer -> DataWriterFactory -> DataWriter.
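>
> To make the chain concrete, here is a rough Scala sketch (the names and
> signatures below are illustrative, with a plain Map standing in for the
> options class; the PR is the source of truth):
>
> import java.util.Optional
> import org.apache.spark.sql.{Row, SaveMode}
> import org.apache.spark.sql.types.StructType
>
> trait WriterCommitMessage extends Serializable
>
> trait WriteSupport {
>   // Called on the driver when the user triggers a write.
>   def createWriter(schema: StructType, mode: SaveMode,
>       options: Map[String, String]): Optional[DataSourceV2Writer]
> }
>
> trait DataSourceV2Writer {
>   // The factory is serialized and sent to executors.
>   def createWriterFactory(): DataWriterFactory[Row]
>   // Job-level commit/abort on the driver, given the messages returned by
>   // the task-level commits.
>   def commit(messages: Array[WriterCommitMessage]): Unit
>   def abort(messages: Array[WriterCommitMessage]): Unit
> }
>
> trait DataWriterFactory[T] extends Serializable {
>   // Called on executors, once per partition (and attempt).
>   def createWriter(partitionId: Int, attemptNumber: Int): DataWriter[T]
> }
>
> trait DataWriter[T] {
>   def write(record: T): Unit
>   def commit(): WriterCommitMessage   // task-level commit
>   def abort(): Unit                   // task-level abort
> }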
>
> The Data Source V2 write path follows the existing FileCommitProtocol and
> has task/job-level commit/abort, so that data sources can implement
> transactions more easily.
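>
> As a toy example of how a source could use these hooks (continuing the
> sketch above; everything here is illustrative), each task could write to a
> staging location, and the job-level commit would then publish all the
> staged output at once, while a job-level abort would delete it:
>
> // Each task stages its output and reports where it went.
> case class StagedOutput(path: String) extends WriterCommitMessage
>
> class StagingDataWriter(stagingPath: String) extends DataWriter[Row] {
>   private val buffer = scala.collection.mutable.ArrayBuffer.empty[Row]
>
>   override def write(record: Row): Unit = buffer += record
>
>   override def commit(): WriterCommitMessage = {
>     // ... flush `buffer` to `stagingPath` (elided) ...
>     StagedOutput(stagingPath)
>   }
>
>   // Nothing has been published yet, so aborting just drops the staged data.
>   override def abort(): Unit = buffer.clear()
> }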
>
> We can create a mix-in trait for DataSourceV2Writer to specify
> requirements on the input data, like clustering and ordering.
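>
> For example (the trait name and methods below are made up for
> illustration, not the final API):
>
> trait SupportsInputRequirements {
>   // Ask Spark to cluster the input by these columns before writing...
>   def requiredClustering(): Array[String]
>   // ...and to sort the rows within each task by these columns.
>   def requiredOrdering(): Array[String]
> }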
>
> Spark provides a very simple protocol for users to connect to data
> sources. A common way to write a DataFrame to a data source is
> `df.write.format(...).option(...).mode(...).save()`.
> Spark passes the options and save mode to the data source and schedules
> the write job on the input data. The data source should take care of the
> metadata, e.g., the JDBC data source can create the table if it doesn't
> exist, or fail the job and ask users to create the table in the
> corresponding database first. Data sources can also define options that
> let users pass metadata information like partitioning/bucketing.
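>
> For example, a write to a hypothetical JDBC-like v2 source could look
> like this (the format name and option keys are made up; each source
> defines its own):
>
> df.write
>   .format("com.example.jdbc")                  // the data source to use
>   .option("url", "jdbc:postgresql://host/db")  // connection info
>   .option("dbtable", "events")                 // which table to write
>   .option("partitionBy", "date")               // source-defined metadata option
>   .mode("append")
>   .save()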
>
>
> The vote will be up for the next 72 hours. Please reply with your vote:
>
> +1: Yeah, let's go forward and implement the SPIP.
> +0: Don't really care.
> -1: I don't think this is a good idea because of the following technical
> reasons.
>
> Thanks!
>
