Hi all,

Now that we have merged the infrastructure of the Data Source V2 read path
and had some discussion about the write path, I'm sending this email to call
a vote for the Data Source V2 write path.

The full design document for the Data Source API V2:
https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit

The ready-for-review PR that implements the basic infrastructure for the
write path:
https://github.com/apache/spark/pull/19269


The Data Source V1 write path asks implementations to write a DataFrame
directly, which is painful:
1. Exposing an upper-level API like DataFrame to the Data Source API is bad
for maintenance.
2. Data sources may need to preprocess the input data before writing, e.g.,
cluster/sort the input by some columns. It's better to do this preprocessing
in Spark than in every data source.
3. Data sources need to handle transactions themselves, which is hard.
Different data sources are likely to invent very similar transaction
mechanisms, which leads to a lot of duplicated code.

To solve these pain points, I'm proposing a Data Source V2 write framework
that is very similar to the read framework, i.e.,
WriteSupport -> DataSourceV2Writer -> DataWriterFactory -> DataWriter.
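
As a rough Scala sketch of that chain (simplified for this email; the exact
types and signatures in the PR may differ, e.g. the PR has its own options
class):

```scala
import org.apache.spark.sql.{Row, SaveMode}
import org.apache.spark.sql.types.StructType

// Mixed into a data source to declare that it supports the V2 write path.
trait WriteSupport {
  // Called on the driver with the input schema, the save mode, and the
  // user-supplied options; returns the job-level writer.
  def createWriter(
      schema: StructType,
      mode: SaveMode,
      options: Map[String, String]): DataSourceV2Writer
}

// Job-level writer; lives on the driver.
trait DataSourceV2Writer {
  def createWriterFactory(): DataWriterFactory
  def commit(messages: Array[WriterCommitMessage]): Unit // job-level commit
  def abort(messages: Array[WriterCommitMessage]): Unit  // job-level abort
}

// Serializable factory shipped to executors; one DataWriter per task.
trait DataWriterFactory extends Serializable {
  def createWriter(partitionId: Int, attemptNumber: Int): DataWriter
}

// Task-level writer; lives on executors.
trait DataWriter {
  def write(record: Row): Unit
  def commit(): WriterCommitMessage // task-level commit
  def abort(): Unit                 // task-level abort
}

// Opaque, serializable message a committed task sends back to the driver.
trait WriterCommitMessage extends Serializable
```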

The Data Source V2 write path follows the existing FileCommitProtocol and
has both task-level and job-level commit/abort, so that data sources can
implement transactions more easily.
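
To illustrate how the two levels compose, here is a minimal sketch of a
file-based sink built on the traits sketched above: each task stages its
output to a hidden file, and the driver publishes everything only at
job-level commit, so a failed job leaves no visible output. All names and
the staging scheme are illustrative, not part of the proposal:

```scala
import java.nio.file.{Files, Paths, StandardCopyOption}
import org.apache.spark.sql.Row

// Message a successful task sends to the driver: where its staged file is.
case class StagedFileMessage(path: String) extends WriterCommitMessage

// Task-level writer: writes to a staging file, so abort can simply drop it.
class StagingDataWriter(dir: String, partitionId: Int, attemptNumber: Int)
    extends DataWriter {
  private val staging = Paths.get(dir, s"part-$partitionId-$attemptNumber.staging")
  private val out = Files.newBufferedWriter(staging)

  override def write(record: Row): Unit = {
    out.write(record.mkString(","))
    out.newLine()
  }

  // Task-level commit: close the file and report its path to the driver.
  override def commit(): WriterCommitMessage = {
    out.close()
    StagedFileMessage(staging.toString)
  }

  // Task-level abort: discard this attempt's partial output.
  override def abort(): Unit = {
    out.close()
    Files.deleteIfExists(staging)
  }
}

class StagingWriterFactory(dir: String) extends DataWriterFactory {
  override def createWriter(partitionId: Int, attemptNumber: Int): DataWriter =
    new StagingDataWriter(dir, partitionId, attemptNumber)
}

// Job-level writer on the driver.
class StagingWriter(dir: String) extends DataSourceV2Writer {
  override def createWriterFactory(): DataWriterFactory =
    new StagingWriterFactory(dir)

  // Job-level commit: every task succeeded, so publish all staged files.
  override def commit(messages: Array[WriterCommitMessage]): Unit =
    messages.foreach { case StagedFileMessage(p) =>
      val staged = Paths.get(p)
      val target = staged.resolveSibling(
        staged.getFileName.toString.stripSuffix(".staging"))
      Files.move(staged, target, StandardCopyOption.ATOMIC_MOVE)
    }

  // Job-level abort: some task failed, so drop everything that was staged.
  override def abort(messages: Array[WriterCommitMessage]): Unit =
    messages.foreach { case StagedFileMessage(p) =>
      Files.deleteIfExists(Paths.get(p))
    }
}
```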

We can create a mix-in trait for DataSourceV2Writer to specify requirements
on the input data, such as clustering and ordering.
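
One hypothetical shape for such a trait (not part of the PR; the name and
methods here are made up for illustration):

```scala
// Hypothetical mix-in: a DataSourceV2Writer that extends it asks Spark to
// cluster and sort the input before handing it to the data writers.
trait SupportsWriteRequirements { this: DataSourceV2Writer =>
  // Columns the input should be clustered by across tasks.
  def requiredClustering: Seq[String]
  // Columns each task's input should be sorted by.
  def requiredOrdering: Seq[String]
}
```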

Spark provides a very simple protocol for users to connect to data sources.
The common way to write a DataFrame to a data source is
`df.write.format(...).option(...).mode(...).save()`.
Spark passes the options and the save mode to the data source and schedules
the write job on the input data. The data source should take care of its own
metadata; e.g., the JDBC data source can create the table if it doesn't
exist, or fail the job and ask users to create the table in the
corresponding database first. Data sources can also define options that let
users pass metadata such as partitioning/bucketing.
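
For example, a write against a hypothetical V2 JDBC source might look like
this (the format name and option keys are illustrative, not a fixed
contract):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]").appName("v2-write-demo").getOrCreate()
val df = spark.range(100).selectExpr("id", "current_date() AS dt")

df.write
  .format("com.example.sql.JdbcV2") // hypothetical V2 data source class
  .option("url", "jdbc:postgresql://localhost:5432/testdb")
  .option("table", "events")
  .option("partitionBy", "dt")      // metadata carried as a plain option
  .mode("append")                   // save mode, passed through to the source
  .save()
```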


The vote will be up for the next 72 hours. Please reply with your vote:

+1: Yeah, let's go forward and implement the SPIP.
+0: Don't really care.
-1: I don't think this is a good idea because of the following technical
reasons.

Thanks!
