Hi all,

I want to have some discussion about the Data Source V2 write path before
starting a vote.

The Data Source V1 write path asks implementations to write a DataFrame
directly, which is painful:
1. Exposing an upper-level API like DataFrame to the Data Source API is
not good for maintenance.
2. Data sources may need to preprocess the input data before writing,
e.g., cluster/sort the input by some columns. It's better to do the
preprocessing in Spark instead of in the data source.
3. Data sources need to take care of transactions themselves, which is
hard. And different data sources may come up with very similar approaches
to transactions, which leads to a lot of duplicated code.


To solve these pain points, I'm proposing a data source writing framework
which is very similar to the reading framework, i.e., WriteSupport ->
DataSourceV2Writer -> WriteTask -> DataWriter. You can take a look at my
prototype to see what it looks like:
https://github.com/apache/spark/pull/19269
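
To make the shape of the framework a bit more concrete, here is a rough
Scala sketch of how these four interfaces could fit together. The
interface names are the ones above; the method signatures are just my
assumptions for illustration, the prototype PR is the source of truth:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

// Marker for the result a successful write task sends back to the driver.
trait WriterCommitMessage extends Serializable

// Mixed into a DataSourceV2 implementation to declare that it supports writes.
trait WriteSupport {
  def createWriter(schema: StructType, options: Map[String, String]): DataSourceV2Writer
}

// Lives on the driver: creates write tasks and does job-level commit/abort.
trait DataSourceV2Writer {
  def createWriteTask(): WriteTask
  def commit(messages: Seq[WriterCommitMessage]): Unit
  def abort(messages: Seq[WriterCommitMessage]): Unit
}

// Created on the driver and shipped to executors, so it must be serializable;
// produces one DataWriter per partition (and per task attempt).
trait WriteTask extends Serializable {
  def createDataWriter(partitionId: Int, attemptNumber: Int): DataWriter
}

// Runs on executors: writes rows and does task-level commit/abort.
trait DataWriter {
  def write(row: Row): Unit
  def commit(): WriterCommitMessage
  def abort(): Unit
}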

There are some other details that need further discussion:
1. *partitioning/bucketing*
Currently only the built-in file-based data sources support them, but
there is nothing stopping us from exposing them to all data sources. One
question is: shall we make them mix-in interfaces for the data source v2
reader/writer, or just encode them into the data source options (a
string-to-string map)? Ideally they feel more like options: Spark just
passes this user-given information to the data source and does nothing
else with it.
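
To illustrate the trade-off, here is a hypothetical sketch of both
approaches; none of the option keys or trait/method names below are part
of the proposal, they only show what each direction could look like:

import org.apache.spark.sql.DataFrame

// (a) Encode partitioning/bucketing into the existing string-to-string
// options map. Spark just forwards the user-given information to the
// data source, which parses it itself. The option keys are hypothetical.
def writeViaOptions(df: DataFrame): Unit = {
  df.write
    .format("com.example.source")             // hypothetical data source
    .option("partitionColumns", "year,month")
    .option("bucketColumns", "user_id")
    .option("numBuckets", "8")
    .save("/tmp/output")
}

// (b) A dedicated mix-in interface for the v2 writer, so the contract is
// typed and discoverable rather than a naming convention inside options.
trait SupportsPartitioning {
  // Spark would call this with the user-specified partitioning/bucketing.
  def setPartitioning(partitionColumns: Seq[String],
                      bucketColumns: Seq[String],
                      numBuckets: Int): Unit
}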

2. *input data requirement*
Data sources should be able to ask Spark to preprocess the input data,
and this can be a mix-in interface for DataSourceV2Writer. I think we
need to support requesting clustering and requesting sorting within
partitions; is there anything else?
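
As an illustration, such a mix-in could look roughly like this (the trait
and method names are made up):

// Hypothetical mix-in for DataSourceV2Writer. Spark would check for it
// during planning and insert the required shuffle/sort before invoking
// the writer.
trait SupportsInputRequirement {
  // Ask Spark to cluster (hash-partition) the input by these columns, so
  // that all rows with the same clustering values go to the same write task.
  def requiredClustering: Seq[String]

  // Ask Spark to sort the rows within each partition by these columns
  // before handing them to the DataWriter.
  def requiredOrdering: Seq[String]
}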

3. *transaction*
I think we can just follow `FileCommitProtocol`, the internal framework
Spark uses to guarantee transactional writes for the built-in file-based
data sources. Generally speaking, we need task-level and job-level
commit/abort. Again, you can see more details about it in my prototype:
https://github.com/apache/spark/pull/19269
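
Using the interface sketch from earlier in this mail, the intended flow
would be roughly as follows (speculative tasks, retries and cleanup are
heavily simplified; this only shows where the task-level and job-level
commit/abort calls sit):

import org.apache.spark.TaskContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Simplified driver-side logic, ignoring the details that
// FileCommitProtocol handles for the built-in file-based sources.
def runWriteJob(writer: DataSourceV2Writer, input: RDD[Row]): Unit = {
  // The WriteTask is created on the driver and shipped to the executors.
  val writeTask = writer.createWriteTask()

  val messages: Array[WriterCommitMessage] =
    try {
      input.sparkContext.runJob(input, (ctx: TaskContext, rows: Iterator[Row]) => {
        val dataWriter = writeTask.createDataWriter(ctx.partitionId(), ctx.attemptNumber())
        try {
          rows.foreach(dataWriter.write)
          dataWriter.commit()        // task-level commit: report success to the driver
        } catch {
          case t: Throwable =>
            dataWriter.abort()       // task-level abort: clean up partial output
            throw t
        }
      })
    } catch {
      case t: Throwable =>
        writer.abort(Nil)            // job-level abort if the job ultimately fails
        throw t
    }

  writer.commit(messages.toSeq)      // job-level commit with all task messages
}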

4. *data source table*
This is the trickiest one. In Spark you can create a table which points
to a data source, so you can read/write this data source easily by
referencing the table name. Ideally a data source table is just a
pointer to a data source with a list of predefined options, to save
users from typing these options again and again for each query.
If that were all, everything would be fine and we wouldn't need to add
more interfaces to Data Source V2. However, data source tables also
provide special operators like ALTER TABLE SCHEMA, ADD PARTITION, etc.,
which require data sources to have some extra abilities.
Currently these special operators only work for the built-in file-based
data sources, and I don't think we will extend them in the near future,
so I propose to mark them as out of scope.


Any comments are welcome!
Thanks,
Wenchen