Folks,

As you may know, we still use the V1 API because of the flexibility it gives
us to further transform the dataframe after one calls `df.write.format()`,
which is how we implement a fully featured write pipeline with precombining,
indexing, and custom partitioning. The V2 API takes this away and instead
provides a very restrictive, partition-level write interface that simply
hands the source an Iterator<InternalRow>.
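
For concreteness, here is a rough sketch of the hook each API gives a
connector. This is not Hudi's actual code; class names, field names, and the
writer logic are illustrative only.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider}
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.write.{DataWriter, WriterCommitMessage}

// V1: createRelation() hands the source the full DataFrame, so arbitrary
// transformations (precombine dedupe, index lookups, repartitioning) can run
// before the physical write happens.
class V1StyleSource extends CreatableRelationProvider {
  override def createRelation(
      ctx: SQLContext,
      mode: SaveMode,
      parameters: Map[String, String],
      df: DataFrame): BaseRelation = {
    // Illustrative precombine-style step; the column name is made up.
    val prepared = df.dropDuplicates("record_key")
    // ... index tagging, custom partitioning, then the actual write ...
    new BaseRelation {
      override def sqlContext: SQLContext = ctx
      override def schema: StructType = prepared.schema
    }
  }
}

// V2 (Spark 3): the connector only ever sees one partition's rows through
// DataWriter[InternalRow]; there is no place to reshape the whole DataFrame.
class V2StyleWriter extends DataWriter[InternalRow] {
  override def write(record: InternalRow): Unit = {
    // row-at-a-time, partition-local; dataset-wide precombining or index
    // lookups cannot be expressed at this level
  }
  override def commit(): WriterCommitMessage = new WriterCommitMessage {}
  override def abort(): Unit = ()
  override def close(): Unit = ()
}
```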

That said, V2 has evolved (again) in Spark 3, and we would like to move to
the V2 APIs at some point, for both querying and writing. This thread
summarizes a few approaches we can take.

*Option 1: Introduce a pre-write hook in the Spark Datasource API*
If the datasource API provided a simple way to further transform dataframes
after the `df.write.format()` call, we would be able to move much of our
HoodieSparkSQLWriter logic into that hook and make the transition.
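
To make the ask concrete, the hook could be as simple as the trait below.
This is purely hypothetical; nothing like it exists in Spark's DataSource V2
API today, and the name and signature are made up for illustration.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical mix-in a write-capable table could implement; Spark would
// invoke it on the incoming DataFrame before planning the physical write,
// letting the source reshape the data freely.
trait SupportsPreWriteTransform {
  def transformBeforeWrite(df: DataFrame, options: Map[String, String]): DataFrame
}
```

Hudi could then run the precombine/index/repartition steps that currently
live in HoodieSparkSQLWriter inside transformBeforeWrite() and hand the
result back to the normal V2 write path.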

Sivabalan has engaged with the Spark community on this, without luck so far.
If anyone can help revive that discussion or make a more successful attempt,
please chime in.

*Option 2: Introduce a new datasource hudi_v2 for Spark Datasource +
HoodieSparkClient API*

We would limit datasource writes to simply bulk_inserts or
insert_overwrites. All other write operations would be supported via a new
HoodieSparkClient API (similar to the write clients we already have, but
working with Dataset<Row>). Queries would be supported on the V2 APIs. This
would be done only for Spark 3.
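
A rough sketch of what the user-facing split might look like (the hudi_v2
name comes from this proposal; HoodieSparkClient and its methods are
hypothetical, and the option key simply mirrors today's datasource options):

```scala
import org.apache.spark.sql.{Dataset, Row, SparkSession}

val spark = SparkSession.builder().appName("hudi-v2-sketch").getOrCreate()
val df: Dataset[Row] = spark.read.parquet("/tmp/source")

// Transformation-free operations stay on the datasource:
df.write.format("hudi_v2")
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .save("/tmp/hudi/table")

// Upserts, deletes, etc. would go through the proposed client (hypothetical):
// val client = new HoodieSparkClient(spark, "/tmp/hudi/table", writeConfig)
// client.upsert(df, instantTime)
```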

We would still keep the current V1 support for as long as Spark supports it.
Obviously, users would have to migrate pipelines to hudi_v2 at some point if
datasource V1 support is dropped.

My concern is that having two datasources could cause greater confusion for
users.

There may be other options that I did not list out here. Please add them.

Thanks
Vinoth
