Folks,

As you may know, we still use the Datasource V1 API, given the flexibility it offers to further transform the dataframe after one calls `df.write.format()`, which lets us implement a fully featured write pipeline with precombining, indexing, and custom partitioning. The V2 API takes this away and instead provides a much more restrictive, partition-level write interface that simply hands each task an `Iterator<InternalRow>`.
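To make the contrast concrete, here is a rough sketch of the two interfaces. The class names, column names, and transforms below are illustrative, not Hudi's actual code:

```scala
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider}

// V1: the source is handed the *whole* DataFrame, so it is free to
// precombine, do index lookups and repartition before any files are written.
class V1StyleSource extends CreatableRelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      mode: SaveMode,
      parameters: Map[String, String],
      df: DataFrame): BaseRelation = {
    val prepared = df
      .dropDuplicates("record_key")           // e.g. precombining (illustrative column)
      .repartition(col("partition_path"))     // e.g. custom partitioning (illustrative column)
    // ... write `prepared`, then return a BaseRelation over the written table
    ???
  }
}

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.write.{DataWriter, WriterCommitMessage}

// V2 (Spark 3): each task only gets a DataWriter that is fed the rows of its
// own partition, one at a time; nothing in this API ever sees the full Dataset.
class V2StyleWriter extends DataWriter[InternalRow] {
  override def write(record: InternalRow): Unit = { /* row-at-a-time, partition-local */ }
  override def commit(): WriterCommitMessage = ???
  override def abort(): Unit = ()
  override def close(): Unit = ()
}
```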
That said, V2 has evolved (again) in Spark 3 and we would like to move to the V2 APIs at some point, for both querying and writing. This thread summarizes a few approaches we can take.

*Option 1 : Introduce a pre-write hook in the Spark Datasource API*

If the datasource API provided a simple way to further transform dataframes after the call to df.write.format, we would be able to move much of our HoodieSparkSQLWriter logic into that hook and make the transition (see the rough sketch at the end of this mail). Sivabalan engaged with the Spark community around this, without luck. Anyone who can help revive this or make a more successful attempt, please chime in.

*Option 2 : Introduce a new datasource hudi_v2 for Spark Datasource + a HoodieSparkClient API*

We would limit datasource writes to just bulk_inserts or insert_overwrites. All other write operations would be supported via a new HoodieSparkClient API (similar to all the write clients we have, but working with Dataset<Row>; a rough sketch is also at the end of this mail). Queries would be supported on the V2 APIs. This would be done only for Spark 3. We would still keep the current V1 support for as long as Spark supports it. Obviously, users would have to migrate their pipelines to hudi_v2 at some point, if datasource V1 support is dropped. My concern is that having two datasources causes greater confusion for users.

Maybe there are other options that I did not list here. Please add them.

Thanks
Vinoth
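P.S. For concreteness, here is roughly what the Option 1 hook could look like. This is purely hypothetical; no such extension point exists in Spark today, and the trait and method names are made up:

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical pre-write hook; this API does NOT exist in Spark today.
// The idea: a source registers a transform that Spark applies to the incoming
// DataFrame before handing rows to the partition-level V2 write path.
trait PreWriteTransform {
  def transform(df: DataFrame, options: Map[String, String]): DataFrame
}

// Hudi could then move most of HoodieSparkSQLWriter's logic in here:
class HudiPreWriteTransform extends PreWriteTransform {
  override def transform(df: DataFrame, options: Map[String, String]): DataFrame = {
    // precombine, index lookup, custom partitioning, etc.
    df
  }
}
```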

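And a possible shape for the Option 2 HoodieSparkClient, again just a sketch with illustrative method names, mirroring our existing write clients but operating on Dataset<Row>:

```scala
import org.apache.spark.sql.{Dataset, Row}

// Illustrative only; the actual API would need to be designed in detail.
trait HoodieSparkClient {
  def startCommit(): String
  def upsert(records: Dataset[Row], instantTime: String): Dataset[Row]
  def insert(records: Dataset[Row], instantTime: String): Dataset[Row]
  def delete(keys: Dataset[Row], instantTime: String): Dataset[Row]
}

// Datasource writes through hudi_v2 would be limited to the operations that
// fit the V2 model, e.g. (option key shown is today's, just for illustration):
//   df.write.format("hudi_v2")
//     .option("hoodie.datasource.write.operation", "bulk_insert")
//     .save(basePath)
```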