leesf commented on pull request #3964: URL: https://github.com/apache/hudi/pull/3964#issuecomment-989431132
> @leesf Love to understand the plan going forward here and how we plan to migrate the existing v1 write path onto the v2 APIs. Specifically, the current v1 upsert pipeline consists of the following logical stages `preCombine -> index -> partition -> write` before committing out the files. In other words, we benefit from the v1 API providing ways to shuffle the dataframe further before writing to disk and IIUC v2 takes this flexibility away?
>
> Assuming I am correct (and spark has not introduced any new APIs that help us mitigate this), should we do the following?
>
> * introduce a new `hudiv2` datasource, i.e. `spark.write.format("hudiv2")`, that just supports bulk_insert on the datasource write path.
> * We also add a new `SparkDatasetWriteClient` which exposes methods for upsert, delete, ... and we use that as the basis for our SQL/DML layer as well.
> * We continue to support the v1 `hudi` datasource as-is for some time. There are lots of users who like how they can do upserts/deletes by executing a `spark.write.format("hudi").option()...`

@vinothchandar In fact, I do not intend to introduce a "hudiv2" format along with the V2 code path: it would force end users to change their code, and "hudiv2" is not a good name ("hudi" is good enough), IMO. Instead, I would like to rename the former "hudi" format to "hudi_internal" and make the "hudi" format use the V2 code path by default, so that the change is transparent to end users.

In the first phase, we would fall back to the V1 write path while introducing the V2 interfaces (HoodieCatalog and HoodieInternalTableV2), and integrate with the current bulk_insert V2 write path. In the second phase, we would explore how to integrate with `SparkDatasetWriteClient`, for which @xushiyan did a PoC, to make it a purely V2 code path.
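As a rough illustration of the phased plan above, the routing could be pictured as follows. This is a plain-Python sketch, not Hudi or Spark code: `resolve_write_path`, the `phase` parameter, and the `"v1"`/`"v2"` labels are hypothetical names introduced only for this example, and the sketch assumes that in phase one only `bulk_insert` has a V2 write path while every other operation falls back to V1.

```python
# Hypothetical labels for the two write paths; these are not Hudi identifiers.
V1_PATH = "v1"
V2_PATH = "v2"


def resolve_write_path(fmt: str, operation: str, phase: int = 1) -> str:
    """Pick the write path for a format name and write operation.

    Assumptions (for illustration only):
      * "hudi_internal" is the renamed former V1 datasource and always uses V1.
      * "hudi" is the user-facing name and enters through the V2 interfaces.
      * Phase 1: only bulk_insert has a V2 write path; the rest fall back to V1.
      * Phase 2: every operation runs on the V2 path.
    """
    if fmt == "hudi_internal":
        return V1_PATH
    if fmt != "hudi":
        raise ValueError(f"unknown format: {fmt!r}")
    if phase >= 2 or operation == "bulk_insert":
        return V2_PATH
    # Phase 1 fallback: upsert, delete, etc. still use the V1 write path.
    return V1_PATH


if __name__ == "__main__":
    for op in ("bulk_insert", "upsert", "delete"):
        print(op, "->", resolve_write_path("hudi", op, phase=1))
```

The point of the sketch is that user code keeps writing to the "hudi" format in both phases; only the internal routing changes.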