leesf commented on pull request #3964:
URL: https://github.com/apache/hudi/pull/3964#issuecomment-989431132
> @leesf Love to understand the plan going forward here and how we plan to
> migrate the existing v1 write path onto the v2 APIs. Specifically, the current
> v1 upsert pipeline consists of the following logical stages,
> `preCombine -> index -> partition -> write`, before committing out the files.
> In other words, we benefit from the v1 API providing ways to shuffle the
> dataframe further before writing to disk, and IIUC v2 takes this flexibility
> away?
>
> Assuming I am correct (and spark has not introduced any new APIs that help
> us mitigate this), should we do the following?
>
> * introduce a new `hudiv2` datasource i.e `spark.write.format("hudiv2")`
>   that just supports bulk_insert on the datasource write path.
> * We also add a new `SparkDatasetWriteClient` which exposes methods for
>   upsert, delete, ... and we use that as the basis for our SQL/DML layer as
>   well.
> * We continue to support the v1 `hudi` datasource as-is for some time.
>   There are lots of users who like how they can do upserts/deletes by
>   executing a `spark.write.format("hudi").option()...`
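
For readers following the thread, the four v1 upsert stages named in the quoted comment (`preCombine -> index -> partition -> write`) can be pictured with a toy, in-memory sketch. This is plain Python, not Hudi code: the function names, the record tuple layout, and the dict-based index are all assumptions made for illustration only.

```python
from collections import defaultdict

# Toy records: (record_key, partition_path, ordering_value, payload).
# Every name below is hypothetical; none of this is a Hudi API.

def pre_combine(records):
    """Keep one record per key: the one with the largest ordering value."""
    latest = {}
    for rec in records:
        key = rec[0]
        if key not in latest or rec[2] > latest[key][2]:
            latest[key] = rec
    return list(latest.values())

def tag_with_index(records, index):
    """Label each record as an update (key already indexed) or an insert."""
    return [("update" if rec[0] in index else "insert", rec) for rec in records]

def partition(tagged):
    """Group tagged records by partition path; stands in for the shuffle."""
    buckets = defaultdict(list)
    for op, rec in tagged:
        buckets[rec[1]].append((op, rec))
    return dict(buckets)

def write(buckets, index):
    """Pretend to write each bucket, then register the keys in the index."""
    written = {}
    for path, items in buckets.items():
        written[path] = [(op, rec[0], rec[3]) for op, rec in items]
        for _, rec in items:
            index[rec[0]] = path
    return written

def upsert(records, index):
    """Chain the four stages in the order given in the quoted comment."""
    return write(partition(tag_with_index(pre_combine(records), index)), index)
```

The point of the sketch is that the regrouping in `partition` happens after the index lookup and before the write, which is the extra shuffle the question is concerned about losing.
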
@vinothchandar In fact, I do not intend to introduce a "hudiv2" format along
with the V2 code path: it would force end users to change their code, and
"hudiv2" is not a good name ("hudi" is good enough), IMO. Instead, I would
like to rename the former "hudi" format to "hudi_internal" and make the "hudi"
format use the V2 code path by default, so that the change is transparent to
end users, and integrate it with the current bulk_insert V2 write path.
In the first phase, we would fall back to the V1 write path while introducing
the V2 interfaces (HoodieCatalog and HoodieInternalTableV2) and integrating
with the current bulk_insert V2 write path. In the second phase, we would
explore how to integrate with `SparkDatasetWriteClient`, for which @xushiyan
did a PoC, to make it a purely V2 code path.
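
To make the phase-one routing concrete, here is a rough sketch in plain Python. It is not Hudi or Spark code: the `FORMAT_TO_PATH` table, the `V2_NATIVE_OPERATIONS` set, and `resolve_write_path` are hypothetical names, and the assumption that only bulk_insert is served natively by V2 in phase one is my reading of the plan above.

```python
# Hypothetical mapping of datasource format name to code path, as proposed:
# "hudi" becomes the V2 entry point, the former V1 source is "hudi_internal".
FORMAT_TO_PATH = {"hudi": "v2", "hudi_internal": "v1"}

# Assumption for phase one: only bulk_insert has a native V2 write path.
V2_NATIVE_OPERATIONS = {"bulk_insert"}

def resolve_write_path(fmt, operation):
    """Return which write path would serve this format and operation."""
    path = FORMAT_TO_PATH.get(fmt)
    if path is None:
        raise ValueError(f"unknown format: {fmt}")
    # Phase one: the V2 entry point falls back to V1 for anything it
    # cannot yet serve natively (for example upsert or delete).
    if path == "v2" and operation not in V2_NATIVE_OPERATIONS:
        return "v1"
    return path
```

Under these assumptions, existing user code that writes with the "hudi" format keeps working unchanged: bulk_insert goes through V2, and everything else is served by V1 until the second phase removes the fallback.
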