[GitHub] [hudi] leesf commented on pull request #3964: [HUDI-2732][RFC-38] Spark Datasource V2 Integration

GitBox Wed, 08 Dec 2021 17:59:37 -0800


leesf commented on pull request #3964:
URL: https://github.com/apache/hudi/pull/3964#issuecomment-989431132



   > @leesf Love to understand the plan going forward here and how we plan to 
migrate the existing v1 write path onto the v2 APIs. Specifically, the current 
v1 upsert pipeline consists of the following logical stages: 
`preCombine -> index -> partition -> write`, before committing the files. In 
other words, we benefit from the v1 API providing ways to shuffle the dataframe 
further before writing to disk, and IIUC v2 takes this flexibility away?
   > 
   > Assuming I am correct (and spark has not introduced any new APIs that help 
us mitigate this), should we do the following?
   > 
   > * Introduce a new `hudiv2` datasource, i.e. `spark.write.format("hudiv2")`, 
that just supports bulk_insert on the datasource write path.
   > * Also add a new `SparkDatasetWriteClient` which exposes methods for 
upsert, delete, etc., and use that as the basis for our SQL/DML layer as well.
   > * Continue to support the v1 `hudi` datasource as-is for some time. 
There are lots of users who like how they can do upserts/deletes by executing a 
`spark.write.format("hudi").option()...`
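
For reference, the four v1 stages named in the quote can be pictured with a toy, pure-Python model. This is only a sketch under stated assumptions: the record layout, the function names, and the `existing_keys` set are hypothetical and are not Hudi APIs. In Spark, each of the first three stages can involve reshuffling the data by a different key, which is the flexibility the quote refers to.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical record layout: (record_key, partition_path, ordering_value, payload).
Record = Tuple[str, str, int, str]
Tagged = Tuple[str, Record]


def pre_combine(records: Iterable[Record]) -> List[Record]:
    """Keep one record per (partition, key): the one with the largest ordering value."""
    latest: Dict[Tuple[str, str], Record] = {}
    for rec in records:
        key = (rec[1], rec[0])
        if key not in latest or rec[2] > latest[key][2]:
            latest[key] = rec
    return list(latest.values())


def tag_with_index(records: List[Record], existing_keys: Set[Tuple[str, str]]) -> List[Tagged]:
    """Tag each record as an update (key already stored) or an insert (new key)."""
    return [("update" if (r[1], r[0]) in existing_keys else "insert", r) for r in records]


def partition(tagged: List[Tagged]) -> Dict[str, List[Tagged]]:
    """Group tagged records by partition path so each group is written together."""
    buckets: Dict[str, List[Tagged]] = defaultdict(list)
    for op, rec in tagged:
        buckets[rec[1]].append((op, rec))
    return dict(buckets)


def write(buckets: Dict[str, List[Tagged]]) -> Dict[str, Dict[str, int]]:
    """Stand-in for the file write: report insert/update counts per partition."""
    summary: Dict[str, Dict[str, int]] = {}
    for path, rows in buckets.items():
        counts = {"insert": 0, "update": 0}
        for op, _ in rows:
            counts[op] += 1
        summary[path] = counts
    return summary


def upsert(records: Iterable[Record], existing_keys: Set[Tuple[str, str]]) -> Dict[str, Dict[str, int]]:
    """Chain the stages in the order given in the quote."""
    return write(partition(tag_with_index(pre_combine(records), existing_keys)))
```

Calling `upsert(batch, existing_keys)` on a batch with two versions of the same key writes only the newer version, tagged as an update if the key is already in `existing_keys`.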
   
   @vinothchandar In fact, I do not intend to introduce a "hudiv2" format along 
with the V2 code path: it would force end users to change their code, and 
"hudiv2" is not a good name IMO ("hudi" is good enough). Instead, I would like 
to rename the former "hudi" format to "hudi_internal" and make the "hudi" 
format default to the V2 code path, so the change is transparent to end users, 
and integrate it with the current bulk_insert V2 write path. 
   In the first phase, we would fall back to the V1 write path while 
introducing the V2 interfaces (HoodieCatalog and HoodieInternalTableV2) and 
integrating with the current bulk_insert V2 write path. In the second phase, we 
would explore integrating with `SparkDatasetWriteClient`, for which @xushiyan 
did a PoC, to make it a purely V2 code path.
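
To make the first-phase routing concrete, here is a minimal Python sketch of how a write could be dispatched by format name and operation. It is an illustration of the plan described above, not Hudi code: `resolve_write_path` and the two path labels are hypothetical names, and it assumes only bulk_insert takes the V2 write path in phase one while every other operation falls back to V1.

```python
# Hypothetical labels for the two write paths; not Hudi identifiers.
V1_WRITE_PATH = "v1"
V2_BULK_INSERT_PATH = "v2_bulk_insert"


def resolve_write_path(format_name: str, operation: str) -> str:
    """Pick a write path for phase one of the proposal (assumed behaviour)."""
    if format_name == "hudi_internal":
        # The former "hudi" format, renamed: always the V1 write path.
        return V1_WRITE_PATH
    if format_name == "hudi":
        # "hudi" resolves through the V2 interfaces; only bulk_insert has a
        # V2 write path so far, everything else falls back to V1.
        if operation == "bulk_insert":
            return V2_BULK_INSERT_PATH
        return V1_WRITE_PATH
    raise ValueError(f"unknown format: {format_name}")
```

With this routing, user code that writes with format "hudi" stays unchanged: an upsert still ends up on the V1 path, and a bulk_insert uses the existing V2 writer.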


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
