Hi all,

The community discussed moving to the DataSource V2 API before [1], but there has been no further progress since then, so I want to bring the discussion up again.

Hudi still uses the V1 API and relies heavily on the RDD API for indexing, repartitioning and so on, given the flexibility RDDs offer. The V2 API, by contrast, eliminates RDD usage, introduces the CatalogPlugin mechanism for managing tables, and defines entirely new writing and reading interfaces. That poses some challenges, since Hudi uses RDDs on both the writing and reading paths. Even so, I think integrating Hudi with the V2 API is necessary: the V1 API is quite old, and V2 enables optimizations such as richer filter pushdown on the query side, which would accelerate queries when combined with RFC-27 [2].
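To make the query-side benefit concrete, here is a rough sketch, against the Spark 3 connector interfaces, of what filter pushdown could look like on the Hudi read path. HoodieScanBuilder and HoodieScan are illustrative names only (not existing classes), and the set of accepted filters is just an example:

    import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownFilters}
    import org.apache.spark.sql.sources.{EqualTo, Filter, GreaterThan, LessThan}
    import org.apache.spark.sql.types.StructType

    // Illustrative Scan stub: only readSchema() is mandatory; the other
    // Scan methods have defaults that throw until implemented.
    class HoodieScan(schema: StructType, pushed: Array[Filter]) extends Scan {
      override def readSchema(): StructType = schema
    }

    // Illustrative ScanBuilder that keeps the filters Hudi could evaluate
    // against a data skipping / column-stats index (RFC-27) and hands the
    // rest back to Spark.
    class HoodieScanBuilder(tableSchema: StructType)
      extends ScanBuilder with SupportsPushDownFilters {

      private var pushed: Array[Filter] = Array.empty

      override def pushFilters(filters: Array[Filter]): Array[Filter] = {
        val (supported, unsupported) = filters.partition {
          case _: EqualTo | _: GreaterThan | _: LessThan => true
          case _ => false
        }
        pushed = supported
        unsupported // Spark re-applies these filters on top of the scan
      }

      override def pushedFilters(): Array[Filter] = pushed

      // A real implementation would use `pushed` to prune files before
      // planning input partitions.
      override def build(): Scan = new HoodieScan(tableSchema, pushed)
    }

Spark calls pushFilters during planning, so whatever we accept there is what the RFC-27 index could use to prune files before the scan even starts.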
And here is the work I think we need to do when moving to the V2 API:

1. Integrate with the V2 writing interface (the bulk_insert row path is already implemented, but upsert/insert operations still fall back to the V1 writing code path).
2. Integrate with the V2 reading interface.
3. Introduce a CatalogPlugin to manage Hudi tables (a rough sketch is appended at the end of this mail).
4. Fully adopt the V2 writing interface, i.e. consume Iterator<InternalRow> directly, which may require some refactoring of HoodieSparkWriteClient so that precombining, indexing, etc. keep working.

Please add any other work not mentioned above; I would love to hear the community's opinions and feedback. There is already an umbrella ticket tracking DataSource V2 [3], and I will put up an RFC with more details. You can also join the #spark-datasource-v2 channel in the Hudi Slack for further discussion.

[1] https://lists.apache.org/thread.html/r0411d53b46d8bb2a57c697e295c83a274fa0bc817a2a8ca8eb103a3d%40%3Cdev.hudi.apache.org%3E
[2] https://cwiki.apache.org/confluence/display/HUDI/RFC-27+Data+skipping+index+to+improve+query+performance
[3] https://issues.apache.org/jira/browse/HUDI-1297

Thanks,
Leesf
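Appendix: for item 3, a minimal sketch of what a Hudi CatalogPlugin could look like, again with illustrative names and stubbed method bodies (TableCatalog already extends CatalogPlugin):

    import java.util

    import org.apache.spark.sql.catalyst.analysis.NoSuchTableException
    import org.apache.spark.sql.connector.catalog.{Identifier, Table, TableCatalog, TableChange}
    import org.apache.spark.sql.connector.expressions.Transform
    import org.apache.spark.sql.types.StructType
    import org.apache.spark.sql.util.CaseInsensitiveStringMap

    // Hypothetical catalog plugin for Hudi tables; sketch only.
    class HoodieCatalog extends TableCatalog {

      private var catalogName: String = _

      // Called once when Spark instantiates the catalog; catalog-level
      // options (e.g. a base path) would be read from `options` here.
      override def initialize(name: String, options: CaseInsensitiveStringMap): Unit = {
        catalogName = name
      }

      override def name(): String = catalogName

      override def listTables(namespace: Array[String]): Array[Identifier] = Array.empty

      // A real implementation would resolve the table path, read the
      // .hoodie metadata and return a Table exposing V2 scan/write builders.
      override def loadTable(ident: Identifier): Table =
        throw new NoSuchTableException(ident)

      // Would initialize the table path and hoodie.properties.
      override def createTable(
          ident: Identifier,
          schema: StructType,
          partitions: Array[Transform],
          properties: util.Map[String, String]): Table =
        throw new UnsupportedOperationException("sketch only")

      override def alterTable(ident: Identifier, changes: TableChange*): Table =
        throw new UnsupportedOperationException("sketch only")

      override def dropTable(ident: Identifier): Boolean = false

      override def renameTable(oldIdent: Identifier, newIdent: Identifier): Unit =
        throw new UnsupportedOperationException("sketch only")
    }

Such a catalog would be registered via the standard spark.sql.catalog.<name> configuration (the class name above is hypothetical), after which <name>.db.table identifiers resolve through it for DDL and queries.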
