Great initiative and idea, Leesf. Totally agreed on the benefits of adopting the V2 APIs. On the 4th point, "Fully adopt the V2 writing interface":
I have previously worked on implementing upsert with the V2 writing interface, using SimpleIndex via a broadcast join. The POC worked, but without fully integrating with the other table services. The downside of going this route would be re-implementing most of the logic we have today in the RDD writer path, including the different indexing implementations, which are non-trivial.

Another route I've POC'ed is to treat the current RDD writer path as the Hudi "writer framework": an input Dataset<Row> goes through the same components we see today:

Client -> Specific ActionExecutor -> Helper -> (dedup/indexing/tagging/build profile) -> Base Write ActionExecutor -> (map partitions and perform write on Row iterator via parquet writer/reader) -> return Dataset<WriteStatus>

As you can see, the 1st approach adopts an engine-native framework (the V2 writing interface in this case) to realize Hudi operations, while the 2nd approach adopts the Hudi "writer framework", using engine-native data-level APIs to realize Hudi operations. The 2nd approach gives better flexibility in adopting different engines: it leverages each engine's capabilities to manipulate data while ensuring write operations are realized in the "Hudi" way. The prerequisite for this is a flexible Hudi abstraction on top of the different engines' data-level APIs. Ethan has landed 2 major abstraction PRs to pave the way for it, which will enable a great deal of code reuse.

The Hudi "writer framework" today consists of a bunch of Java classes. It can be formalized and refactored along the way while implementing Row writing. Once the "framework" is formalized, its flexibility can really shine in bringing new processing engines to Hudi. Something similar could be done on the reader path too, I suppose.
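To make the 1st approach concrete, below is a minimal sketch of how an upsert could hang off Spark 3's V2 write interface. All the Hoodie* class names are made up for illustration; only the Spark interfaces are real:

import java.io.IOException;
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.connector.write.BatchWrite;
import org.apache.spark.sql.connector.write.DataWriter;
import org.apache.spark.sql.connector.write.DataWriterFactory;
import org.apache.spark.sql.connector.write.PhysicalWriteInfo;
import org.apache.spark.sql.connector.write.WriterCommitMessage;

// Hypothetical commit message; would carry per-task write statistics.
class HoodieCommitMessage implements WriterCommitMessage {
}

class HoodieBatchWrite implements BatchWrite {
  @Override
  public DataWriterFactory createBatchWriterFactory(PhysicalWriteInfo info) {
    // The index lookup result (e.g. the SimpleIndex broadcast-join output)
    // would have to be made available to every task from here.
    return new HoodieUpsertWriterFactory();
  }

  @Override
  public void commit(WriterCommitMessage[] messages) {
    // Aggregate the per-task messages and commit the instant on the timeline.
  }

  @Override
  public void abort(WriterCommitMessage[] messages) {
    // Clean up files produced by failed or speculative tasks.
  }
}

class HoodieUpsertWriterFactory implements DataWriterFactory {
  @Override
  public DataWriter<InternalRow> createWriter(int partitionId, long taskId) {
    return new DataWriter<InternalRow>() {
      @Override
      public void write(InternalRow record) throws IOException {
        // Route the record to an insert or merge handle based on its
        // index tag, then write it via the parquet writer.
      }

      @Override
      public WriterCommitMessage commit() throws IOException {
        return new HoodieCommitMessage();
      }

      @Override
      public void abort() throws IOException {
        // Discard partially written files for this task.
      }

      @Override
      public void close() throws IOException {
        // Flush and close the underlying parquet writer.
      }
    };
  }
}

Everything hidden in those comments (dedup, indexing, table services) is exactly the logic that would need to be re-implemented, which is the cost I mentioned above.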
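The 2nd approach keeps all of that logic in Hudi and only asks the engine for data-level primitives. A simplified stand-in, with a stub WriteStatus bean instead of Hudi's real class:

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.function.MapPartitionsFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

public class RowWriterSketch {

  // Stand-in for Hudi's WriteStatus, shaped as a bean for the encoder.
  public static class WriteStatus implements java.io.Serializable {
    private String filePath;
    private long recordsWritten;
    public String getFilePath() { return filePath; }
    public void setFilePath(String p) { filePath = p; }
    public long getRecordsWritten() { return recordsWritten; }
    public void setRecordsWritten(long n) { recordsWritten = n; }
  }

  // taggedRecords is the Dataset<Row> after dedup/indexing/tagging and
  // workload-profile building have already run upstream.
  public static Dataset<WriteStatus> write(Dataset<Row> taggedRecords) {
    return taggedRecords.mapPartitions(
        (MapPartitionsFunction<Row, WriteStatus>) rows -> {
          long count = 0;
          while (rows.hasNext()) {
            rows.next(); // real code hands each Row to a parquet writer
            count++;
          }
          WriteStatus status = new WriteStatus();
          status.setFilePath("<file written by this task>");
          status.setRecordsWritten(count);
          List<WriteStatus> out = new ArrayList<>();
          out.add(status);
          return out.iterator();
        },
        Encoders.bean(WriteStatus.class));
  }
}

The pipeline around this call (Client -> ActionExecutor -> Helper -> ...) stays pure Hudi, so the same flow can be re-targeted at another engine by swapping in that engine's mapPartitions equivalent.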
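On the 2nd point in your list, the big win is that the V2 reading interface hands query filters to the ScanBuilder, which is exactly where the RFC-27 data-skipping index could prune files. A rough, hypothetical sketch of that hook:

import org.apache.spark.sql.connector.read.Scan;
import org.apache.spark.sql.connector.read.ScanBuilder;
import org.apache.spark.sql.connector.read.SupportsPushDownFilters;
import org.apache.spark.sql.sources.Filter;
import org.apache.spark.sql.types.StructType;

// Hypothetical scan builder, only to show the extension point.
class HoodieScanBuilderSketch implements ScanBuilder, SupportsPushDownFilters {
  private Filter[] pushed = new Filter[0];

  @Override
  public Filter[] pushFilters(Filter[] filters) {
    // Record every filter so the scan can prune files against the
    // column-stats index, but hand them all back as residuals: data
    // skipping is best-effort, so Spark must still evaluate them.
    pushed = filters;
    return filters;
  }

  @Override
  public Filter[] pushedFilters() {
    return pushed;
  }

  @Override
  public Scan build() {
    return new Scan() {
      @Override
      public StructType readSchema() {
        // Real code would return the table schema and plan input
        // partitions only for files that survive the skipping.
        return new StructType();
      }
    };
  }
}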
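And on the 3rd point, one low-friction option might be extending DelegatingCatalogExtension, so the session catalog keeps handling non-Hudi tables and we only intercept the Hudi-specific paths. Again, the class below is purely illustrative:

import java.util.Map;
import org.apache.spark.sql.catalyst.analysis.NoSuchNamespaceException;
import org.apache.spark.sql.catalyst.analysis.NoSuchTableException;
import org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException;
import org.apache.spark.sql.connector.catalog.DelegatingCatalogExtension;
import org.apache.spark.sql.connector.catalog.Identifier;
import org.apache.spark.sql.connector.catalog.Table;
import org.apache.spark.sql.connector.expressions.Transform;
import org.apache.spark.sql.types.StructType;

// Hypothetical catalog, only to show the CatalogPlugin hook points.
public class HoodieCatalogSketch extends DelegatingCatalogExtension {

  @Override
  public Table loadTable(Identifier ident) throws NoSuchTableException {
    Table table = super.loadTable(ident);
    // If this resolves to a Hudi table, wrap it so reads and writes go
    // through the V2 scan/write builders instead of falling back to V1.
    return table;
  }

  @Override
  public Table createTable(
      Identifier ident,
      StructType schema,
      Transform[] partitions,
      Map<String, String> properties)
      throws TableAlreadyExistsException, NoSuchNamespaceException {
    // Bootstrap the .hoodie metadata directory before delegating the
    // metastore registration to the session catalog.
    return super.createTable(ident, schema, partitions, properties);
  }
}

Such a catalog would be registered via spark.sql.catalog.spark_catalog=<catalog class>, so existing spark_catalog behavior is preserved for everything else.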
On Tue, Nov 9, 2021 at 7:55 AM leesf <[email protected]> wrote:

> Hi all,
>
> I saw the community discuss moving to the V2 datasource API before [1],
> but there was no further progress. So I want to bring up the discussion
> again to move to the Spark datasource V2 API. Hudi still uses the V1 API
> and relies heavily on the RDD API to index, repartition and so on, given
> the flexibility of the RDD API. However, the V2 API eliminates RDD usage
> and introduces the CatalogPlugin mechanism, which gives the ability to
> manage Hudi tables, as well as totally new writing and reading
> interfaces. This poses some challenges, since Hudi uses RDDs in both the
> writing and reading paths. However, I think it is still necessary to
> integrate Hudi with the V2 API, as the V1 API is too old and the V2 API
> brings optimizations such as more pushdown filters on the query side to
> accelerate queries when integrating with RFC-27 [2].
>
> Here is the work I think we should do when moving to the V2 API:
>
> 1. Integrate with the V2 writing interface (the bulk_insert row path is
> already implemented, but upsert/insert operations still fall back to the
> V1 writing code path).
> 2. Integrate with the V2 reading interface.
> 3. Introduce CatalogPlugin to manage Hudi tables.
> 4. Fully adopt the V2 writing interface (use Iterator<InternalRow>, which
> may need some refactoring of HoodieSparkWriteClient to make precombining,
> indexing, etc. work fine).
>
> Please add any other work not mentioned above; I would love to hear other
> opinions and feedback from the community. I see there is already an
> umbrella ticket to track datasource V2 [3], and I will put up an RFC with
> more details. You can also join the channel #spark-datasource-v2 in the
> Hudi Slack for more discussion.
>
> [1]
> https://lists.apache.org/thread.html/r0411d53b46d8bb2a57c697e295c83a274fa0bc817a2a8ca8eb103a3d%40%3Cdev.hudi.apache.org%3E
> [2]
> https://cwiki.apache.org/confluence/display/HUDI/RFC-27+Data+skipping+index+to+improve+query+performance
> [3] https://issues.apache.org/jira/browse/HUDI-1297
>
> Thanks,
> Leesf
