Hi all,

Sorry, a bit late to the party here. +1 on kicking this off and +1 on reusing the work Raymond has already kickstarted. I think we are in a good position to roll with this approach.
The biggest issue with V2 on the writing side remains the fact that we cannot really "shuffle" data for precombine or indexing after we issue spark.write.format(..).

I propose the following approach:

- Instead of changing the existing V1 datasource, we make a new "*hudi-v2*" datasource.
- Writes in "hudi-v2" just support bulk_insert or overwrites, like what spark.parquet.write does.
- We introduce a new `SparkDataSetWriteClient` or equivalent that offers programmatic access to all the rich APIs we have currently, so users can use it for delete, insert, update, index, etc.

I think this approach is really flexible. Users can use v1 for streaming updates if they prefer that. It also offers a migration path from v1 to v2 that does not involve breaking pipelines.

I see that there is an RFC open. Will get on it once my release blockers are in better shape.

Thanks
Vinoth

On Mon, Nov 15, 2021 at 6:26 AM leesf <[email protected]> wrote:

> Thanks Raymond for sharing the work that has been done. I agree that the 1st approach would need more work and time to make it fully adapted to the V2 interfaces, and it would differ across engines. Thus, abstracting the Hudi core writing/reading framework to adapt to different engines (approach 2) looks good to me at the moment, since we do not need extra work to adapt to other engines and can focus on the Spark writing/reading side.
>
> Raymond Xu <[email protected]> wrote on Sun, Nov 14, 2021 at 5:44 PM:
>
> > Great initiative and idea, Leesf.
> >
> > Totally agreed on the benefits of adopting V2 APIs. On the 4th point, "Fully use the V2 writing interface":
> >
> > I have previously worked on implementing upsert with the V2 writing interface, with SimpleIndex using a broadcast join. The POC worked without fully integrating with other table services. The downside of going this route would be re-implementing most of the logic we have today in the RDD writer path, including the different indexing implementations, which are non-trivial.
> >
> > Another route I've PoC'ed is to treat the current RDD writer path as a Hudi "writer framework": an input Dataset<Row> goes through the same components we see today: Client -> specific ActionExecutor -> Helper -> (dedup/indexing/tagging/build profile) -> base write ActionExecutor -> (map partitions and perform the write on the Row iterator via parquet writer/reader) -> return Dataset<WriteStatus>.
> >
> > As you can see, the 1st approach is to adopt an engine-native framework (the V2 writing interface in this case) to realize Hudi operations, while the 2nd approach is to adopt the Hudi "writer framework" by using engine-native data-level APIs to realize Hudi operations. The 2nd approach gives better flexibility in adopting different engines; it leverages engines' capabilities to manipulate data while ensuring write operations are realized in the "Hudi" way. The prerequisite for this is a flexible Hudi abstraction on top of different engines' data-level APIs. Ethan has landed 2 major abstraction PRs to pave the way for it, which will enable a great deal of code reuse.
> >
> > The Hudi "writer framework" today consists of a bunch of Java classes. It can be formalized and refactored along the way while implementing Row writing. Once the "framework" is formalized, its flexibility can really shine when bringing new processing engines to Hudi. Something similar could be done on the reader path too, I suppose.
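As a point of reference for the "writer framework" flow described above, here is a minimal, purely illustrative sketch of what the stage boundaries could look like over Dataset[Row]. Every name below is hypothetical (RowWritePipeline, RowWriteStatus, etc. are not existing Hudi classes); it only mirrors the Client -> index/tag -> profile -> write composition.

import org.apache.spark.sql.{Dataset, Row}

// Stand-in for Hudi's WriteStatus; illustrative only.
case class RowWriteStatus(fileId: String, totalRecords: Long, totalErrors: Long)

// Hypothetical stage boundaries for a Dataset[Row]-based writer framework.
// None of these names exist in Hudi; they only mirror the flow sketched above.
trait RowWritePipeline {
  def dedup(input: Dataset[Row]): Dataset[Row]                    // precombine duplicates on the configured field
  def tagLocations(records: Dataset[Row]): Dataset[Row]           // indexing: attach existing file-group locations
  def buildProfile(tagged: Dataset[Row]): Map[String, Long]       // workload profile: insert/update counts per partition
  def write(tagged: Dataset[Row],
            profile: Map[String, Long]): Dataset[RowWriteStatus]  // mapPartitions + parquet writer on Row iterators
}

object RowWriteFlow {
  // An upsert is then just a composition of the stages.
  def upsert(p: RowWritePipeline, input: Dataset[Row]): Dataset[RowWriteStatus] = {
    val tagged = p.tagLocations(p.dedup(input))
    p.write(tagged, p.buildProfile(tagged))
  }
}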
> >
> > On Tue, Nov 9, 2021 at 7:55 AM leesf <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > I did see the community discuss moving to the V2 datasource API before [1], but there was no further progress. So I want to bring up the discussion again about moving to the Spark datasource V2 API. Hudi still uses the V1 API and relies heavily on the RDD API to index, repartition and so on, given the flexibility of the RDD API. The V2 API, however, eliminates RDD usage, introduces the CatalogPlugin mechanism to manage Hudi tables, and brings totally new writing and reading interfaces, which causes some challenges since Hudi uses RDDs in both the writing and reading paths. Still, I think it is necessary to integrate Hudi with the V2 API, as the V1 API is too old and the V2 API brings optimizations such as more pushdown filters on the query side to accelerate queries when integrating with RFC-27 [2].
> > >
> > > Here is the work I think we should do when moving to the V2 API:
> > >
> > > 1. Integrate with the V2 writing interface (the bulk_insert row path is already implemented, but upsert/insert operations would fall back to the V1 writing code path).
> > > 2. Integrate with the V2 reading interface.
> > > 3. Introduce CatalogPlugin to manage Hudi tables.
> > > 4. Fully use the V2 writing interface (use Iterator<InternalRow>, which may need some refactoring of HoodieSparkWriteClient to make precombining, indexing, etc. work fine).
> > >
> > > Please add any other work not mentioned above; I would love to hear other opinions and feedback from the community. I see there is already an umbrella ticket to track datasource V2 [3] and I will put up an RFC with more details. You can also join the channel #spark-datasource-v2 in Hudi Slack for more discussion.
> > >
> > > [1] https://lists.apache.org/thread.html/r0411d53b46d8bb2a57c697e295c83a274fa0bc817a2a8ca8eb103a3d%40%3Cdev.hudi.apache.org%3E
> > > [2] https://cwiki.apache.org/confluence/display/HUDI/RFC-27+Data+skipping+index+to+improve+query+performance
> > > [3] https://issues.apache.org/jira/browse/HUDI-1297
> > >
> > > Thanks
> > > Leesf
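For readers less familiar with the interfaces the work items above refer to, a bare-bones, illustrative skeleton of the Spark 3 DataSource V2 entry points the writing-side items would plug into follows. The interface names (TableProvider, Table, SupportsWrite, WriteBuilder, TableCapability) are Spark's public API; the Hudi-specific class names and bodies are placeholders rather than existing code. Item 3 (CatalogPlugin) would additionally mean implementing org.apache.spark.sql.connector.catalog.TableCatalog.

import java.util
import org.apache.spark.sql.connector.catalog.{SupportsWrite, Table, TableCapability, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.connector.write.{LogicalWriteInfo, WriteBuilder}
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Entry point Spark would discover for a V2 Hudi source (class name is hypothetical).
class HudiV2TableProvider extends TableProvider {
  override def inferSchema(options: CaseInsensitiveStringMap): StructType =
    throw new UnsupportedOperationException("read the table schema from Hudi metadata here")
  override def getTable(schema: StructType, partitioning: Array[Transform],
                        properties: util.Map[String, String]): Table =
    new HudiV2Table(schema)
}

// A table that only advertises batch-write capability, i.e. the bulk_insert/overwrite
// style of writes discussed in this thread; richer operations would stay with the client.
class HudiV2Table(tableSchema: StructType) extends Table with SupportsWrite {
  override def name(): String = "hudi_v2_table"
  override def schema(): StructType = tableSchema
  override def capabilities(): util.Set[TableCapability] =
    util.EnumSet.of(TableCapability.BATCH_WRITE, TableCapability.TRUNCATE)
  override def newWriteBuilder(info: LogicalWriteInfo): WriteBuilder =
    throw new UnsupportedOperationException("return a WriteBuilder that writes InternalRow batches")
}

Under this split, a bulk_insert-style spark.write.format(...) save would be routed by Spark to newWriteBuilder above, while upserts, deletes and indexing would go through the programmatic write client proposed at the top of this thread.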
