Thanks Raymond for sharing the work that has been done. I agree that the 1st
approach would need more work and time to fully adapt to the V2 interfaces,
and the result would differ across engines. Thus, abstracting the Hudi core
writing/reading framework to adapt to different engines (approach 2) looks
good to me at the moment, since it requires no extra work to adapt to other
engines and lets us focus on the Spark writing/reading side.

Raymond Xu <xu.shiyan.raym...@gmail.com> wrote on Sun, Nov 14, 2021 at 5:44 PM:

> Great initiative and idea, Leesf.
>
> Totally agreed on the benefits of adopting the V2 APIs. On the 4th point,
> "fully use the V2 writing interface":
>
> I have previously worked on implementing upsert with the V2 writing
> interface, using SimpleIndex with a broadcast join. The PoC worked without
> fully integrating with other table services. The downside of going this
> route would be re-implementing most of the logic we have today in the RDD
> writer path, including the different indexing implementations, which are
> non-trivial.
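>
> For illustration, here is a rough sketch of what such a V2 upsert writer
> could look like. Spark's DataWriter API is real; HoodieUpsertHandle and its
> methods are hypothetical names for this sketch, not existing Hudi classes:
>
> // Sketch only: a V2 DataWriter that routes each row through an upsert
> // handle; an index lookup (e.g. broadcast SimpleIndex) decides whether a
> // row is an insert or an update.
> import java.io.IOException;
> import org.apache.spark.sql.catalyst.InternalRow;
> import org.apache.spark.sql.connector.write.DataWriter;
> import org.apache.spark.sql.connector.write.WriterCommitMessage;
>
> interface HoodieUpsertHandle { // hypothetical abstraction over write handles
>   void upsert(InternalRow record);
>   WriterCommitMessage finish(); // returns the per-task write statuses
>   void rollback();
> }
>
> class HoodieV2UpsertWriter implements DataWriter<InternalRow> {
>   private final HoodieUpsertHandle handle;
>
>   HoodieV2UpsertWriter(HoodieUpsertHandle handle) {
>     this.handle = handle;
>   }
>
>   @Override
>   public void write(InternalRow record) throws IOException {
>     handle.upsert(record); // tag against the index, then buffer the write
>   }
>
>   @Override
>   public WriterCommitMessage commit() throws IOException {
>     return handle.finish();
>   }
>
>   @Override
>   public void abort() throws IOException {
>     handle.rollback();
>   }
>
>   @Override
>   public void close() throws IOException {
>     // resources released in finish()/rollback()
>   }
> }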
>
> Another route I've PoC'ed is to treat the current RDD writer path as a Hudi
> "writer framework": the input Dataset<Row> goes through the same components
> we see today: Client -> specific ActionExecutor -> Helper ->
> (dedup/indexing/tagging/build profile) -> base write ActionExecutor -> (map
> partitions and perform the write on the Row iterator via the parquet
> writer/reader) -> return Dataset<WriteStatus>.
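>
> A minimal sketch of that last step, assuming a WriteStatus-like bean and a
> hypothetical writeRowsToParquet helper standing in for Hudi's file handles:
>
> import java.io.Serializable;
> import java.util.Iterator;
> import org.apache.spark.api.java.function.MapPartitionsFunction;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Encoders;
> import org.apache.spark.sql.Row;
>
> public class RowWriteSketch {
>   // Final stage of the "writer framework": map partitions of the tagged
>   // Dataset<Row> and write each partition's Row iterator to files.
>   public static Dataset<WriteStatus> write(Dataset<Row> taggedRecords) {
>     return taggedRecords.mapPartitions(
>         (MapPartitionsFunction<Row, WriteStatus>) RowWriteSketch::writeRowsToParquet,
>         Encoders.bean(WriteStatus.class));
>   }
>
>   // Hypothetical: would open a row-level file handle, consume the iterator,
>   // and emit one WriteStatus per file written.
>   static Iterator<WriteStatus> writeRowsToParquet(Iterator<Row> rows) {
>     throw new UnsupportedOperationException("sketch only");
>   }
>
>   public static class WriteStatus implements Serializable {
>     private String filePath; // stand-in for Hudi's WriteStatus
>     public String getFilePath() { return filePath; }
>     public void setFilePath(String filePath) { this.filePath = filePath; }
>   }
> }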
>
> As you can see, the 1st approach adopts an engine-native framework (the V2
> writing interface in this case) to realize Hudi operations, while the 2nd
> approach adopts the Hudi "writer framework" and uses engine-native
> data-level APIs to realize them. The 2nd approach gives better flexibility
> in adopting different engines; it leverages each engine's capabilities to
> manipulate data while ensuring write operations are realized in the "Hudi"
> way. The prerequisite is a flexible Hudi abstraction on top of the
> different engines' data-level APIs. Ethan has landed 2 major abstraction
> PRs to pave the way for it, which will enable a great deal of code reuse.
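>
> A minimal sketch of what such a data-level abstraction could look like
> (illustrative only; the actual interfaces are the ones in Ethan's PRs):
>
> import java.util.Iterator;
> import java.util.function.Function;
>
> // Engine-agnostic handle over distributed data: the writer framework codes
> // against this, and each engine supplies an implementation (e.g. one
> // wrapping a JavaRDD for Spark, or a plain List for a Java client).
> interface HoodieData<T> {
>   <U> HoodieData<U> map(Function<T, U> fn);
>   <U> HoodieData<U> mapPartitions(Function<Iterator<T>, Iterator<U>> fn);
>   long count();
> }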
>
> The Hudi "writer framework" today consists of a bunch of Java classes. It
> can be formalized and refactored along the way while implementing Row
> writing. Once the "framework" is formalized, its flexibility can really
> shine on bringing in new processing engines to Hudi. Something similar
> could be done on the reader path too I suppose.
>
> On Tue, Nov 9, 2021 at 7:55 AM leesf <leesf0...@gmail.com> wrote:
>
> > Hi all,
> >
> > I did see the community discuss moving to the V2 datasource API before
> > [1], but it got no further progress. So I want to bring up the discussion
> > again about moving to the Spark datasource V2 API. Hudi still uses the V1
> > API and relies heavily on the RDD API to index, repartition and so on,
> > given the RDD API's flexibility. However, the V2 API eliminates RDD
> > usage, introduces the CatalogPlugin mechanism to manage Hudi tables, and
> > defines totally new writing and reading interfaces, which poses some
> > challenges since Hudi uses RDDs in both the writing and reading paths. I
> > still think it is necessary to integrate Hudi with the V2 API: the V1 API
> > is too old, and V2 optimizations such as more pushdown filters on the
> > query side will accelerate queries when integrating with RFC-27 [2].
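> >
> > As a concrete example of the query-side benefit, a V2 scan can accept
> > pushed filters; a minimal sketch follows (Spark's interfaces are real,
> > while the Hudi class name and the column-stats check are assumptions):
> >
> > import java.util.ArrayList;
> > import java.util.List;
> > import org.apache.spark.sql.connector.read.Scan;
> > import org.apache.spark.sql.connector.read.SupportsPushDownFilters;
> > import org.apache.spark.sql.sources.Filter;
> >
> > class HoodieScanBuilder implements SupportsPushDownFilters {
> >   private Filter[] pushed = new Filter[0];
> >
> >   @Override
> >   public Filter[] pushFilters(Filter[] filters) {
> >     List<Filter> accepted = new ArrayList<>();
> >     List<Filter> rejected = new ArrayList<>();
> >     for (Filter f : filters) {
> >       // Hypothetical check: accept filters the data skipping index
> >       // (RFC-27 column stats) could evaluate to prune files.
> >       if (supportedByColumnStats(f)) accepted.add(f); else rejected.add(f);
> >     }
> >     pushed = accepted.toArray(new Filter[0]);
> >     return rejected.toArray(new Filter[0]); // Spark re-applies these
> >   }
> >
> >   @Override
> >   public Filter[] pushedFilters() {
> >     return pushed;
> >   }
> >
> >   @Override
> >   public Scan build() {
> >     throw new UnsupportedOperationException("sketch only");
> >   }
> >
> >   private boolean supportedByColumnStats(Filter f) {
> >     return false; // placeholder
> >   }
> > }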
> >
> > And here is the work I think we should do when moving to the V2 API.
> >
> > 1. Integrate with the V2 writing interface (the bulk_insert row path is
> > already implemented, but upsert/insert operations would fall back to the
> > V1 writing code path)
> > 2. Integrate with the V2 reading interface
> > 3. Introduce CatalogPlugin to manage Hudi tables (see the sketch after
> > this list)
> > 4. Fully use the V2 writing interface (use Iterator<InternalRow>, which
> > may need some refactoring of HoodieSparkWriteClient to keep precombining,
> > indexing, etc. working fine)
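> >
> > To sketch point 3: Spark discovers a catalog class through configuration,
> > e.g. spark.sql.catalog.hudi_catalog=com.example.HoodieCatalogSketch, and
> > routes table DDL and lookups to it. The class name and method bodies
> > below are illustrative, not an existing Hudi implementation:
> >
> > import java.util.Map;
> > import org.apache.spark.sql.connector.catalog.Identifier;
> > import org.apache.spark.sql.connector.catalog.Table;
> > import org.apache.spark.sql.connector.catalog.TableCatalog;
> > import org.apache.spark.sql.connector.catalog.TableChange;
> > import org.apache.spark.sql.connector.expressions.Transform;
> > import org.apache.spark.sql.types.StructType;
> > import org.apache.spark.sql.util.CaseInsensitiveStringMap;
> >
> > public class HoodieCatalogSketch implements TableCatalog {
> >   private String name;
> >
> >   @Override
> >   public void initialize(String name, CaseInsensitiveStringMap options) {
> >     this.name = name; // catalog-level options, e.g. a default base path
> >   }
> >
> >   @Override
> >   public String name() { return name; }
> >
> >   @Override
> >   public Identifier[] listTables(String[] namespace) {
> >     return new Identifier[0]; // would list Hudi tables in the namespace
> >   }
> >
> >   @Override
> >   public Table loadTable(Identifier ident) {
> >     // Would read .hoodie metadata and return a Table wired to the V2
> >     // reading/writing interfaces from points 1, 2 and 4.
> >     throw new UnsupportedOperationException("sketch only");
> >   }
> >
> >   @Override
> >   public Table createTable(Identifier ident, StructType schema,
> >       Transform[] partitions, Map<String, String> properties) {
> >     throw new UnsupportedOperationException("sketch only");
> >   }
> >
> >   @Override
> >   public Table alterTable(Identifier ident, TableChange... changes) {
> >     throw new UnsupportedOperationException("sketch only");
> >   }
> >
> >   @Override
> >   public boolean dropTable(Identifier ident) { return false; }
> >
> >   @Override
> >   public void renameTable(Identifier from, Identifier to) {
> >     throw new UnsupportedOperationException("sketch only");
> >   }
> > }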
> >
> > Please add any other work not mentioned above; I would love to hear
> > other opinions and feedback from the community. I see there is already an
> > umbrella ticket to track datasource V2 [3], and I will put up an RFC for
> > more details. You can also join the #spark-datasource-v2 channel in the
> > Hudi Slack for more discussion.
> >
> > [1] https://lists.apache.org/thread.html/r0411d53b46d8bb2a57c697e295c83a274fa0bc817a2a8ca8eb103a3d%40%3Cdev.hudi.apache.org%3E
> > [2] https://cwiki.apache.org/confluence/display/HUDI/RFC-27+Data+skipping+index+to+improve+query+performance
> > [3] https://issues.apache.org/jira/browse/HUDI-1297
> >
> > Thanks
> > Leesf
> >
>
