Hi leesf,
Supporting Spark DataSource V2 has always been something we wanted to do, and
we would be glad to get involved.
On 2021-11-15 22:26:17, "leesf" <[email protected]> wrote:
>Thanks Raymond for sharing the work that has been done. I agree that the 1st
>approach would need more work and time to fully adapt to the V2 interfaces,
>and the result would differ across engines. Thus, abstracting the Hudi core
>writing/reading framework to adapt to different engines (approach 2) looks
>good to me at the moment, since we would not need extra work to adapt to
>other engines and could focus on the Spark writing/reading side.
>
>Raymond Xu <[email protected]> wrote on Sun, Nov 14, 2021, 5:44 PM:
>
>> Great initiative and idea, Leesf.
>>
>> Totally agreed on the benefits of adopting the V2 APIs. On the 4th point,
>> "fully adopt the V2 writing interface":
>>
>> I have previously worked on implementing upsert with the V2 writing
>> interface, using SimpleIndex with a broadcast join. The POC worked without
>> fully integrating with the other table services. The downside of going
>> this route would be re-implementing most of the logic we have today in the
>> RDD writer path, including the different indexing implementations, which
>> are non-trivial.
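>>
>> For reference, a minimal sketch of the per-task piece of the V2 writing
>> interface (the interface comes from Spark 3's connector API; the Hudi
>> upsert logic inside is hypothetical, just to show where the broadcast
>> index tagging would live):
>>
>>   import java.io.IOException;
>>   import org.apache.spark.sql.catalyst.InternalRow;
>>   import org.apache.spark.sql.connector.write.DataWriter;
>>   import org.apache.spark.sql.connector.write.WriterCommitMessage;
>>
>>   // Hypothetical per-task writer: Spark hands it InternalRows; it tags
>>   // them against a broadcast SimpleIndex and flushes parquet on commit.
>>   class HoodieUpsertDataWriter implements DataWriter<InternalRow> {
>>     @Override public void write(InternalRow row) throws IOException {
>>       // buffer the row and tag its record location via the broadcast index
>>     }
>>     @Override public WriterCommitMessage commit() throws IOException {
>>       return null; // flush buffered rows to parquet, return file stats
>>     }
>>     @Override public void abort() throws IOException {
>>       // clean up any partially written files
>>     }
>>     @Override public void close() throws IOException { }
>>   }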
>>
>> Another route I've PoC'ed is to treat the current RDD writer path as a
>> Hudi "writer framework": an input Dataset<Row> goes through the components
>> we see today, Client -> Specific ActionExecutor -> Helper ->
>> (dedup/indexing/tagging/build profile) -> Base Write ActionExecutor ->
>> (map partitions and perform the write on a Row iterator via the parquet
>> writer/reader) -> return Dataset<WriterStatus>.
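>>
>> In code, that flow would look roughly like this (a sketch; dedupAndTag,
>> writeRowsForPartition and the WriteStatus bean are placeholders for
>> today's client/executor logic):
>>
>>   import org.apache.spark.api.java.function.MapPartitionsFunction;
>>   import org.apache.spark.sql.Dataset;
>>   import org.apache.spark.sql.Encoders;
>>   import org.apache.spark.sql.Row;
>>
>>   // dedup/indexing/tagging/build profile, driven by the Hudi framework
>>   Dataset<Row> tagged = dedupAndTag(inputRows);
>>   // map partitions and write each Row iterator via the parquet writer
>>   Dataset<WriteStatus> statuses = tagged.mapPartitions(
>>       (MapPartitionsFunction<Row, WriteStatus>) iter -> writeRowsForPartition(iter),
>>       Encoders.bean(WriteStatus.class));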
>>
>> As you can see, the 1st approach adopts an engine-native framework (the V2
>> writing interface in this case) to realize Hudi operations, while the 2nd
>> approach adopts the Hudi "writer framework" and uses engine-native
>> data-level APIs to realize Hudi operations. The 2nd approach gives better
>> flexibility for adopting different engines: it leverages each engine's
>> capabilities to manipulate data while ensuring write operations are
>> realized in the "Hudi" way. The prerequisite for this is a flexible Hudi
>> abstraction on top of the different engines' data-level APIs. Ethan has
>> landed 2 major abstraction PRs to pave the way for it, which will enable a
>> great deal of code reuse.
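>>
>> Conceptually, that abstraction boils down to something like the following
>> (a hypothetical interface just to illustrate the idea; the real
>> abstractions are in Ethan's PRs, and the functions would need to be
>> serializable for Spark):
>>
>>   import java.util.Iterator;
>>   import java.util.List;
>>   import java.util.function.Function;
>>
>>   // Engine-agnostic handle over distributed data: a Spark implementation
>>   // would wrap an RDD/Dataset, a Flink one a DataStream, a local one a List.
>>   public interface EngineData<T> {
>>     <O> EngineData<O> map(Function<T, O> func);
>>     <O> EngineData<O> mapPartitions(Function<Iterator<T>, Iterator<O>> func);
>>     List<T> collectAsList();
>>   }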
>>
>> The Hudi "writer framework" today consists of a bunch of Java classes. It
>> can be formalized and refactored along the way while implementing Row
>> writing. Once the "framework" is formalized, its flexibility can really
>> shine in bringing new processing engines to Hudi. Something similar could
>> be done on the reader path too, I suppose.
>>
>> On Tue, Nov 9, 2021 at 7:55 AM leesf <[email protected]> wrote:
>>
>> > Hi all,
>> >
>> > I did see the community discuss moving to the V2 datasource API before
>> > [1], but there was no further progress, so I want to bring up the
>> > discussion of moving to the Spark DataSource V2 API again. Hudi still
>> > uses the V1 API and relies heavily on the RDD API to index, repartition
>> > and so on, given the flexibility of the RDD API. The V2 API, however,
>> > eliminates RDD usage and introduces the CatalogPlugin mechanism for
>> > managing Hudi tables, along with totally new writing and reading
>> > interfaces. This poses some challenges, since Hudi uses RDDs in both the
>> > writing and reading paths. Nevertheless, I think it is still necessary
>> > to integrate Hudi with the V2 API: the V1 API is old, and the V2 API
>> > brings optimizations such as more pushdown filters on the query side to
>> > accelerate queries when integrating with RFC-27 [2].
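>> >
>> > For example, the CatalogPlugin mechanism lets Spark resolve and manage
>> > tables natively; a rough sketch (the class name and method bodies below
>> > are hypothetical, and the class is left abstract since the remaining
>> > TableCatalog methods are omitted):
>> >
>> >   import org.apache.spark.sql.connector.catalog.Identifier;
>> >   import org.apache.spark.sql.connector.catalog.Table;
>> >   import org.apache.spark.sql.connector.catalog.TableCatalog;
>> >
>> >   // Registered via spark.sql.catalog.hudi_catalog=<this class>; Spark
>> >   // then routes "hudi_catalog.db.tbl" DDL/DML through these methods.
>> >   public abstract class HoodieTableCatalog implements TableCatalog {
>> >     @Override public Table loadTable(Identifier ident) {
>> >       // resolve the Hudi table's metadata from its path/metastore entry
>> >       return null;
>> >     }
>> >     // createTable/alterTable/dropTable/renameTable/listTables plus
>> >     // CatalogPlugin's initialize()/name() are omitted in this sketch.
>> >   }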
>> >
>> > And here is the work I think we should do when moving to the V2 API:
>> >
>> > 1. Integrate with the V2 writing interface (the bulk_insert row path is
>> > already implemented, but upsert/insert operations still fall back to the
>> > V1 writing code path)
>> > 2. Integrate with the V2 reading interface
>> > 3. Introduce a CatalogPlugin to manage Hudi tables
>> > 4. Fully adopt the V2 writing interface (use Iterator<InternalRow>; this
>> > may need some refactoring of HoodieSparkWriteClient to make precombining,
>> > indexing, etc. work correctly; see the sketch after this list).
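>> >
>> > To make point 4 concrete, a rough sketch of the shape the per-task write
>> > path could take (precombine, tagLocation and upsertPartition are
>> > hypothetical names for the refactored client pieces):
>> >
>> >   // Hypothetical body of a V2 DataWriter task: the client consumes an
>> >   // Iterator<InternalRow> instead of a JavaRDD<HoodieRecord>.
>> >   void writeTask(Iterator<InternalRow> rows, int partitionId) {
>> >     Iterator<InternalRow> deduped = precombine(rows);      // precombine by key
>> >     Iterator<HoodieRecord> tagged = tagLocation(deduped);  // index lookup/tag
>> >     writeClient.upsertPartition(tagged, partitionId);      // perform the write
>> >   }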
>> >
>> > Please add any other work not mentioned above; I would love to hear
>> > other opinions and feedback from the community. There is already an
>> > umbrella ticket tracking DataSource V2 [3], and I will put up an RFC
>> > with more details. You can also join the #spark-datasource-v2 channel in
>> > the Hudi Slack for more discussion.
>> >
>> > [1] https://lists.apache.org/thread.html/r0411d53b46d8bb2a57c697e295c83a274fa0bc817a2a8ca8eb103a3d%40%3Cdev.hudi.apache.org%3E
>> > [2] https://cwiki.apache.org/confluence/display/HUDI/RFC-27+Data+skipping+index+to+improve+query+performance
>> > [3] https://issues.apache.org/jira/browse/HUDI-1297
>> >
>> >
>> >
>> > Thanks
>> > Leesf
>> >
>>
