Hi all,

I saw the community discuss moving to the DataSource V2 API before [1],
but there has been no further progress, so I want to bring the discussion
up again.

Hudi still uses the V1 API and relies heavily on the RDD API for indexing,
repartitioning and so on, given the flexibility that RDDs provide. The V2
API, however, eliminates RDD usage, introduces the CatalogPlugin mechanism
for managing Hudi tables, and defines completely new writing and reading
interfaces. That poses some challenges, since Hudi uses RDDs in both the
write and read paths. Even so, I think integrating Hudi with the V2 API is
necessary: the V1 API is quite old, and the V2 API brings optimizations
such as richer filter pushdown on the query side, which would accelerate
queries when combined with RFC-27 [2].
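
To make the query-side benefit concrete, below is a rough sketch of the
V2 read path, assuming Spark 3's connector API; HoodieScanBuilder is an
illustrative name, not existing code. The filters Spark hands over through
SupportsPushDownFilters are exactly what a data skipping index (RFC-27)
could consume for file pruning:

import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownFilters}
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType

// Hypothetical ScanBuilder: Spark pushes query filters down through
// SupportsPushDownFilters, and a data skipping index could use them to
// prune files before the scan is built.
class HoodieScanBuilder extends ScanBuilder with SupportsPushDownFilters {
  private var pushed: Array[Filter] = Array.empty

  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    pushed = filters
    // returning all filters asks Spark to still re-evaluate them after the scan
    filters
  }

  override def pushedFilters(): Array[Filter] = pushed

  override def build(): Scan = new Scan {
    // a real implementation would return the pruned Hudi file slices here
    override def readSchema(): StructType = new StructType()
  }
}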

Here is the work I think we need to do to move to the V2 API:

1. Integrate with the V2 writing interface (the bulk_insert row path is
already implemented, but upsert/insert operations still fall back to the
V1 writing code path).
2. Integrate with the V2 reading interface.
3. Introduce a CatalogPlugin to manage Hudi tables (see the catalog
example after this list).
4. Fully adopt the V2 writing interface (write via Iterator<InternalRow>,
which may require some refactoring of HoodieSparkWriteClient so that
precombining, indexing, etc. keep working; see the write-path sketch
after this list).
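
For items 1 and 4, here is a minimal sketch of the per-task write path
under the V2 interface, again assuming Spark 3's connector API;
HoodieDataWriter and HoodieBatchWrite are illustrative names only, and the
real precombine/index/upsert logic would stay in HoodieSparkWriteClient.
The point is that each task receives records one by one as InternalRow, so
Hudi can no longer assume an up-front RDD to shuffle and index:

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.write.{BatchWrite, DataWriter, DataWriterFactory, PhysicalWriteInfo, WriterCommitMessage}

// Hypothetical per-task writer: Spark feeds each task a stream of
// InternalRow through write(), which is why precombining and index lookups
// would need to work row-by-row or in a pre-write shuffle planned by Hudi.
class HoodieDataWriter(partitionId: Int, taskId: Long) extends DataWriter[InternalRow] {
  override def write(record: InternalRow): Unit = {
    // buffer/route the record to the right file group here
  }
  override def commit(): WriterCommitMessage = new WriterCommitMessage {}
  override def abort(): Unit = {}
  override def close(): Unit = {}
}

class HoodieBatchWrite extends BatchWrite {
  override def createBatchWriterFactory(info: PhysicalWriteInfo): DataWriterFactory =
    new DataWriterFactory {
      override def createWriter(partitionId: Int, taskId: Long): DataWriter[InternalRow] =
        new HoodieDataWriter(partitionId, taskId)
    }
  // the driver-side commit is where Hudi would finalize the instant on the timeline
  override def commit(messages: Array[WriterCommitMessage]): Unit = {}
  override def abort(messages: Array[WriterCommitMessage]): Unit = {}
}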
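
For item 3, the CatalogPlugin mechanism lets users register a catalog
through Spark configuration and then manage Hudi tables with plain SQL.
The class name org.apache.hudi.catalog.HoodieCatalog below is
hypothetical; only the spark.sql.catalog.* registration mechanism itself
comes from Spark:

import org.apache.spark.sql.SparkSession

// spark.sql.catalog.<name> is how Spark 3 discovers CatalogPlugin
// implementations; HoodieCatalog stands in for a future Hudi TableCatalog.
val spark = SparkSession.builder()
  .appName("hudi-v2-catalog-sketch")
  .config("spark.sql.catalog.hudi_catalog", "org.apache.hudi.catalog.HoodieCatalog")
  .getOrCreate()

// With such a catalog plugged in, Hudi tables could be created, listed and
// dropped through standard SQL routed to the plugin:
spark.sql("CREATE TABLE hudi_catalog.hudi_db.trips (uuid STRING, ts BIGINT) USING hudi")
spark.sql("SHOW TABLES IN hudi_catalog.hudi_db")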

Please add any work that is not mentioned above; I would love to hear
opinions and feedback from the community. There is already an umbrella
ticket to track DataSource V2 [3], and I will put up an RFC with more
details. You are also welcome to join the #spark-datasource-v2 channel in
the Hudi Slack for more discussion.

[1]
https://lists.apache.org/thread.html/r0411d53b46d8bb2a57c697e295c83a274fa0bc817a2a8ca8eb103a3d%40%3Cdev.hudi.apache.org%3E
[2]
https://cwiki.apache.org/confluence/display/HUDI/RFC-27+Data+skipping+index+to+improve+query+performance
[3] https://issues.apache.org/jira/browse/HUDI-1297



Thanks
Leesf
