Hello Hudi Devs,
First and foremost, I would like to express my admiration for the Apache Hudi
project. The innovation and robust features you've brought to data lake
management are truly impressive and greatly valued by the developer community.
I'm currently integrating Apache Hudi into the Apache Gravitino[1] project to
manage data lake metadata more efficiently. We plan to implement a Hudi
catalog[2] in Gravitino, and I am reaching out for advice to ensure we align
with Hudi's best practices and future direction.
Through my research into the Hudi project, I have noted the following about
the current state of metadata management (please correct me if I am wrong):
1. Hudi does not currently offer a unified catalog interface
specification (for instance, a unified interface for table metadata; the
existing HoodieTable appears designed for table data read/write rather than
metadata access).
2. Hudi provides various sync tools that can sync metadata to an
external catalog after the data write. Although these tools implement the
HoodieMetaSyncOperations interface, that interface does not offer Hudi
database and table abstractions, and it does not appear to guarantee
consistency (e.g., the data write can succeed while the metadata sync fails;
see the sketch after this list).
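To make the consistency concern concrete, here is a minimal sketch (Scala) of
how the sync is typically wired up today: Hive/HMS sync is configured as
options on the Spark write, so it runs as a side effect of the data write. The
table name, record key, paths, and metastore URI below are hypothetical
placeholders, and the exact option set may differ across Hudi versions.

    // Minimal sketch: catalog sync piggybacks on the data write, so a failed
    // sync can leave the data committed but the external catalog stale.
    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder()
      .appName("hudi-meta-sync-sketch")
      .getOrCreate()

    val df = spark.read.json("/tmp/input")  // hypothetical input data

    df.write.format("hudi")
      .option("hoodie.table.name", "demo_table")
      .option("hoodie.datasource.write.recordkey.field", "id")
      .option("hoodie.datasource.write.precombine.field", "ts")
      // Hive/HMS sync runs after the commit, inside the same write call
      .option("hoodie.datasource.hive_sync.enable", "true")
      .option("hoodie.datasource.hive_sync.mode", "hms")
      .option("hoodie.datasource.hive_sync.metastore.uris", "thrift://metastore:9083")
      .option("hoodie.datasource.hive_sync.database", "demo_db")
      .option("hoodie.datasource.hive_sync.table", "demo_table")
      .mode(SaveMode.Append)
      .save("/tmp/hudi/demo_table")

If the write commits and the subsequent sync step throws, the caller has no
transactional way to roll the two back together, which is what motivates the
question about a first-class catalog abstraction.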
Based on these observations, here are a few things I'm hoping to get your
insights on:
Catalog Interface: Is there a stable and unified catalog interface in Hudi that
we can use to ensure compatibility across different Hudi versions? If such an
interface exists, could you point me towards some documentation or examples? If
not, what approach would you recommend for unifying access to Hudi metadata?
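To clarify what we mean by "unifying access", below is a purely hypothetical
sketch (Scala) of the metadata-only abstraction Gravitino's Hudi catalog would
ideally build on. None of these traits, types, or methods exist in Hudi today;
they only illustrate the shape of what we are asking about.

    // Hypothetical metadata-only view of Hudi databases/tables; no data
    // read/write paths involved. Names and fields are illustrative only.
    trait HudiCatalogMetadata {
      def listDatabases(): Seq[String]
      def listTables(database: String): Seq[String]
      def loadTableMetadata(database: String, table: String): HudiTableMetadata
    }

    // Illustrative, engine-agnostic table metadata record.
    case class HudiTableMetadata(
        database: String,
        name: String,
        basePath: String,
        tableType: String,            // e.g. COPY_ON_WRITE / MERGE_ON_READ
        schemaJson: String,
        properties: Map[String, String]
    )

A stable interface along these lines (or pointers to whatever existing classes
come closest) would let Gravitino read and manage Hudi metadata without going
through an engine-specific read/write path.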
Future Developments: Are there any plans for official catalog management
features in Hudi? We want to ensure our implementation is future-proof and
would appreciate any details on upcoming enhancements that might impact catalog
management.
Engine Support: Gravitino supports Spark versions 3.3, 3.4, and 3.5.
Currently, only the latest version of Hudi (0.15) supports Spark 3.5, and I am
concerned that developing against this version might introduce stability and
compatibility issues. Additionally, Gravitino's Spark plugin is based on the
Spark DataSource V2 interface, while Hudi's Spark support still uses the V1
interface. I've seen plans in the community about supporting DataSource V2;
could you provide a timeline for this? The answer will also determine how
Gravitino's Spark plugin implements Hudi querying going forward.
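For reference, the integration point Gravitino's Spark plugin expects is
Spark's connector catalog (DataSource V2) API. Below is a hypothetical,
non-working skeleton (Scala) of what a V2-based Hudi catalog could look like;
the class name and placeholder bodies are ours, not an existing Hudi API, and
they only illustrate why the V2 timeline matters to us.

    import java.util
    import org.apache.spark.sql.connector.catalog.{Identifier, Table, TableCatalog, TableChange}
    import org.apache.spark.sql.connector.expressions.Transform
    import org.apache.spark.sql.types.StructType
    import org.apache.spark.sql.util.CaseInsensitiveStringMap

    // Hypothetical skeleton of a Spark DataSource V2 catalog plugin for Hudi.
    // Every body is a placeholder; Hudi does not ship such a catalog today.
    class HudiV2Catalog extends TableCatalog {
      private var catalogName: String = _

      override def initialize(name: String, options: CaseInsensitiveStringMap): Unit = {
        catalogName = name
        // Would read Hudi/metastore connection options here.
      }

      override def name(): String = catalogName

      override def listTables(namespace: Array[String]): Array[Identifier] =
        throw new UnsupportedOperationException("placeholder: list Hudi tables")

      override def loadTable(ident: Identifier): Table =
        throw new UnsupportedOperationException("placeholder: load Hudi table metadata")

      override def createTable(
          ident: Identifier,
          schema: StructType,
          partitions: Array[Transform],
          properties: util.Map[String, String]): Table =
        throw new UnsupportedOperationException("placeholder: create Hudi table")

      override def alterTable(ident: Identifier, changes: TableChange*): Table =
        throw new UnsupportedOperationException("placeholder: alter Hudi table")

      override def dropTable(ident: Identifier): Boolean =
        throw new UnsupportedOperationException("placeholder: drop Hudi table")

      override def renameTable(oldIdent: Identifier, newIdent: Identifier): Unit =
        throw new UnsupportedOperationException("placeholder: rename Hudi table")
    }

If Hudi exposes (or plans to expose) a V2 catalog of this shape, Gravitino's
Spark plugin could delegate to it directly instead of bridging through the V1
data source path.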
I would greatly appreciate any guidance and support the Hudi community can
offer. Your insights would be invaluable in ensuring the successful integration
of Hudi into our project. Thank you very much for your time and assistance!
Best regards,
Minghuang Li
[1] https://github.com/apache/gravitino
[2] https://lists.apache.org/thread/bmz4xsv2ogpccy5wtopyy9hp1cot317b