Re: Seeking Guidance on Apache Hudi Integration and Best Practices for the Apache Gravitino

He Qi Fri, 30 Aug 2024 03:08:48 -0700

Maybe you can give more background about Gravitino.


On 2024/08/30 07:50:31 Minghuang Li wrote:
> Hello Hudi Devs,
> 
> First and foremost, I would like to express my admiration for the Apache Hudi 
> project. The innovation and robust features you've brought to data lake 
> technology management are truly impressive and are greatly valued by the 
> developer community.
> 
> I'm currently integrating Apache Hudi into the Apache Gravitino[1] project 
> to manage data lake metadata more efficiently. We plan to implement a Hudi 
> catalog[2] in Gravitino, and I am reaching out for advice to ensure we align 
> with Hudi's best practices and future direction.
> 
> Through my research into the Hudi project, I have noted the current state of 
> metadata management (please correct me if I am wrong):
> 
>       1. Hudi does not currently offer a unified catalog interface 
> specification (for instance, a unified interface for table metadata). The 
> existing HoodieTable seems designed for reading and writing table data, not 
> for metadata.
>       2. Hudi provides various sync tools that can sync metadata to an 
> external catalog post-data write. Although they implement the 
> HoodieMetaSyncOperations interface, it does not offer Hudi database and table 
> abstractions, and seems unable to guarantee consistency (e.g., data write 
> succeeds but metadata sync fails).
> 
> Based on these observations, there are a few things I’m hoping to get your 
> insights on:
> 
> Catalog Interface: Is there a stable and unified catalog interface in Hudi 
> that we can use to ensure compatibility across different Hudi versions? If 
> such an interface exists, could you point me towards some documentation or 
> examples? If not, what approach would you recommend for unifying access to 
> Hudi metadata?
> 
> Future Developments: Are there any plans for official catalog management 
> features in Hudi? We want to ensure our implementation is future-proof and 
> would appreciate any details on upcoming enhancements that might impact 
> catalog management.
> 
> Engine Support: Gravitino supports Spark versions 3.3, 3.4, and 3.5. 
> Currently, only the latest version of Hudi (0.15) supports Spark 3.5. I am 
> concerned that developing on this version might introduce stability and 
> compatibility issues. Additionally, Gravitino's Spark plugin is based on the 
> Spark v2 interface, while Hudi's Spark support uses the v1 interface. I've 
> seen plans in the community about supporting Spark v2; could you provide a 
> timeline for this? This will also determine how Gravitino's Spark plugin will 
> implement Hudi querying moving forward.
> 
> I would greatly appreciate any guidance and support the Hudi community can 
> offer. Your insights would be invaluable in ensuring the successful 
> integration of Hudi into our project. Thank you very much for your time and 
> assistance!
> 
> Best regards,
> Minghuang Li
> 
> [1] https://github.com/apache/gravitino
> [2] https://lists.apache.org/thread/bmz4xsv2ogpccy5wtopyy9hp1cot317b
> 
>
