Re: Seeking Guidance on Apache Hudi Integration and Best Practices for the Apache Gravitino

Minghuang Li Sun, 01 Sep 2024 20:20:34 -0700

Hi He,

Thank you for your reminder.


Apache Gravitino is a high-performance, geo-distributed, and federated metadata 
lake. It manages metadata directly in different sources, types, and regions and 
provides users with unified metadata access for data and AI assets. It was 
donated to the ASF by Datastrato[1] in June 2024.

Gravitino currently supports metadata management for Apache Hive, Apache 
Iceberg, Apache Paimon, Apache Doris, Apache Kafka, etc. As the first official 
version of Gravitino after its donation to the ASF is still being released, its 
documentation has not been fully migrated to the ASF. More detailed information 
about Gravitino can be found in the documentation for version 0.5.1 [2].

Best regards,
Minghuang Li

[1] https://datastrato.ai
[2] https://datastrato.ai/docs/latest

On 2024/08/30 10:08:43 He Qi wrote:
> Maybe you can give more background about Gravitino.
> 
> On 2024/08/30 07:50:31 Minghuang Li wrote:
> > Hello Hudi Devs,
> > 
> > First and foremost, I would like to express my admiration for the Apache 
> > Hudi project. The innovation and robust features you've brought to data 
> > lake technology management are truly impressive and are greatly valued by 
> > the developer community.
> > 
> > I'm currently integrating Apache Hudi into the Apache Gravitino[1] project 
> > to manage data lake metadata more efficiently. We plan to implement a Hudi 
> > catalog[2] in Gravitino, and I am reaching out for advice to ensure we 
> > align with Hudi's best practices and future direction.
> > 
> > Through my research into the Hudi project, I have noted the current state 
> > of metadata management (please correct me if I am wrong):
> > 
> >     1. Hudi does not currently offer a unified catalog interface 
> > specification (for instance, a unified interface for table metadata). The 
> > existing HoodieTable seems designed for table data read/write, not 
> > metadata.
> >     2. Hudi provides various sync tools that can sync metadata to an 
> > external catalog post-data write. Although they implement the 
> > HoodieMetaSyncOperations interface, it does not offer Hudi database and 
> > table abstractions, and seems unable to guarantee consistency (e.g., data 
> > write succeeds but metadata sync fails).
> > 
> > Based on these observations, here are a few things I’m hoping to get your 
> > insights on:
> > 
> > Catalog Interface: Is there a stable and unified catalog interface in Hudi 
> > that we can use to ensure compatibility across different Hudi versions? If 
> > such an interface exists, could you point me towards some documentation or 
> > examples? If not, what approach would you recommend for unifying access to 
> > Hudi metadata?
> > 
> > Future Developments: Are there any plans for official catalog management 
> > features in Hudi? We want to ensure our implementation is future-proof and 
> > would appreciate any details on upcoming enhancements that might impact 
> > catalog management.
> > 
> > Engine Support: Gravitino supports Spark versions 3.3, 3.4, and 3.5. 
> > Currently, only the latest version of Hudi (0.15) supports Spark 3.5. I am 
> > concerned that developing on this version might introduce stability and 
> > compatibility issues. Additionally, Gravitino's Spark plugin is based on 
> > the Spark v2 interface, while Hudi's Spark support uses the v1 interface. 
> > I've seen plans in the community about supporting Spark v2; could you 
> > provide a timeline for this? This will also determine how Gravitino's Spark 
> > plugin will implement Hudi querying moving forward.
> > 
> > I would greatly appreciate any guidance and support the Hudi community can 
> > offer. Your insights would be invaluable in ensuring the successful 
> > integration of Hudi into our project. Thank you very much for your time and 
> > assistance!
> > 
> > Best regards,
> > Minghuang Li
> > 
> > [1] https://github.com/apache/gravitino
> > [2] https://lists.apache.org/thread/bmz4xsv2ogpccy5wtopyy9hp1cot317b
> > 
> > 
>
