Hello Hudi Devs,
First and foremost, I would like to express my admiration for the Apache Hudi
project. The innovation and robust features you've brought to data lake
management are truly impressive and greatly valued by the developer community.
I'm currently integrating Apache Hudi into the Apache Gravitino[1] project to
manage data lake metadata more efficiently. We plan to implement a Hudi
catalog[2] in Gravitino, and I am reaching out for advice to ensure we align
with Hudi's best practices and future direction.
Through my research into the Hudi project, I have noted the following about
the current state of metadata management (please correct me if I am wrong):
1. Hudi does not currently offer a unified catalog interface
specification (for instance, a unified interface for table metadata; the
existing HoodieTable appears designed for table data read/write rather than
metadata access).
2. Hudi provides various sync tools that can sync metadata to an
external catalog after the data write. Although these tools implement the
HoodieMetaSyncOperations interface, that interface does not offer Hudi
database and table abstractions, and it does not appear to guarantee
consistency (e.g., the data write can succeed while the metadata sync fails;
see the sketch after this list).
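To make the consistency concern concrete, here is a minimal sketch (Scala) of
how the sync is typically wired up today: Hive/HMS sync is configured as
options on the Spark write, so it runs as a side effect of the data write. The
table name, record key, paths, and metastore URI below are hypothetical
placeholders, and the exact option set may differ across Hudi versions.

    // Minimal sketch: catalog sync piggybacks on the data write, so a failed
    // sync can leave the data committed but the external catalog stale.
    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder()
      .appName("hudi-meta-sync-sketch")
      .getOrCreate()

    val df = spark.read.json("/tmp/input")  // hypothetical input data

    df.write.format("hudi")
      .option("hoodie.table.name", "demo_table")
      .option("hoodie.datasource.write.recordkey.field", "id")
      .option("hoodie.datasource.write.precombine.field", "ts")
      // Hive/HMS sync runs after the commit, inside the same write call
      .option("hoodie.datasource.hive_sync.enable", "true")
      .option("hoodie.datasource.hive_sync.mode", "hms")
      .option("hoodie.datasource.hive_sync.metastore.uris", "thrift://metastore:9083")
      .option("hoodie.datasource.hive_sync.database", "demo_db")
      .option("hoodie.datasource.hive_sync.table", "demo_table")
      .mode(SaveMode.Append)
      .save("/tmp/hudi/demo_table")

If the write commits and the subsequent sync step throws, the caller has no
transactional way to roll the two back together, which is what motivates the
question about a first-class catalog abstraction.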
Based on these observations, here are a few things I'm hoping to get your
insights on:
Catalog Interface: Is there a stable and unified catalog interface in Hudi that
we can use to ensure compatibility across different Hudi versions? If such an
interface exists, could you point me towards some documentation or examples? If
not, what approach would you recommend for unifying access to Hudi metadata?
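To clarify what we mean by "unifying access", below is a purely hypothetical
sketch (Scala) of the metadata-only abstraction Gravitino's Hudi catalog would
ideally build on. None of these traits, types, or methods exist in Hudi today;
they only illustrate the shape of what we are asking about.

    // Hypothetical metadata-only view of Hudi databases/tables; no data
    // read/write paths involved. Names and fields are illustrative only.
    trait HudiCatalogMetadata {
      def listDatabases(): Seq[String]
      def listTables(database: String): Seq[String]
      def loadTableMetadata(database: String, table: String): HudiTableMetadata
    }

    // Illustrative, engine-agnostic table metadata record.
    case class HudiTableMetadata(
        database: String,
        name: String,
        basePath: String,
        tableType: String,            // e.g. COPY_ON_WRITE / MERGE_ON_READ
        schemaJson: String,
        properties: Map[String, String]
    )

A stable interface along these lines (or pointers to whatever existing classes
come closest) would let Gravitino read and manage Hudi metadata without going
through an engine-specific read/write path.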
Future Developments: Are there any plans for official catalog management
features in Hudi? We want to ensure our implementation is future-proof and
would appreciate any details on upcoming enhancements that might impact catalog
management.
Engine Support: Gravitino supports Spark versions 3.3, 3.4, and 3.5.
Currently, only the latest version of Hudi (0.15) supports Spark 3.5, and I am
concerned that developing against this version might introduce stability and
compatibility issues. Additionally, Gravitino's Spark plugin is based on the
Spark DataSource V2 interface, while Hudi's Spark support still uses the V1
interface. I've seen plans in the community about supporting DataSource V2;
could you provide a timeline for this? The answer will also determine how
Gravitino's Spark plugin implements Hudi querying going forward.
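For reference, the integration point Gravitino's Spark plugin expects is
Spark's connector catalog (DataSource V2) API. Below is a hypothetical,
non-working skeleton (Scala) of what a V2-based Hudi catalog could look like;
the class name and placeholder bodies are ours, not an existing Hudi API, and
they only illustrate why the V2 timeline matters to us.

    import java.util
    import org.apache.spark.sql.connector.catalog.{Identifier, Table, TableCatalog, TableChange}
    import org.apache.spark.sql.connector.expressions.Transform
    import org.apache.spark.sql.types.StructType
    import org.apache.spark.sql.util.CaseInsensitiveStringMap

    // Hypothetical skeleton of a Spark DataSource V2 catalog plugin for Hudi.
    // Every body is a placeholder; Hudi does not ship such a catalog today.
    class HudiV2Catalog extends TableCatalog {
      private var catalogName: String = _

      override def initialize(name: String, options: CaseInsensitiveStringMap): Unit = {
        catalogName = name
        // Would read Hudi/metastore connection options here.
      }

      override def name(): String = catalogName

      override def listTables(namespace: Array[String]): Array[Identifier] =
        throw new UnsupportedOperationException("placeholder: list Hudi tables")

      override def loadTable(ident: Identifier): Table =
        throw new UnsupportedOperationException("placeholder: load Hudi table metadata")

      override def createTable(
          ident: Identifier,
          schema: StructType,
          partitions: Array[Transform],
          properties: util.Map[String, String]): Table =
        throw new UnsupportedOperationException("placeholder: create Hudi table")

      override def alterTable(ident: Identifier, changes: TableChange*): Table =
        throw new UnsupportedOperationException("placeholder: alter Hudi table")

      override def dropTable(ident: Identifier): Boolean =
        throw new UnsupportedOperationException("placeholder: drop Hudi table")

      override def renameTable(oldIdent: Identifier, newIdent: Identifier): Unit =
        throw new UnsupportedOperationException("placeholder: rename Hudi table")
    }

If Hudi exposes (or plans to expose) a V2 catalog of this shape, Gravitino's
Spark plugin could delegate to it directly instead of bridging through the V1
data source path.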
I would greatly appreciate any guidance and support the Hudi community can
offer. Your insights would be invaluable in ensuring the successful integration
of Hudi into our project. Thank you very much for your time and assistance!
Best regards,
Minghuang Li
[1] https://github.com/apache/gravitino
[2] https://lists.apache.org/thread/bmz4xsv2ogpccy5wtopyy9hp1cot317b