Hello Hudi Devs,

First and foremost, I would like to express my admiration for the Apache Hudi project. The innovation and robust features you've brought to data lake management are truly impressive and greatly valued by the developer community.
I'm currently integrating Apache Hudi into the Apache Gravitino[1] project to manage data lake metadata more efficiently. We plan to implement a Hudi catalog[2] in Gravitino, and I am reaching out for advice to ensure we align with Hudi's best practices and future direction.

Through my research into the Hudi project, I have noted the following about the current state of metadata management (please correct me if I am wrong):

1. Hudi does not currently offer a unified catalog interface specification (for instance, a unified interface for table metadata; the existing HoodieTable seems designed for table data read/write, not metadata).

2. Hudi provides various sync tools that can sync metadata to an external catalog after a data write. Although they implement the HoodieMetaSyncOperations interface, this does not offer Hudi database and table abstractions, and it seems unable to guarantee consistency (e.g., the data write succeeds but the metadata sync fails).

Based on these observations, here are a few things I'm hoping to get your insights on:

Catalog Interface: Is there a stable, unified catalog interface in Hudi that we can use to ensure compatibility across different Hudi versions? If such an interface exists, could you point me to documentation or examples? If not, what approach would you recommend for unifying access to Hudi metadata? (A rough sketch of the kind of interface we have in mind is appended after the references below.)

Future Developments: Are there any plans for official catalog management features in Hudi? We want to ensure our implementation is future-proof and would appreciate any details on upcoming enhancements that might affect catalog management.

Engine Support: Gravitino supports Spark 3.3, 3.4, and 3.5. Currently, only the latest Hudi release (0.15) supports Spark 3.5, and I am concerned that developing against this version might introduce stability and compatibility issues. Additionally, Gravitino's Spark plugin is based on the Spark v2 interface, while Hudi's Spark support uses the v1 interface. I've seen plans in the community about supporting the v2 interface; could you share a timeline for this? This will also determine how Gravitino's Spark plugin implements Hudi querying going forward.

I would greatly appreciate any guidance and support the Hudi community can offer. Your insights would be invaluable in ensuring a successful integration of Hudi into our project.

Thank you very much for your time and assistance!

Best regards,
Minghuang Li

[1] https://github.com/apache/gravitino
[2] https://lists.apache.org/thread/bmz4xsv2ogpccy5wtopyy9hp1cot317b
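P.S. To make the "Catalog Interface" question more concrete, below is a rough, purely illustrative Java sketch of the kind of unified, engine-agnostic metadata access we would like to build the Gravitino Hudi catalog on. None of these types or methods exist in Hudi today as far as I know; the names (HudiCatalogOperations, HudiTableInfo) and the field set are hypothetical and only meant to show the shape of the abstraction we are after.

    import java.util.List;
    import java.util.Map;

    /**
     * Hypothetical sketch only -- not an existing Hudi API.
     * Roughly the read-only metadata surface a Gravitino Hudi catalog
     * would like to rely on, independent of any engine or sync tool.
     */
    public interface HudiCatalogOperations {

      /** List the Hudi databases visible to this catalog. */
      List<String> listDatabases();

      /** List the Hudi tables under a database. */
      List<String> listTables(String databaseName);

      /** Load table-level metadata: schema, partitioning, table type, properties. */
      HudiTableInfo loadTable(String databaseName, String tableName);

      /** Minimal table metadata holder; the field set is illustrative only. */
      class HudiTableInfo {
        public String name;
        public String location;
        public String tableType;              // e.g. COPY_ON_WRITE / MERGE_ON_READ
        public List<String> partitionFields;
        public Map<String, String> properties;
      }
    }

If something along these lines already exists, or if a different abstraction is planned, we would be happy to adapt to it.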