Dear Hudi Community,

I would like to propose the development of a new Trino plugin/connector for Hudi.
Today, Hudi supports snapshot queries on Copy-On-Write (COW) tables and read-optimized queries on Merge-On-Read (MOR) tables with Trino, through the input-format-based integration in the Hive connector [1]. This approach has known performance limitations with very large tables, which have since been fixed on PrestoDB [2]. We are working on replicating the same fixes on Trino as well [3]. However, as Hudi keeps getting better, a new plugin providing access to Hudi data and metadata will help unlock those capabilities for Trino users. To name a few benefits: metadata-based listing, full schema evolution, etc. [4].

Moreover, a separate Hudi connector would allow it to evolve independently, without having to hack or risk breaking the Hive connector. A separate connector also falls in line with our vision [5] of a standalone timeline server or a lake cache to balance the tradeoff between writing and querying. Imagine users having read and write access to data and metadata in Hudi directly through Trino.

I did some prototyping to get snapshot queries on a Hudi COW table working with a new plugin [6], and I feel the effort is worth it. The high-level approach is to implement the connector SPI [7] provided by Trino, for example:
a) HudiMetadata implements ConnectorMetadata to fetch table metadata.
b) HudiSplit and HudiSplitManager implement ConnectorSplit and ConnectorSplitManager to produce logical units of data partitioning, so that Trino can parallelize reads and writes.
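To make the split-manager idea above concrete, here is a minimal sketch of how HudiSplit and HudiSplitManager could fit together. Note that this uses simplified stand-in interfaces in place of the real io.trino.spi types so the snippet is self-contained; the actual SPI methods take session, transaction, and table-handle arguments and richer types, and the file paths below are purely illustrative.

```java
import java.util.List;

// Stand-in for io.trino.spi.connector.ConnectorSplit:
// one logical unit of work that a Trino worker can read independently.
interface ConnectorSplit {
    String getPath();
}

// Stand-in for io.trino.spi.connector.ConnectorSplitManager:
// turns a table into a set of splits for parallel reads.
interface ConnectorSplitManager {
    List<ConnectorSplit> getSplits(String tableName);
}

// A Hudi split would carry the location of a base file (plus, for MOR,
// any associated log files) for one file slice.
class HudiSplit implements ConnectorSplit {
    private final String path;

    HudiSplit(String path) {
        this.path = path;
    }

    public String getPath() {
        return path;
    }
}

// A real implementation would consult the Hudi timeline and metadata table
// to enumerate file slices; here we fabricate two base files for illustration.
class HudiSplitManager implements ConnectorSplitManager {
    public List<ConnectorSplit> getSplits(String tableName) {
        return List.of(
                new HudiSplit(tableName + "/part-0.parquet"),
                new HudiSplit(tableName + "/part-1.parquet"));
    }
}

public class HudiConnectorSketch {
    public static void main(String[] args) {
        ConnectorSplitManager splitManager = new HudiSplitManager();
        for (ConnectorSplit split : splitManager.getSplits("s3://bucket/trips")) {
            System.out.println(split.getPath());
        }
    }
}
```

In the real plugin, a HudiPlugin class would implement io.trino.spi.Plugin and expose a ConnectorFactory that wires HudiMetadata and HudiSplitManager into a Connector; the sketch only shows the split-production half of that wiring.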
Let me know your thoughts on the proposal. I can draft an RFC for the detailed design discussion once we have consensus.

Regards,
Sagar

References:
[1] https://github.com/prestodb/presto/commits?author=vinothchandar
[2] https://prestodb.io/blog/2020/08/04/prestodb-and-hudi
[3] https://github.com/trinodb/trino/pull/9641
[4] https://cwiki.apache.org/confluence/display/HUDI/RFC+-+33++Hudi+supports+more+comprehensive+Schema+Evolution
[5] https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform#timeline-metaserver
[6] https://github.com/codope/trino/tree/hudi-plugin
[7] https://trino.io/docs/current/develop/connectors.html
