+1. I have sent a message on the Trino Slack; really appreciate the new Trino plugin/connector. https://trinodb.slack.com/archives/CP1MUNEUX/p1623838591370200
Looking forward to the RFC and more discussion.

On 2021/10/17 06:06:09 sagar sumit wrote:
> Dear Hudi Community,
>
> I would like to propose the development of a new Trino plugin/connector
> for Hudi.
>
> Today, Hudi supports snapshot queries on Copy-On-Write (COW) tables and
> read-optimized queries on Merge-On-Read tables with Trino, through the
> input-format-based integration in the Hive connector [1]. This approach
> has known performance limitations with very large tables, which have
> since been fixed in PrestoDB [2]. We are working on replicating the same
> fixes in Trino as well [3].
>
> However, as Hudi keeps getting better, a new plugin providing access to
> Hudi data and metadata will help unlock those capabilities for Trino
> users. To name a few such benefits: metadata-based listing, full schema
> evolution, etc. [4]. Moreover, a separate Hudi connector could evolve
> independently, without having to hack or risk breaking the Hive
> connector.
>
> A separate connector also falls in line with our vision [5] of a
> standalone timeline server or a lake cache to balance the trade-off
> between writing and querying. Imagine users having read and write access
> to data and metadata in Hudi directly through Trino.
>
> I did some prototyping to get snapshot queries on a Hudi COW table
> working with a new plugin [6], and I feel the effort is worth it. The
> high-level approach is to implement the connector SPI [7] provided by
> Trino, such as:
> a) HudiMetadata implements ConnectorMetadata to fetch table metadata.
> b) HudiSplit and HudiSplitManager implement ConnectorSplit and
>    ConnectorSplitManager to produce logical units of data partitioning,
>    so that Trino can parallelize reads and writes.
>
> Let me know your thoughts on the proposal. I can draft an RFC for the
> detailed design discussion once we have consensus.
>
> Regards,
> Sagar
>
> References:
> [1] https://github.com/prestodb/presto/commits?author=vinothchandar
> [2] https://prestodb.io/blog/2020/08/04/prestodb-and-hudi
> [3] https://github.com/trinodb/trino/pull/9641
> [4] https://cwiki.apache.org/confluence/display/HUDI/RFC+-+33++Hudi+supports+more+comprehensive+Schema+Evolution
> [5] https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform#timeline-metaserver
> [6] https://github.com/codope/trino/tree/hudi-plugin
> [7] https://trino.io/docs/current/develop/connectors.html
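To make point (b) a bit more concrete for the discussion, here is a rough sketch of what a HudiSplit could look like against the Trino connector SPI [7] as it stands today. This is only my own illustration, not code from the prototype in [6]: the class name HudiSplit matches the proposal, but the fields (path, start, length, addresses) are assumptions for the example.

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;
import io.trino.spi.HostAddress;
import io.trino.spi.connector.ConnectorSplit;

import java.util.List;

import static java.util.Objects.requireNonNull;

// Illustrative only: one split = one slice of a Hudi base file that a
// single Trino worker can read independently of the others.
public class HudiSplit
        implements ConnectorSplit
{
    private final String path;   // base file (e.g. Parquet) location
    private final long start;    // byte offset where this split begins
    private final long length;   // number of bytes in this split
    private final List<HostAddress> addresses;

    @JsonCreator
    public HudiSplit(
            @JsonProperty("path") String path,
            @JsonProperty("start") long start,
            @JsonProperty("length") long length,
            @JsonProperty("addresses") List<HostAddress> addresses)
    {
        this.path = requireNonNull(path, "path is null");
        this.start = start;
        this.length = length;
        this.addresses = List.copyOf(requireNonNull(addresses, "addresses is null"));
    }

    @JsonProperty
    public String getPath()
    {
        return path;
    }

    @JsonProperty
    public long getStart()
    {
        return start;
    }

    @JsonProperty
    public long getLength()
    {
        return length;
    }

    @Override
    public boolean isRemotelyAccessible()
    {
        // Hudi data typically lives on HDFS or object storage,
        // so any worker can read this split
        return true;
    }

    @Override
    @JsonProperty
    public List<HostAddress> getAddresses()
    {
        return addresses;
    }

    @Override
    public Object getInfo()
    {
        // Opaque debugging info surfaced by Trino; the file path is enough here
        return path;
    }
}

A HudiSplitManager would then enumerate the table's file slices (ideally via Hudi's metadata table rather than a raw file listing) and return batches of such splits, which is what lets Trino parallelize the scan across workers.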