+1. I have sent a message on the Trino Slack; really appreciate the new Trino plugin/connector. https://trinodb.slack.com/archives/CP1MUNEUX/p1623838591370200
Looking forward to the RFC and more discussion.

On 2021/10/17 06:06:09 sagar sumit wrote:
> Dear Hudi Community,
>
> I would like to propose the development of a new Trino plugin/connector
> for Hudi.
>
> Today, Hudi supports snapshot queries on Copy-On-Write (COW) tables and
> read-optimized queries on Merge-On-Read tables with Trino, through the
> input-format-based integration in the Hive connector [1]. This approach
> has known performance limitations with very large tables, which have
> since been fixed in PrestoDB [2]. We are working on replicating the same
> fixes in Trino as well [3].
>
> However, as Hudi keeps getting better, a new plugin providing access to
> Hudi data and metadata will help unlock those capabilities for Trino
> users. To name a few such benefits: metadata-based listing, full schema
> evolution, etc. [4]. Moreover, a separate Hudi connector could evolve
> independently, without having to hack or risk breaking the Hive
> connector.
>
> A separate connector also falls in line with our vision [5] of a
> standalone timeline server or a lake cache to balance the trade-off
> between writing and querying. Imagine users having read and write access
> to data and metadata in Hudi directly through Trino.
>
> I did some prototyping to get snapshot queries on a Hudi COW table
> working with a new plugin [6], and I feel the effort is worth it. The
> high-level approach is to implement the connector SPI [7] provided by
> Trino, such as:
> a) HudiMetadata implements ConnectorMetadata to fetch table metadata.
> b) HudiSplit and HudiSplitManager implement ConnectorSplit and
>    ConnectorSplitManager to produce logical units of data partitioning,
>    so that Trino can parallelize reads and writes.
>
> Let me know your thoughts on the proposal. I can draft an RFC for the
> detailed design discussion once we have consensus.
>
> Regards,
> Sagar
>
> References:
> [1] https://github.com/prestodb/presto/commits?author=vinothchandar
> [2] https://prestodb.io/blog/2020/08/04/prestodb-and-hudi
> [3] https://github.com/trinodb/trino/pull/9641
> [4] https://cwiki.apache.org/confluence/display/HUDI/RFC+-+33++Hudi+supports+more+comprehensive+Schema+Evolution
> [5] https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform#timeline-metaserver
> [6] https://github.com/codope/trino/tree/hudi-plugin
> [7] https://trino.io/docs/current/develop/connectors.html
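To make point (b) a bit more concrete for the discussion, here is a rough sketch of what a HudiSplit could look like against the Trino connector SPI [7] as it stands today. This is only my own illustration, not code from the prototype in [6]: the class name HudiSplit matches the proposal, but the fields (path, start, length, addresses) are assumptions for the example.

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;
import io.trino.spi.HostAddress;
import io.trino.spi.connector.ConnectorSplit;

import java.util.List;

import static java.util.Objects.requireNonNull;

// Illustrative only: one split = one slice of a Hudi base file that a
// single Trino worker can read independently of the others.
public class HudiSplit
        implements ConnectorSplit
{
    private final String path;   // base file (e.g. Parquet) location
    private final long start;    // byte offset where this split begins
    private final long length;   // number of bytes in this split
    private final List<HostAddress> addresses;

    @JsonCreator
    public HudiSplit(
            @JsonProperty("path") String path,
            @JsonProperty("start") long start,
            @JsonProperty("length") long length,
            @JsonProperty("addresses") List<HostAddress> addresses)
    {
        this.path = requireNonNull(path, "path is null");
        this.start = start;
        this.length = length;
        this.addresses = List.copyOf(requireNonNull(addresses, "addresses is null"));
    }

    @JsonProperty
    public String getPath()
    {
        return path;
    }

    @JsonProperty
    public long getStart()
    {
        return start;
    }

    @JsonProperty
    public long getLength()
    {
        return length;
    }

    @Override
    public boolean isRemotelyAccessible()
    {
        // Hudi data typically lives on HDFS or object storage,
        // so any worker can read this split
        return true;
    }

    @Override
    @JsonProperty
    public List<HostAddress> getAddresses()
    {
        return addresses;
    }

    @Override
    public Object getInfo()
    {
        // Opaque debugging info surfaced by Trino; the file path is enough here
        return path;
    }
}

A HudiSplitManager would then enumerate the table's file slices (ideally via Hudi's metadata table rather than a raw file listing) and return batches of such splits, which is what lets Trino parallelize the scan across workers.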