When can Trino support snapshot queries on Merge-On-Read tables?

On Mon, Oct 18, 2021 at 9:06 PM 周康 <[email protected]> wrote:
> +1. I have sent a message on the Trino Slack; I really appreciate the new
> Trino plugin/connector.
> https://trinodb.slack.com/archives/CP1MUNEUX/p1623838591370200
>
> Looking forward to the RFC and more discussion.
>
> On 2021/10/17 06:06:09 sagar sumit wrote:
> > Dear Hudi Community,
> >
> > I would like to propose the development of a new Trino plugin/connector
> > for Hudi.
> >
> > Today, Hudi supports snapshot queries on Copy-On-Write (COW) tables and
> > read-optimized queries on Merge-On-Read (MOR) tables with Trino, through
> > the input-format-based integration in the Hive connector [1]. This
> > approach has known performance limitations with very large tables, which
> > have since been fixed in PrestoDB [2]. We are working on replicating the
> > same fixes on Trino as well [3].
> >
> > However, as Hudi keeps getting better, a new plugin to provide access to
> > Hudi data and metadata will help unlock those capabilities for Trino
> > users. To name a few benefits: metadata-based listing, full schema
> > evolution, etc. [4]. Moreover, a separate Hudi connector would allow its
> > independent evolution without having to worry about hacking/breaking the
> > Hive connector.
> >
> > A separate connector also falls in line with our vision [5] when we
> > think of a standalone timeline server or a lake cache to balance the
> > tradeoff between writing and querying. Imagine users having read and
> > write access to data and metadata in Hudi directly through Trino.
> >
> > I did some prototyping to get snapshot queries on a Hudi COW table
> > working with a new plugin [6], and I feel the effort is worth it. The
> > high-level approach is to implement the connector SPI [7] provided by
> > Trino, such as:
> > a) HudiMetadata implements ConnectorMetadata to fetch table metadata.
> > b) HudiSplit and HudiSplitManager implement ConnectorSplit and
> > ConnectorSplitManager to produce logical units of data partitioning, so
> > that Trino can parallelize reads and writes.
> >
> > Let me know your thoughts on the proposal. I can draft an RFC for the
> > detailed design discussion once we have consensus.
> >
> > Regards,
> > Sagar
> >
> > References:
> > [1] https://github.com/prestodb/presto/commits?author=vinothchandar
> > [2] https://prestodb.io/blog/2020/08/04/prestodb-and-hudi
> > [3] https://github.com/trinodb/trino/pull/9641
> > [4] https://cwiki.apache.org/confluence/display/HUDI/RFC+-+33++Hudi+supports+more+comprehensive+Schema+Evolution
> > [5] https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform#timeline-metaserver
> > [6] https://github.com/codope/trino/tree/hudi-plugin
> > [7] https://trino.io/docs/current/develop/connectors.html

--
*Jian Feng, 冯健*
Shopee | Engineer | Data Infrastructure
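
For readers new to the Trino connector SPI referenced in items (a) and (b) of the proposal, a minimal sketch of the two pieces follows. The class names HudiMetadata and HudiSplit come from the proposal itself; the fields, method bodies, and the example schema name are hypothetical, and the SPI signatures shown are those of Trino releases from around this period (the SPI has evolved since).

// Minimal sketch; each public class would live in its own file,
// shown together here for brevity.
import java.util.List;

import io.trino.spi.HostAddress;
import io.trino.spi.connector.ConnectorMetadata;
import io.trino.spi.connector.ConnectorSession;
import io.trino.spi.connector.ConnectorSplit;

// (a) HudiMetadata answers the engine's metadata questions (schemas,
// tables, columns). Only one method is sketched; ConnectorMetadata
// provides default implementations for the rest.
public class HudiMetadata implements ConnectorMetadata
{
    @Override
    public List<String> listSchemaNames(ConnectorSession session)
    {
        // Hypothetical: a real implementation would list schemas from
        // the Hive metastore or Hudi's metadata table.
        return List.of("default");
    }
}

// (b) HudiSplit is one independently readable unit of a table scan,
// e.g. a Hudi file slice. HudiSplitManager (not shown; its getSplits()
// signature varies across Trino releases) would enumerate these splits
// per partition so the engine can schedule them across workers.
public class HudiSplit implements ConnectorSplit
{
    private final String path;  // hypothetical field: base/log file location
    private final long start;
    private final long length;

    public HudiSplit(String path, long start, long length)
    {
        this.path = path;
        this.start = start;
        this.length = length;
    }

    @Override
    public boolean isRemotelyAccessible()
    {
        return true;  // data is on shared storage, so any worker may read it
    }

    @Override
    public List<HostAddress> getAddresses()
    {
        return List.of();  // no locality preference for remote storage
    }

    @Override
    public Object getInfo()
    {
        return path;  // surfaced in Trino's split diagnostics
    }
}

With such a plugin on the classpath, a catalog file along the lines of etc/catalog/hudi.properties containing connector.name=hudi (the registered connector name here is assumed) would let Trino route hudi.<schema>.<table> queries through the new connector.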
