Hi Sagar,

Thanks for the detailed write-up. +1 on the separate connector in general.

I would love to understand a few aspects that work really well with the Hive
connector path (which is largely why we did it this way to begin with):

- What is the new user experience? With the Hive plugin
integration, Hudi tables can be queried like any Hive table, which is very
nice and makes it easy to get started. Can we provide a similarly seamless
experience, and what happens to existing tables?

- What are we giving up? The Trino docs mention caching and other features
that are built into the Hive connector.

- IMO we should retain the Hive connector path as well. Most of the issues
we faced arose because Hudi was adding transactions/snapshots, which had no
good abstractions in the Hive connector.

Thanks
Vinoth

On Sat, Oct 16, 2021 at 11:06 PM sagar sumit <sagarsumi...@gmail.com> wrote:

> Dear Hudi Community,
>
> I would like to propose the development of a new Trino plugin/connector for
> Hudi.
>
> Today, Hudi supports snapshot queries on Copy-On-Write (COW) tables and
> read-optimized queries on Merge-On-Read (MOR) tables with Trino, through
> the input-format-based integration in the Hive connector [1]. This
> approach has known performance limitations with very large tables, which
> have since been fixed on PrestoDB [2]. We are working on replicating the
> same fixes on Trino as well [3].
>
> However, as Hudi keeps getting better, a new plugin to provide access to
> Hudi data and metadata will help unlock those capabilities for Trino
> users: metadata-based listing and full schema evolution [4], to name a
> few. Moreover, a separate Hudi connector would allow it to evolve
> independently, without having to worry about hacking/breaking the Hive
> connector.
>
> A separate connector also falls in line with our vision [5] when we
> think of a standalone timeline server or a lake cache to balance the
> tradeoff between writing and querying. Imagine users having read and
> write access to data and metadata in Hudi directly through Trino.
>
> I did some prototyping to get snapshot queries on a Hudi COW table
> working with a new plugin [6], and I feel the effort is worth it. The
> high-level approach is to implement the connector SPI [7] provided by
> Trino (see the sketch after this list), such as:
> a) HudiMetadata implements ConnectorMetadata to fetch table metadata.
> b) HudiSplit and HudiSplitManager implement ConnectorSplit and
> ConnectorSplitManager to produce logical units of data partitioning, so
> that Trino can parallelize reads and writes.
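>
> To make the shape of this concrete, here is a rough sketch of how such a
> plugin could register itself through the Trino SPI. Only HudiPlugin,
> HudiMetadata and HudiSplitManager come from the prototype above; the
> package name and HudiConnectorFactory are placeholders, and exact SPI
> signatures can differ across Trino versions:
>
>   package io.trino.plugin.hudi;
>
>   import io.trino.spi.Plugin;
>   import io.trino.spi.connector.ConnectorFactory;
>   import java.util.List;
>
>   // Entry point that Trino discovers when loading the plugin; it returns
>   // the factory that wires up HudiMetadata, HudiSplitManager, etc.
>   public class HudiPlugin
>           implements Plugin
>   {
>       @Override
>       public Iterable<ConnectorFactory> getConnectorFactories()
>       {
>           // HudiConnectorFactory (hypothetical, not shown) would create
>           // the Connector that exposes HudiMetadata and HudiSplitManager.
>           return List.of(new HudiConnectorFactory());
>       }
>   }
>
> From there, a catalog file like etc/catalog/hudi.properties with
> connector.name=hudi should be all a user needs to point Trino at existing
> Hudi tables, much like the Hive catalog is configured today.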
>
> Let me know your thoughts on the proposal. I can draft an RFC for the
> detailed design discussion once we have consensus.
>
> Regards,
> Sagar
>
> References:
> [1] https://github.com/prestodb/presto/commits?author=vinothchandar
> [2] https://prestodb.io/blog/2020/08/04/prestodb-and-hudi
> [3] https://github.com/trinodb/trino/pull/9641
> [4] https://cwiki.apache.org/confluence/display/HUDI/RFC+-+33++Hudi+supports+more+comprehensive+Schema+Evolution
> [5] https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform#timeline-metaserver
> [6] https://github.com/codope/trino/tree/hudi-plugin
> [7] https://trino.io/docs/current/develop/connectors.html
>
