Dear Hudi Community,

I would like to propose the development of a new Trino plugin/connector for
Hudi.

Today, Hudi supports snapshot queries on Copy-On-Write (COW) tables and
read-optimized queries on Merge-On-Read (MOR) tables with Trino, through the
input-format-based integration in the Hive connector [1]. This approach has
known performance limitations with very large tables, which have since been
fixed in PrestoDB [2]. We are working on replicating the same fixes in
Trino as well [3].

However, as Hudi keeps getting better, a new plugin that provides access to
Hudi data and metadata would unlock those capabilities for Trino users, to
name a few benefits: metadata-based listing, full schema evolution, etc. [4].
Moreover, a separate Hudi connector would allow the integration to evolve
independently, without having to worry about hacking into or breaking the
Hive connector.

A separate connector also falls in line with our vision [5] of a standalone
timeline server or a lake cache to balance the tradeoff between writing and
querying. Imagine users having read and write access to Hudi data and
metadata directly through Trino.

I did some prototyping to get snapshot queries on a Hudi COW table working
with a new plugin [6], and I feel the effort is worth it. The high-level
approach is to implement the connector SPI [7] provided by Trino, such as:
a) HudiMetadata implements ConnectorMetadata to fetch table metadata.
b) HudiSplit and HudiSplitManager implement ConnectorSplit and
ConnectorSplitManager to produce logical units of data partitioning, so
that Trino can parallelize reads and writes.
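
To make (a) and (b) concrete, the plugin skeleton could look roughly as
follows. This is a sketch only, not the prototype's actual code: it would
compile against the trino-spi dependency, the SPI interfaces (Plugin,
ConnectorFactory, ConnectorMetadata, ConnectorSplitManager) are the real
ones from [7], and HudiConnectorFactory plus all elided method bodies are
assumptions for illustration.

```java
// Sketch of the proposed plugin layout; requires the io.trino:trino-spi
// dependency on the classpath. Names other than the SPI interfaces are
// hypothetical.
import io.trino.spi.Plugin;
import io.trino.spi.connector.ConnectorFactory;
import io.trino.spi.connector.ConnectorMetadata;
import io.trino.spi.connector.ConnectorSplitManager;

import java.util.List;

public class HudiPlugin implements Plugin
{
    @Override
    public Iterable<ConnectorFactory> getConnectorFactories()
    {
        // HudiConnectorFactory (hypothetical) creates the connector and
        // wires HudiMetadata and HudiSplitManager into it.
        return List.of(new HudiConnectorFactory());
    }
}

// a) Serves table and column metadata by reading the Hudi table metadata.
class HudiMetadata implements ConnectorMetadata
{
    // ConnectorMetadata methods (getTableHandle, listTables, ...) elided.
}

// b) Produces HudiSplit instances, the logical units of data partitioning,
//    so that Trino workers can scan a table in parallel.
class HudiSplitManager implements ConnectorSplitManager
{
    // ConnectorSplitManager#getSplits elided.
}
```

Trino loads such a plugin via the Plugin service provider interface, so the
Hudi-specific logic stays fully isolated from the Hive connector.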

Let me know your thoughts on the proposal. I can draft an RFC for the
detailed design discussion once we have consensus.

Regards,
Sagar

References:
[1] https://github.com/prestodb/presto/commits?author=vinothchandar
[2] https://prestodb.io/blog/2020/08/04/prestodb-and-hudi
[3] https://github.com/trinodb/trino/pull/9641
[4] https://cwiki.apache.org/confluence/display/HUDI/RFC+-+33++Hudi+supports+more+comprehensive+Schema+Evolution
[5] https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform#timeline-metaserver
[6] https://github.com/codope/trino/tree/hudi-plugin
[7] https://trino.io/docs/current/develop/connectors.html