This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
     new e5ae1115458c feat(trino): [RFC-105] Trino Hudi Connector — Shim/Bundle 
Refactor (#18782)
e5ae1115458c is described below

commit e5ae1115458c99696399b7df62d48f6c00764426
Author: Y Ethan Guo <[email protected]>
AuthorDate: Tue May 26 20:12:18 2026 -0700

    feat(trino): [RFC-105] Trino Hudi Connector — Shim/Bundle Refactor (#18782)
---
 rfc/rfc-105/rfc-105.md | 225 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 225 insertions(+)

diff --git a/rfc/rfc-105/rfc-105.md b/rfc/rfc-105/rfc-105.md
new file mode 100644
index 000000000000..51c7aef2fb90
--- /dev/null
+++ b/rfc/rfc-105/rfc-105.md
@@ -0,0 +1,225 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-105: Trino Hudi Connector — Shim/Bundle Refactor
+
+## Proposers
+
+- @yihua
+- @voonhous
+
+## Approvers
+
+- @codope
+- @vinothchandar
+
+## Status
+
+Issue: [HUDI-18780](https://github.com/apache/hudi/issues/18780)
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Motivation
+
+The Trino-Hudi connector currently lives in `trinodb/trino` at 
`plugin/trino-hudi`. Maintaining and evolving the connector through the 
Trino-OSS-only path has stalled in practice, and the cost falls on Hudi users:
+
+- **Hudi-side improvement PRs to the Trino Hudi connector are not landing.** 
Four stacked PRs targeting the Trino Hudi connector were closed by Trino's 
stale-bot for lack of review:
+  - [trinodb/trino#28518](https://github.com/trinodb/trino/pull/28518)
+  - [trinodb/trino#28533](https://github.com/trinodb/trino/pull/28533)
+  - [trinodb/trino#28644](https://github.com/trinodb/trino/pull/28644)
+  - [trinodb/trino#28645](https://github.com/trinodb/trino/pull/28645)
+- **Significant Hudi-side work for the Trino connector is ready but cannot 
land** through the current path: metadata-table-driven partition listing, eight 
`HudiIndexSupport` strategies (column stats, partition stats, record-level, 
secondary, expression, bloom, bucket, partition bloom), MOR snapshot-isolation 
fixes (worker-side use of the latest commit time from the table handle), and 
file-system caching integration.
+- **The current arrangement does not scale.** Connector evolution must go 
through Trino-side review for every change, while the expertise and the 
source-of-truth for Hudi internals live in this project. Hudi releases cannot 
directly deliver improvements to Hudi users querying via Trino.
+
+Following alignment between the Hudi and Trino communities, the agreed 
direction is to split the connector into a thin Trino-side shim plus a 
Hudi-published artifact carrying the connector logic. This lets the Hudi 
project ship Trino-Hudi improvements with each Hudi release, while Trino picks 
them up via a one-line dependency-version bump.
+
+The single requirement carried over from the Trino side is that a 
comprehensive test suite for the connector continues to be maintained on the 
Trino side. This RFC documents the agreed approach and the implementation plan.
+
+## Abstract
+
+We split the Trino-Hudi connector into two Maven artifacts:
+
+1. **`io.trino:trino-hudi`** stays in Trino OSS (`plugin/trino-hudi`) as a 
thin shim — a `HudiPlugin` class that registers the `io.trino.spi.Plugin` SPI 
entry point — plus the test harness (smoke tests, query runners, MinIO-backed 
integration tests). Once landed, this module should rarely change.
+2. **`org.apache.hudi:hudi-trino`** is a new Hudi-published Maven artifact 
(regular, non-shaded JAR) containing the actual connector logic at 
`io.trino.plugin.hudi.*` — `HudiConnectorFactory`, `HudiConnector`, 
`HudiMetadata`, `HudiSplitManager`, `HudiPageSourceProvider`, all index-support 
strategies, the `HoodieStorage`/`HoodieIOFactory` bridges to Trino's 
filesystem, etc. The artifact is built against the latest Trino release's SPI; 
it declares `hudi-common`, `hudi-io`, etc. as transiti [...]
+
+The first publication ships in **Hudi 1.3.0**. The Trino-side shim PR pins 
`org.apache.hudi:hudi-trino:1.3.0`. Going forward, all Trino-Hudi connector 
evolution happens in Hudi OSS; Trino picks up changes by bumping the dependency 
version. To support this integration model, **Hudi will increase its release 
cadence**.
+
+## Background
+
+### State of the Trino-side connector today
+
+`plugin/trino-hudi` in `trinodb/trino` is the baseline: it implements the 
standard Trino SPI (`Plugin`, `ConnectorFactory`, `Connector`, 
`ConnectorMetadata`, `ConnectorSplitManager`, `ConnectorPageSourceProvider`, 
etc.), depends on `hudi-common` and `hudi-io`, and uses Hudi's `HoodieStorage` 
abstraction (RFC-74) over Trino's `TrinoFileSystem`. It has no direct Hadoop imports.
+
+### State of the Hudi-side `hudi-trino-plugin` work
+
+A more advanced version of the connector exists in Hudi-side branches under 
`hudi-trino-plugin/` (same `io.trino.plugin.hudi.*` package, built against a 
recent Trino release). On top of the Trino-OSS baseline it adds:
+
+- Eight `HudiIndexSupport` strategies (column stats, partition stats, 
record-level, secondary, expression, bloom, bucket, partition bloom) for file- 
and partition-level pruning via metadata tables.
+- Metadata-table-driven partition discovery (async, resumable).
+- MOR record-level merging via `HoodieFileGroupReader` 
(`HudiTrinoReaderContext`).
+- Lazy commit-time on `HudiTableHandle` for snapshot-isolated MOR reads across 
workers.
+- Background, weighted split generation; size-based split weighting; 
multi-reader routing (`HudiPageSource` for MOR, `HudiBaseFileOnlyPageSource` 
for COW/RO).
+- File-system cache integration.
+- `HoodieStorage` / `HoodieIOFactory` bridges over `TrinoFileSystem` 
(`HudiTrinoStorage`, `HudiTrinoInlineStorage`, `HudiTrinoIOFactory`).
+
+This is the body of code that will move into the `hudi-trino` Maven module on 
the Hudi side.
+
+### Why a "shim + Hudi-published artifact" pattern
+
+This pattern decouples Trino-Hudi connector evolution from the Trino-side 
release cycle:
+
+- The Hudi project can publish Trino-Hudi improvements with each Hudi release, 
without waiting for Trino-side reviews of every change.
+- The Trino-side surface shrinks to a stable plugin-registration shim, so 
Trino-side review burden is minimal — typically a one-line version bump per 
Hudi release.
+- All Hudi-Trino integration code (`io.trino.plugin.hudi.*`) is co-located 
with the Hudi core libraries it depends on. Changes that cross the 
Hudi-internal / connector boundary can land atomically.
+- The artifact is **purpose-built for Trino** and implements Trino's SPI 
directly, so no intermediate adapter layer is needed between the published 
artifact and the Trino plugin.
+
+Trino's `trino-spi` is governed by `revapi-maven-plugin` (see 
`core/trino-spi/pom.xml`) which enforces backward compatibility on the SPI 
surface. This is what makes a single `hudi-trino` artifact targeting the latest 
Trino release viable across multiple subsequent Trino releases.
+
+Trino loads each plugin in an isolated `URLClassLoader`. Transitive 
dependencies of `hudi-trino` (Avro, Parquet, etc.) are isolated to the plugin's 
classloader and cannot conflict with other plugins.
+
+## Implementation
+
+### Architecture
+
+```
+trinodb/trino : plugin/trino-hudi   (packaging = trino-plugin)
+    HudiPlugin.java       ← thin shim: trivial Plugin SPI registration
+    META-INF/services/io.trino.spi.Plugin
+    src/test/java/...     ← full Trino-side test suite
+    pom.xml               ← depends on org.apache.hudi:hudi-trino:1.3.0
+                                       │
+                                       │  Maven Central
+                                       ▼
+apache/hudi : hudi-trino-plugin/    (Maven profile -Phudi-trino,
+                                     excluded from default reactor,
+                                     JDK 25 required)
+    io.trino.plugin.hudi.*           ← all connector logic:
+        HudiConnectorFactory, HudiConnector, HudiMetadata,
+        HudiSplitManager, HudiPageSourceProvider,
+        cache/, file/, io/, partition/,
+        query/ (incl. 8 index-support strategies),
+        reader/, split/, stats/, storage/, util/
+    src/test/java/...                 ← full duplicated + expanded suite
+  Published as: org.apache.hudi:hudi-trino:1.3.0
+```
+
+### What lives where
+
+#### Trino-side `plugin/trino-hudi` (the shim)
+
+| File | Purpose |
+|---|---|
+| `src/main/java/io/trino/plugin/hudi/HudiPlugin.java` | Implements 
`io.trino.spi.Plugin`. Single method returning `new HudiConnectorFactory()` 
(from the `hudi-trino` artifact). ~10 lines. |
+| `src/main/resources/META-INF/services/io.trino.spi.Plugin` | Service-loader 
pointer to `io.trino.plugin.hudi.HudiPlugin`. |
+| `pom.xml` | `<packaging>trino-plugin</packaging>`; pins 
`org.apache.hudi:hudi-trino:<version>`; SPI deps as `provided`. |
+| `src/test/java/...` | All current Trino-side tests stay: `HudiQueryRunner`, 
`TestHudiSmokeTest`, `TestHudiMinioConnectorSmokeTest`, 
`TestHudiConnectorTest`, `TestHudiSharedMetastore`, `TestHudiSystemTables`, 
`TestHudiPlugin`, `TestHudiConfig`, plus data initializers. Required by the 
Trino-side test-coverage commitment. |
+
+#### Hudi-side `hudi-trino-plugin/` (the engine)
+
+Everything else from the current `hudi-trino-plugin/` work, organized exactly 
as it is today:
+
+| Subpackage | Responsibility |
+|---|---|
+| `io.trino.plugin.hudi` | `HudiConnectorFactory`, `HudiConnector`, 
`HudiMetadata`, `HudiSplitManager`, `HudiPageSourceProvider`, `HudiSplit`, 
`HudiTableHandle`, `HudiModule`, `HudiConfig`, `HudiSessionProperties`, 
`HudiTableProperties`, `HudiTransactionManager`, `HudiMetadataFactory`. |
+| `.cache` | `HudiCacheKeyProvider` for file-system cache integration. |
+| `.file` | `HudiBaseFile`, `HudiLogFile`, file metadata abstractions. |
+| `.io` | `HudiTrinoIOFactory` (extends `HoodieIOFactory`), 
`HudiTrinoFileReaderFactory`, `TrinoSeekableDataInputStream`. |
+| `.partition` | `HudiPartitionInfo`, `HiveHudiPartitionInfo`, 
`HudiPartitionInfoLoader` (async resumable task). |
+| `.query` | `HudiDirectoryLister`, `HudiReadOptimizedDirectoryLister`, 
`HudiSnapshotDirectoryLister`; `query.index` package with 8 `HudiIndexSupport` 
strategies. |
+| `.reader` | `HudiTrinoReaderContext extends 
HoodieReaderContext<IndexedRecord>` for MOR record merging. |
+| `.split` | `HudiSplitFactory`, `HudiBackgroundSplitLoader`, 
`HudiSplitSource`, `HudiSplitWeightProvider`, `SizeBasedSplitWeightProvider`. |
+| `.stats` | `HudiTableStatistics`, `TableStatisticsReader`. |
+| `.storage` | `HudiTrinoStorage` (extends `HoodieStorage`), 
`HudiTrinoInlineStorage`, `TrinoStorageConfiguration`. |
+| `.util` | Serialization helpers, column synthesis, tuple-domain conversion, 
table-type utilities. |
+
+### API boundary
+
+The boundary between the shim and the published artifact is **Trino's SPI 
itself** — no intermediate API layer is introduced.
+
+- **Shim → artifact:** `HudiPlugin.getConnectorFactories()` returns `new 
HudiConnectorFactory()` defined in the artifact. Trino's runtime then calls 
`factory.create(catalogName, config, context)`. The `ConnectorContext` argument 
carries everything the artifact needs — `TypeManager`, `NodeManager`, 
`MetadataProvider`, `PageSorter`, `PageIndexerFactory`, `OpenTelemetry`, 
`Tracer`, `CatalogHandle` — without the artifact importing implementation 
classes.
+- **Artifact → Trino:** the artifact's `HudiConnector` exposes the standard 
SPI providers (`ConnectorMetadata`, `ConnectorSplitManager`, 
`ConnectorPageSourceProvider`, etc.). Trino calls these. Classloader context is 
handled by the standard `ClassLoaderSafe*` wrappers 
(`io.trino.plugin.base.classloader.*`) — already used today.
+
+### Maven dependencies for `hudi-trino`
+
+- **`compile`:** Hudi libs (`hudi-common`, `hudi-io`, `hudi-hive-sync`, 
`hudi-sync-common`) and Trino libs (`trino-filesystem`, `trino-hive`, 
`trino-metastore`, `trino-parquet`, `trino-cache`), Guice, Airlift, Caffeine.
+- **`provided`:** `trino-spi`, `slice`, Jackson, OpenTelemetry API, JOL 
(supplied by Trino at runtime).
+- **`runtime`:** log-manager, Dropwizard metrics, OpenTelemetry SDK, 
`trino-hive-formats`.
+- **`test`:** Trino testing libs (`trino-testing`, `trino-main`, 
`trino-testing-containers`, `trino-hdfs`), AssertJ, JUnit 5, Hudi test JARs.
+
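+A minimal sketch of how these scopes might appear in the `hudi-trino` POM, with one representative dependency per scope. The `${trino.version}` property is an assumption for illustration, and the Maven coordinates are inferred from the artifact names listed above rather than quoted from the actual POM:
+
```xml
<dependencies>
  <!-- compile (Maven's default scope): Hudi core, transitive for consumers -->
  <dependency>
    <groupId>org.apache.hudi</groupId>
    <artifactId>hudi-common</artifactId>
    <version>${project.version}</version>
  </dependency>
  <!-- provided: the SPI is supplied by the Trino runtime -->
  <dependency>
    <groupId>io.trino</groupId>
    <artifactId>trino-spi</artifactId>
    <version>${trino.version}</version>
    <scope>provided</scope>
  </dependency>
  <!-- runtime: needed on the plugin classpath, not at compile time -->
  <dependency>
    <groupId>io.trino</groupId>
    <artifactId>trino-hive-formats</artifactId>
    <version>${trino.version}</version>
    <scope>runtime</scope>
  </dependency>
  <!-- test: Trino testing harness -->
  <dependency>
    <groupId>io.trino</groupId>
    <artifactId>trino-testing</artifactId>
    <version>${trino.version}</version>
    <scope>test</scope>
  </dependency>
</dependencies>
```
+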
+**Version alignment policy.** Trino versions are authoritative for shared 
libraries (Avro, Parquet, Jackson, Airlift). The `hudi-trino` POM pins these 
via `<dependencyManagement>` to whatever the targeted Trino release uses. If 
Hudi internals need a newer version, the fix is on the Hudi side or via a 
Trino-version bump — never by shipping divergent classpath versions.
+
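+As an illustration of the pinning described above, a `<dependencyManagement>` block could look roughly like the following. The property names and placeholder values are hypothetical; the real values would be copied from whichever Trino release the module targets:
+
```xml
<properties>
  <!-- Hypothetical properties; set each to the version used by the targeted Trino release -->
  <trino.avro.version>AVRO_VERSION_FROM_TRINO</trino.avro.version>
  <trino.parquet.version>PARQUET_VERSION_FROM_TRINO</trino.parquet.version>
  <trino.jackson.version>JACKSON_VERSION_FROM_TRINO</trino.jackson.version>
</properties>

<dependencyManagement>
  <dependencies>
    <!-- Managed versions override whatever Hudi's transitive dependencies request -->
    <dependency>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro</artifactId>
      <version>${trino.avro.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.parquet</groupId>
      <artifactId>parquet-column</artifactId>
      <version>${trino.parquet.version}</version>
    </dependency>
    <dependency>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-databind</artifactId>
      <version>${trino.jackson.version}</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```
+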
+### Build target on Hudi side
+
+Trino requires Java 25, while the rest of Hudi targets a lower Java floor. 
`hudi-trino-plugin` therefore lives behind a Maven profile (`-Phudi-trino`) and 
is **excluded from the default `mvn install` reactor**:
+
+```xml
+<profile>
+  <id>hudi-trino</id>
+  <modules>
+    <module>hudi-trino-plugin</module>
+  </modules>
+</profile>
+```
+
+The default build (`mvn install`) skips it; the Trino-targeted build (`mvn install -Phudi-trino`) requires JDK 25.
+
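+One way to make the JDK requirement fail fast is an enforcer rule inside the same profile. This is a sketch under assumptions, not something the RFC prescribes: it uses the standard `maven-enforcer-plugin` rule `requireJavaVersion`, and it assumes the plugin version is managed in the parent POM's `<pluginManagement>`:
+
```xml
<profile>
  <id>hudi-trino</id>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-enforcer-plugin</artifactId>
        <executions>
          <execution>
            <!-- Hypothetical execution id -->
            <id>enforce-jdk-25</id>
            <goals>
              <goal>enforce</goal>
            </goals>
            <configuration>
              <rules>
                <!-- Version range: JDK 25 or newer -->
                <requireJavaVersion>
                  <version>[25,)</version>
                </requireJavaVersion>
              </rules>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</profile>
```
+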
+### CI
+
+Two new GitHub Actions workflows on the Hudi side, required for any change touching `hudi-trino-plugin/**`:
+
+1. **`hudi-trino-ci.yml`** — runs the full test suite via `mvn verify 
-Phudi-trino` on JDK 25. Catches regressions before they ship in a Hudi release.
+2. **`hudi-trino-compat.yml`** — nightly: pulls latest `trinodb/trino` master 
and latest `apache/hudi` master, builds Trino's relevant modules, then compiles 
`hudi-trino-plugin` against them and runs the `hudi-trino-plugin` test suite. 
Catches both SPI drift and behavioral incompatibilities before the next Trino 
release.
+
+On the Trino side, existing CI continues to build and test 
`plugin/trino-hudi`, exercising the published `hudi-trino` artifact end-to-end 
on every Trino PR.
+
+### Test strategy
+
+**Full test duplication.** The Trino-side smoke tests (`TestHudiSmokeTest`, 
`TestHudiMinioConnectorSmokeTest`, `TestHudiConnectorTest`, etc.) are mirrored 
on the Hudi side and additionally extended.
+
+- **Trino side runs them** on every Trino PR — fulfilling the Trino-side 
test-coverage commitment.
+- **Hudi side runs them** on every Hudi PR touching `hudi-trino-plugin` — so 
Hudi contributors catch regressions before they ship in a Hudi release. The 
Hudi-side suite is also **expanded** with more granular unit tests covering 
split generation edge cases, all eight index-support strategies, the MOR 
record-merging path, lazy-commit-time snapshot isolation, and the cache-key 
provider.
+
+This duplication has a known cost — two places to update when adding tests — 
but is the right trade-off given:
+- The Trino-side suite must remain comprehensive as agreed with the Trino 
community.
+- Hudi-side contributors need fast feedback without waiting for a Trino-side 
PR cycle.
+
+### Risks & caveats
+
+- **Trino SPI drift.** A future Trino SPI change could break the pre-built 
`hudi-trino` artifact at runtime. Mitigation: the nightly compat CI compiles 
and runs the `hudi-trino-plugin` test suite against Trino master, flagging both 
compile-time and behavioral incompatibilities before a Trino release ships.
+- **Avro / Parquet / Jackson version skew.** Resolved by policy: Trino's 
versions are authoritative, pinned via `<dependencyManagement>` in the 
`hudi-trino` POM. Hudi-side fixes or Trino-version bumps adjust to it.
+- **Test-infrastructure coupling.** `hudi-trino-plugin`'s test scope depends 
on `trino-testing`, `trino-main`, etc., coupling the Hudi build to Trino 
artifacts on Maven Central. Acceptable cost.
+- **Release coordination.** A critical fix in `hudi-trino` ships only via a 
Hudi release. Mitigation: keep the Trino-side shim trivial so virtually all 
fixes can land in `hudi-trino`, and increase Hudi release cadence.
+- **License / ASF process.** `hudi-trino` is released by an ASF project (Hudi) and consumed by Trino, which is Apache-2.0 licensed but is not an ASF project; covered by standard PMC announcements at first release.
+
+## Rollout/Adoption Plan
+
+**Step 1 — Hudi 1.3.0 publishes `hudi-trino`.** Land the `hudi-trino-plugin` 
work in `apache/hudi` master behind the `-Phudi-trino` profile, land the two CI 
workflows, then publish `org.apache.hudi:hudi-trino:1.3.0` to Maven Central as 
part of the 1.3.0 release. Hudi commits to a more frequent release cadence 
going forward: to start, a major release roughly every month, with more 
frequent minor releases to stabilize this module. A `hudi-trino` artifact is 
cut whenever there are enough im [...]
+
+**Step 2 — Trino-side shim PR.** A small PR against `trinodb/trino` that 
replaces the contents of 
`plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/` with a single 
`HudiPlugin.java`, keeps `META-INF/services/io.trino.spi.Plugin` and all 
current tests, and adds `org.apache.hudi:hudi-trino:1.3.0` as a `compile` 
dependency. The PR is small by design (it deletes connector code and points at the published artifact), so Trino-maintainer review burden is minimal.
+
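+The dependency added to `plugin/trino-hudi/pom.xml` in this step might look like the sketch below. Routing the version through a property is an assumption, and the property name is hypothetical; the coordinates are those given in this RFC:
+
```xml
<properties>
  <!-- Hypothetical property name; bumping this value is the per-release change -->
  <dep.hudi-trino.version>1.3.0</dep.hudi-trino.version>
</properties>

<dependencies>
  <dependency>
    <groupId>org.apache.hudi</groupId>
    <artifactId>hudi-trino</artifactId>
    <version>${dep.hudi-trino.version}</version>
    <!-- No <scope> element: Maven defaults to compile -->
  </dependency>
</dependencies>
```
+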
+**Step 3 — Steady state.** Trino-Hudi feature work and bug fixes happen on the 
Hudi side. Each Hudi release publishes a new `hudi-trino` artifact. Trino picks 
up changes by bumping the pinned version — a one-line PR per release. Because 
Trino pins `hudi-trino` as a compile-time dependency, the `hudi-trino` ↔ Trino 
release mapping is **one-to-one**: each `hudi-trino` version is the version 
that ships embedded in a specific Trino release. This one-to-one mapping makes 
feature support and b [...]
+
+**Impact on existing users.** No behavioral change: same `HudiPlugin` 
registration, same catalog config. The first Trino release picking up 
`hudi-trino:1.3.0` gains the features that the previously-stalled PRs covered 
(metadata-table partition listing, index support, MOR snapshot-isolation 
correctness, file-system caching, advanced split generation). No migration 
tools needed.
+
+## Test Plan
+
+The RFC is validated when:
+
+- [ ] `hudi-trino-plugin` builds and its full test suite passes on the Hudi 
side via `mvn verify -Phudi-trino` (JDK 25), covering smoke tests, 
MinIO/Alluxio caching tests, MOR/COW tests, page source tests, split-factory 
tests, index support tests, system-table tests, and plugin/config tests.
+- [ ] The `hudi-trino-compat.yml` workflow compiles `hudi-trino-plugin` against `trinodb/trino` master and runs its test suite successfully.
+- [ ] `org.apache.hudi:hudi-trino:1.3.0` is published to Maven Central.
+- [ ] The Trino-side shim PR is green: Trino's CI for `plugin/trino-hudi` 
passes against the published `1.3.0` artifact, including MinIO/S3 integration 
and plugin-loading tests.
+- [ ] At least one subsequent Hudi-side patch release exercises the "bump 
version in Trino" steady-state flow end-to-end.
