Re: [PR] feat(flink): Support writing out-of-line BLOB columns [hudi]

via GitHub Mon, 15 Jun 2026 19:25:29 -0700


wombatu-kun commented on PR #18958:
URL: https://github.com/apache/hudi/pull/18958#issuecomment-4714263650

@cshuo This PR is write-only by design; materializing bytes (the read_blob
path) is out of scope here and should be tracked on its own.

On the read mechanism: a user-registered Flink `ScalarFunction` is
technically sufficient for OUT_OF_LINE references. Byte materialization needs only the
reference fields (external_path/offset/length) plus a
`HoodieStorage`/`StorageConfiguration` to open and seek the external file - no
timeline or table context - which is exactly what the Spark side does today in
`hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/blob/BatchedBlobReader.scala`.
So the OOL reference this PR writes is self-describing enough for a standalone
UDF to resolve.
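
To make the point concrete, here is a minimal sketch of the byte-range read such a UDF would perform. It is an illustration only: `BlobRangeReader` and `readRange` are hypothetical names, and plain `java.nio` stands in for `HoodieStorage` so the snippet stays self-contained. In a real Flink function this logic would sit inside the `eval` method of a `ScalarFunction`, with the storage handle set up in `open`.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: resolves one out-of-line reference
// (external_path / offset / length) into its bytes.
final class BlobRangeReader {

    static byte[] readRange(Path externalPath, long offset, int length) throws IOException {
        if (offset < 0 || length < 0) {
            throw new IllegalArgumentException("offset and length must be non-negative");
        }
        // Assumption: a local file channel stands in for the storage abstraction.
        try (FileChannel channel = FileChannel.open(externalPath, StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocate(length);
            long position = offset;
            // Positional reads may return fewer bytes than requested, so loop.
            while (buffer.hasRemaining()) {
                int read = channel.read(buffer, position);
                if (read < 0) {
                    throw new EOFException("reference extends past end of " + externalPath);
                }
                position += read;
            }
            return buffer.array();
        }
    }
}
```

Nothing here touches the timeline or table metadata; the three reference fields and a way to open the file are the only inputs.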

What does not carry over is the planner rewrite. Spark gets
batched/coalesced I/O by injecting `ReadBlobRule` through
`SparkSessionExtensions` (`HoodieAnalysis.scala:192`); Flink has no equivalent
resolution/planner-rule hook, so a per-row UDF would resolve each reference
independently and lose that batching. Recovering it would mean manual buffering
in the function or going through Flink's `Module` SPI for a built-in, which is
heavier.
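
For a sense of what "manual buffering in the function" would involve, below is a rough sketch of the coalescing step: buffered references to the same external file are sorted by offset and merged when they overlap or sit within a small gap, so each merged range costs one read. `RangeCoalescer`, `Range`, and `maxGap` are hypothetical names for illustration; this is not a description of how `BatchedBlobReader` is implemented.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: merges byte ranges that target the same external file.
final class RangeCoalescer {

    static final class Range {
        final long offset;
        final long length;

        Range(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }

        long end() {
            return offset + length;
        }
    }

    // Merges ranges whose gap to the previous range is at most maxGap bytes.
    static List<Range> coalesce(List<Range> refs, long maxGap) {
        List<Range> sorted = new ArrayList<>(refs);
        sorted.sort(Comparator.comparingLong((Range r) -> r.offset));

        List<Range> merged = new ArrayList<>();
        for (Range current : sorted) {
            if (!merged.isEmpty()) {
                Range last = merged.get(merged.size() - 1);
                if (current.offset - last.end() <= maxGap) {
                    // Extend the previous range to cover the current one.
                    long end = Math.max(last.end(), current.end());
                    merged.set(merged.size() - 1, new Range(last.offset, end - last.offset));
                    continue;
                }
            }
            merged.add(current);
        }
        return merged;
    }
}
```

A per-row `ScalarFunction` never sees more than one reference at a time, so it has nothing to feed into a step like this unless it buffers rows itself.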

I suggest filing a dedicated issue for the Flink read path so that the current
write-only state is recorded as an explicit intermediate step. @kbuci can confirm the
intended direction.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
