wombatu-kun commented on PR #18958: URL: https://github.com/apache/hudi/pull/18958#issuecomment-4714263650
@cshuo This PR is write-only by design; materializing bytes (the read_blob path) is out of scope here and should be tracked on its own.

On the read mechanism: a user-registered Flink `ScalarFunction` is technically sufficient for OUT_OF_LINE. Byte materialization needs only the reference fields (external_path/offset/length) plus a `HoodieStorage`/`StorageConfiguration` to open and seek the external file - no timeline or table context - which is exactly what the Spark side does today in `hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/blob/BatchedBlobReader.scala`. So the OOL reference this PR writes is self-describing enough for a standalone UDF to resolve.

What does not carry over is the planner rewrite. Spark gets batched/coalesced I/O by injecting `ReadBlobRule` through `SparkSessionExtensions` (`HoodieAnalysis.scala:192`); Flink has no equivalent resolution/planner-rule hook, so a per-row UDF would resolve each reference independently and lose that batching. Recovering it would mean manual buffering in the function or going through Flink's `Module` SPI for a built-in, which is heavier.

Suggest filing a dedicated issue for the Flink read path so the current write-only state is an explicit intermediate step. @kbuci can confirm the intended direction.
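
To make the per-row resolution concrete, here is a minimal sketch of resolving one OOL reference from just its three fields. It is JDK-only and assumes the external file is on a local filesystem; `BlobRangeReader` and `readRange` are hypothetical names, not Hudi or Flink APIs, and a real implementation would presumably open the file through `HoodieStorage` rather than `RandomAccessFile`.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

// Hypothetical helper (not part of Hudi or Flink): resolves a single
// out-of-line reference (external_path, offset, length) to its bytes.
// Assumption: the path is a local file; no timeline or table context is used.
final class BlobRangeReader {

    private BlobRangeReader() {}

    static byte[] readRange(String externalPath, long offset, int length) {
        if (offset < 0 || length < 0) {
            throw new IllegalArgumentException("offset and length must be non-negative");
        }
        // One open + one seek per reference: this is the per-row cost that
        // a batching planner rule would otherwise amortize.
        try (RandomAccessFile file = new RandomAccessFile(externalPath, "r")) {
            if (offset + length > file.length()) {
                throw new IllegalArgumentException("range exceeds file size");
            }
            byte[] out = new byte[length];
            file.seek(offset);
            file.readFully(out);
            return out;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In a Flink job this logic would sit in the `eval` method of a `ScalarFunction` subclass. Each call opens and seeks independently, so references pointing into the same external file are not coalesced, which is the batching gap described above.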
