wombatu-kun commented on PR #18958:
URL: https://github.com/apache/hudi/pull/18958#issuecomment-4714263650

   @cshuo This PR is write-only by design; materializing bytes (the read_blob 
path) is out of scope here and should be tracked on its own.
   
   On the read mechanism: a user-registered Flink `ScalarFunction` is 
technically sufficient for OUT_OF_LINE (OOL) references. Byte materialization 
needs only the reference fields (`external_path`, `offset`, `length`) plus a 
`HoodieStorage`/`StorageConfiguration` to open and seek the external file; it 
needs no timeline or table context. That is exactly what the Spark side does 
today in 
`hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/blob/BatchedBlobReader.scala`.
So the OOL reference this PR writes is self-describing enough for a standalone 
UDF to resolve.
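
   To illustrate how little state that resolution needs, here is a minimal 
sketch using only the JDK against a local file. `BlobRangeReader` and 
`readRange` are hypothetical names, not Hudi or Flink APIs; a real function 
would open the file through `HoodieStorage` rather than `FileChannel`, and the 
body would sit inside the `eval` method of a `ScalarFunction`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: resolves one out-of-line reference from its three fields.
final class BlobRangeReader {

    // Returns exactly `length` bytes starting at `offset` in the file at `externalPath`.
    static byte[] readRange(String externalPath, long offset, int length) {
        if (offset < 0 || length < 0) {
            throw new IllegalArgumentException("offset and length must be non-negative");
        }
        ByteBuffer buf = ByteBuffer.allocate(length);
        try (FileChannel ch = FileChannel.open(Paths.get(externalPath), StandardOpenOption.READ)) {
            long pos = offset;
            // A positional read may return fewer bytes than requested, so loop until full.
            while (buf.hasRemaining()) {
                int n = ch.read(buf, pos);
                if (n < 0) {
                    throw new IOException("Reached end of file before reading the full range");
                }
                pos += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.array();
    }
}
```

   Nothing in this sketch touches a timeline or table metadata; the reference 
fields and a way to open the file are the only inputs.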
   
   What does not carry over is the planner rewrite. Spark gets 
batched/coalesced I/O by injecting `ReadBlobRule` through 
`SparkSessionExtensions` (`HoodieAnalysis.scala:192`); Flink has no equivalent 
resolution/planner-rule hook, so a per-row UDF would resolve each reference 
independently and lose that batching. Recovering it would mean either buffering 
references manually inside the function or registering a built-in function 
through Flink's `Module` SPI, which is heavier.
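
   For a sense of what manual buffering would involve, here is a rough sketch 
of the coalescing step alone: given the byte ranges buffered for one external 
file, merge those that overlap or sit within a small gap so each merged range 
costs one read. `RangeCoalescer`, `Range`, and `maxGap` are hypothetical names 
for illustration; this is not the logic of `ReadBlobRule` or 
`BatchedBlobReader`, which may batch differently.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: merges nearby byte ranges of a single file.
final class RangeCoalescer {

    static final class Range {
        final long offset;
        final long length;

        Range(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }

        long end() {
            return offset + length;
        }
    }

    // Merges ranges whose gap to the previous merged range is at most maxGap bytes.
    static List<Range> coalesce(List<Range> ranges, long maxGap) {
        List<Range> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingLong((Range r) -> r.offset));

        List<Range> merged = new ArrayList<>();
        for (Range r : sorted) {
            if (!merged.isEmpty()) {
                Range last = merged.get(merged.size() - 1);
                if (r.offset - last.end() <= maxGap) {
                    // Extend the previous range to cover this one.
                    long newEnd = Math.max(last.end(), r.end());
                    merged.set(merged.size() - 1, new Range(last.offset, newEnd - last.offset));
                    continue;
                }
            }
            merged.add(r);
        }
        return merged;
    }
}
```

   A function using this would still have to decide when to flush its buffer 
and how to slice each merged read back into per-row results, which is the part 
a planner rule gives Spark for free.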
   
   I suggest filing a dedicated issue for the Flink read path so that the 
current write-only state is recorded as an explicit intermediate step. @kbuci 
can confirm the intended direction.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

