yihua commented on code in PR #18678:
URL: https://github.com/apache/hudi/pull/18678#discussion_r3178430212
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedFileFormat.scala:
##########
@@ -220,7 +220,8 @@ class HoodieFileGroupReaderBasedFileFormat(tablePath: String,
// This will enable us to take advantage of spark's file splitting capability.
// For overly large single files, we can use multiple concurrent tasks to read them, thereby reducing the overall job reading time consumption
val superSplitable = super.isSplitable(sparkSession, options, path)
- val splitable = !isMOR && !isIncremental && !isBootstrap && superSplitable
+ val isLance = hoodieFileFormat == HoodieFileFormat.LANCE
+ val splitable = !isMOR && !isIncremental && !isBootstrap && !isLance && superSplitable
Review Comment:
Could we follow up to revisit all such hardcoded format-related conditions? They should be made pluggable through a file format adapter, so that adding a new file format only requires changing the adapter implementation instead of updating every such condition scattered across the codebase.
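   For illustration, one possible shape of such an adapter (a hypothetical sketch; the trait name `HoodieFileFormatAdapter` and method `supportsSplitting` are illustrative and do not exist in Hudi today):

   ```scala
   // Hypothetical sketch: each file format supplies its own adapter, so
   // format-specific behavior lives in one place per format.
   trait HoodieFileFormatAdapter {
     /** Whether files of this format may be split across multiple read tasks. */
     def supportsSplitting: Boolean
   }

   object ParquetFormatAdapter extends HoodieFileFormatAdapter {
     override def supportsSplitting: Boolean = true
   }

   object LanceFormatAdapter extends HoodieFileFormatAdapter {
     // Lance files cannot be split across tasks, so this format opts out.
     override def supportsSplitting: Boolean = false
   }

   // isSplitable would then consult the adapter instead of checking the
   // format enum directly, e.g.:
   // val splitable = !isMOR && !isIncremental && !isBootstrap &&
   //   formatAdapter.supportsSplitting && superSplitable
   ```

   With this pattern, adding a new format means implementing one adapter object rather than touching each call site that currently compares against `HoodieFileFormat`.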
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]