hudi-agent commented on code in PR #18403:
URL: https://github.com/apache/hudi/pull/18403#discussion_r3186279309
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedFileFormat.scala:
##########
@@ -133,6 +134,24 @@ class HoodieFileGroupReaderBasedFileFormat(tablePath:
String,
}
}
+ /**
+ * Whether the requested schema contains any top-level BLOB columns. Used to
disable
+ * Lance batch mode for BLOB tables: the DESCRIPTOR-mode rewrite (and the
OUT_OF_LINE
+ * dataโnull contract) lives only in the row-path BlobDescriptorTransform,
and `supportBatch`
+ * cannot inspect read-time options (e.g. `hoodie.read.blob.inline.mode`)
since it runs at
+ * planning time. Forcing row mode whenever BLOB columns are present is the
simplest correct
+ * gate โ BLOB processing is per-row anyway (lazy byte materialization) so
the perf delta is
+ * negligible.
+ */
+ private def schemaContainsBlobColumn(schema: StructType): Boolean = {
+ schema.fields.exists { f =>
+ val md = f.metadata
+ md != null && md.contains(HoodieSchema.TYPE_METADATA_FIELD) &&
+
HoodieSchema.parseTypeDescriptor(md.getString(HoodieSchema.TYPE_METADATA_FIELD))
+ .getType == HoodieSchemaType.BLOB
+ }
+ }
+
/**
* Checks if the file format supports vectorized reading, please refer to
SPARK-40918.
*
Review Comment:
๐ค [Line 160] Nice fix for the BLOB case โ the comment here ("so the planner
doesn't commit to ColumnarBatch output that the row-path reader cannot
deliver") describes exactly the same risk that exists for the `hasTypeChanges`
runtime fallback in `SparkLanceReaderBase.read`. When a COW Lance table has
implicit schema evolution, `supportBatch` still returns true here (no BLOB),
but at runtime any file with type changes returns `Iterator[InternalRow]` while
Spark already expects `ColumnarBatch`. Worth extending this gate (e.g. also
disable when `internalSchemaOpt.isPresent`) to close that path too?
<sub><i>- AI-generated; verify before applying. React ๐/๐ to flag
quality.</i></sub>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]