Copilot commented on code in PR #6565:
URL: https://github.com/apache/hive/pull/6565#discussion_r3496769449
##########
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedDummyColumnReader.java:
##########
@@ -64,6 +68,41 @@ public void readBatch(int total, ColumnVector col, TypeInfo
typeInfo) throws IOE
if (typeInfo.getCategory() == ObjectInspector.Category.PRIMITIVE) {
fillPrimitive(col, (PrimitiveTypeInfo) typeInfo, defaultValue);
+ } else if (typeInfo.getCategory() == ObjectInspector.Category.STRUCT) {
+ fillStruct(col, (StructTypeInfo) typeInfo, defaultValue);
+ } else {
+ throw new IOException("Unsupported type category in DummyColumnReader: "
+ typeInfo.getCategory());
+ }
Review Comment:
`defaultValue` can be a nested Map (struct defaults), but `readBatch()`
always treats any non-null default as compatible with the requested `typeInfo`.
If a nested field is missing from the Parquet file schema, the dummy reader can
be built for that *nested* primitive and receive the *parent struct's* Map
default, which will currently blow up with a `ClassCastException` in
`fillPrimitive` (e.g., `(Number) value`). Consider defensively converting such
type mismatches into a clear `IOException` (or treating it as null) to avoid
hard-to-debug runtime failures.
##########
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/vector/HiveVectorizedReader.java:
##########
@@ -187,6 +188,12 @@ static Map<String, Object>
getInitialColumnDefaults(List<Types.NestedField> colu
for (Types.NestedField column : columns) {
if (column.initialDefault() != null) {
columnDefaults.put(column.name(), column.initialDefault());
+ } else if (column.type().isStructType()) {
+ Map<String, Object> structDefaults =
+
HiveSchemaUtil.getStructInitialDefaults(column.type().asStructType());
+ if (!structDefaults.isEmpty()) {
+ columnDefaults.put(column.name(), structDefaults);
+ }
Review Comment:
`getInitialColumnDefaults` now stores struct defaults as a nested `Map`
under the top-level column name. The Parquet vectorized reader's dummy-path for
*missing nested fields* looks up defaults using only `descriptor.getPath()[0]`
(top-level column), so a missing nested primitive field can receive this Map
and crash in `VectorizedDummyColumnReader.fillPrimitive`. This needs a
path-aware default lookup (e.g., flatten defaults by full field path like
`person.address.city`, or adjust the reader to drill into the struct default
map for nested fields).
##########
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergRecordReader.java:
##########
@@ -173,7 +174,16 @@ private CloseableIterable openGeneric(FileScanTask task,
Schema readSchema) {
default -> throw new UnsupportedOperationException(
String.format("Cannot read %s file: %s", file.format().name(),
file.location()));
};
- return applyResidualFiltering(iterable, residual, readSchema);
+ return applyResidualFiltering(withStructInitialDefaultBackfill(iterable,
readSchema), residual, readSchema);
+ }
+
+ private CloseableIterable<T>
withStructInitialDefaultBackfill(CloseableIterable<T> iterable, Schema
readSchema) {
+ return CloseableIterable.transform(iterable, row -> {
+ if (row instanceof Record curIceRecord) {
+ HiveSchemaUtil.backfillStructInitialDefaults(curIceRecord,
readSchema.columns());
+ }
+ return row;
+ });
}
Review Comment:
`withStructInitialDefaultBackfill` wraps every generic iterable and, for
each row, scans *all* columns to find struct fields to backfill. For wide
schemas this adds a noticeable per-row overhead even when the read schema has
no struct fields (or no structs with nested `initialDefault`s). Consider
short-circuiting once per reader if the schema contains no struct initial
defaults, and return the original iterable unchanged in that case.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]