Copilot commented on code in PR #6565:
URL: https://github.com/apache/hive/pull/6565#discussion_r3496769449


##########
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedDummyColumnReader.java:
##########
@@ -64,6 +68,41 @@ public void readBatch(int total, ColumnVector col, TypeInfo 
typeInfo) throws IOE
 
     if (typeInfo.getCategory() == ObjectInspector.Category.PRIMITIVE) {
       fillPrimitive(col, (PrimitiveTypeInfo) typeInfo, defaultValue);
+    } else if (typeInfo.getCategory() == ObjectInspector.Category.STRUCT) {
+      fillStruct(col, (StructTypeInfo) typeInfo, defaultValue);
+    } else {
+      throw new IOException("Unsupported type category in DummyColumnReader: " 
+ typeInfo.getCategory());
+    }

Review Comment:
   `defaultValue` can be a nested Map (struct defaults), but `readBatch()` 
always treats any non-null default as compatible with the requested `typeInfo`. 
If a nested field is missing from the Parquet file schema, the dummy reader can 
be built for that *nested* primitive and receive the *parent struct's* Map 
default, which will currently blow up with a `ClassCastException` in 
`fillPrimitive` (e.g., `(Number) value`). Consider defensively converting such 
type mismatches into a clear `IOException` (or treating it as null) to avoid 
hard-to-debug runtime failures.



##########
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/vector/HiveVectorizedReader.java:
##########
@@ -187,6 +188,12 @@ static Map<String, Object> 
getInitialColumnDefaults(List<Types.NestedField> colu
     for (Types.NestedField column : columns) {
       if (column.initialDefault() != null) {
         columnDefaults.put(column.name(), column.initialDefault());
+      } else if (column.type().isStructType()) {
+        Map<String, Object> structDefaults =
+            
HiveSchemaUtil.getStructInitialDefaults(column.type().asStructType());
+        if (!structDefaults.isEmpty()) {
+          columnDefaults.put(column.name(), structDefaults);
+        }

Review Comment:
   `getInitialColumnDefaults` now stores struct defaults as a nested `Map` 
under the top-level column name. The Parquet vectorized reader's dummy-path for 
*missing nested fields* looks up defaults using only `descriptor.getPath()[0]` 
(top-level column), so a missing nested primitive field can receive this Map 
and crash in `VectorizedDummyColumnReader.fillPrimitive`. This needs a 
path-aware default lookup (e.g., flatten defaults by full field path like 
`person.address.city`, or adjust the reader to drill into the struct default 
map for nested fields).



##########
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergRecordReader.java:
##########
@@ -173,7 +174,16 @@ private CloseableIterable openGeneric(FileScanTask task, 
Schema readSchema) {
       default -> throw new UnsupportedOperationException(
           String.format("Cannot read %s file: %s", file.format().name(), 
file.location()));
     };
-    return applyResidualFiltering(iterable, residual, readSchema);
+    return applyResidualFiltering(withStructInitialDefaultBackfill(iterable, 
readSchema), residual, readSchema);
+  }
+
+  private CloseableIterable<T> 
withStructInitialDefaultBackfill(CloseableIterable<T> iterable, Schema 
readSchema) {
+    return CloseableIterable.transform(iterable, row -> {
+      if (row instanceof Record curIceRecord) {
+        HiveSchemaUtil.backfillStructInitialDefaults(curIceRecord, 
readSchema.columns());
+      }
+      return row;
+    });
   }

Review Comment:
   `withStructInitialDefaultBackfill` wraps every generic iterable and, for 
each row, scans *all* columns to find struct fields to backfill. For wide 
schemas this adds a noticeable per-row overhead even when the read schema has 
no struct fields (or no structs with nested `initialDefault`s). Consider 
short-circuiting once per reader if the schema contains no struct initial 
defaults, and return the original iterable unchanged in that case.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to