Re: [PR] Spark: Read/Write `UnknownType` [iceberg]

via GitHub Wed, 13 Aug 2025 13:19:58 -0700


rdblue commented on code in PR #13445:
URL: https://github.com/apache/iceberg/pull/13445#discussion_r2274536725



##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/data/SparkParquetWriters.java:
##########
@@ -63,8 +64,9 @@ private SparkParquetWriters() {}
 
   @SuppressWarnings("unchecked")
   public static <T> ParquetValueWriter<T> buildWriter(StructType dfSchema, 
MessageType type) {
+    StructType writeSchema = PruneNullType.prune(dfSchema);
     return (ParquetValueWriter<T>)
-        ParquetWithSparkSchemaVisitor.visit(dfSchema, type, new 
WriteBuilder(type));
+        ParquetWithSparkSchemaVisitor.visit(writeSchema, type, new 
WriteBuilder(type));

Review Comment:
   I don't think that this is correct because this creates writers by position 
with the unknown field removed, but the unknown field is still present in the 
incoming `InternalRow`. The tests added in this PR in `AvroDataTestBase` won't 
catch the problem because the unknown writer is at the end of the row. But if 
you had a column after the unknown field (like `trailing_data` in the [types 
test](https://github.com/apache/iceberg/blob/main/spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/data/AvroDataTestBase.java#L196))
 then that field would read data from the unknown field and always write null. 
It may also fail if it tries to read a non-null value with a specific type.
   
   I think this issue raises a question about how this PR implements unknown 
handling by omitting a writer. To avoid the alignment issue with the fields in 
the actual `InternalRow` instances coming in, this needs to either skip the 
position or have a `NoopWriter`. I think the noop option is easiest, but I 
think the reason for pruning the `StructType` was probably to avoid failing in 
the visitor because the Spark type doesn't have the same number of fields as 
the Parquet type where unknowns have been removed 
([here](https://github.com/apache/iceberg/blob/main/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/data/ParquetWithSparkSchemaVisitor.java#L186-L187)).
   
   That issue is a direct consequence of not representing unknown fields in the 
Parquet type. Maybe we should rethink that decision and filter the Parquet 
schema later, like when creating a Parquet file. For now, we can probably work 
around the issue by updating the visitor logic to iterate over the fields and 
account for missing `NullType`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] Spark: Read/Write `UnknownType` [iceberg]

Reply via email to