rdblue commented on code in PR #13445: URL: https://github.com/apache/iceberg/pull/13445#discussion_r2274536725
########## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/data/SparkParquetWriters.java: ########## @@ -63,8 +64,9 @@ private SparkParquetWriters() {} @SuppressWarnings("unchecked") public static <T> ParquetValueWriter<T> buildWriter(StructType dfSchema, MessageType type) { + StructType writeSchema = PruneNullType.prune(dfSchema); return (ParquetValueWriter<T>) - ParquetWithSparkSchemaVisitor.visit(dfSchema, type, new WriteBuilder(type)); + ParquetWithSparkSchemaVisitor.visit(writeSchema, type, new WriteBuilder(type)); Review Comment: I don't think that this is correct because this creates writers by position with the unknown field removed, but the unknown field is still present in the incoming `InternalRow`. The tests added in this PR in `AvroDataTestBase` won't catch the problem because the unknown writer is at the end of the row. But if you had a column after the unknown field (like `trailing_data` in the [types test](https://github.com/apache/iceberg/blob/main/spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/data/AvroDataTestBase.java#L196)) then that field would read data from the unknown field and always write null. It may also fail if it tries to read a non-null value with a specific type. I think this issue raises a question about how this PR implements unknown handling by omitting a writer. To avoid the alignment issue with the fields in the actual `InternalRow` instances coming in, this needs to either skip the position or have a `NoopWriter`. I think the noop option is easiest, but I think the reason for pruning the `StructType` was probably to avoid failing in the visitor because the Spark type doesn't have the same number of fields as the Parquet type where unknowns have been removed ([here](https://github.com/apache/iceberg/blob/main/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/data/ParquetWithSparkSchemaVisitor.java#L186-L187)). That issue is a direct consequence of not representing unknown fields in the Parquet type. Maybe we should rethink that decision and filter the Parquet schema later, like when creating a Parquet file. For now, we can probably work around the issue by updating the visitor logic to iterate over the fields and account for missing `NullType`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org