mgmarino commented on issue #12046:
URL: https://github.com/apache/iceberg/issues/12046#issuecomment-2612986973
After some further investigation, my initial conclusions are the following:
- I can see `SerializableTableWithSize` being generated on the driver in at
  least two different places:
  - `org.apache.iceberg.spark.source.SparkWrite.createWriterFactory`:
    https://github.com/apache/iceberg/blob/6e2bc9ac4ef9ca9afeff66814de6567ae63da9da/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java#L190
  - `org.apache.iceberg.spark.source.SparkBatch.planInputPartitions`:
    https://github.com/apache/iceberg/blob/6e2bc9ac4ef9ca9afeff66814de6567ae63da9da/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkBatch.java#L78

  Both of these tables point to the same `FileIO` object (in this case
  `S3FileIO`).
- If these tasks get submitted to the same executor, the deserialized tables
  will still point to the *same* `FileIO` object, meaning that when one table
  gets cleaned up (and its IO closed), the other is affected as well (see the
  sketch after this list).
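
To make the failure mode concrete, here is a minimal sketch of the sharing
problem. The class and method are illustrative, not Iceberg's actual cleanup
path; only `FileIO.close()` and `newInputFile` are real API:

```java
import org.apache.iceberg.io.FileIO;

class SharedFileIoSketch {
  // Hypothetical scenario: two tasks on the same executor each hold a
  // deserialized table, but both tables resolve to one shared FileIO.
  static void demo(FileIO sharedIo) {
    FileIO ioForWrite = sharedIo; // e.g. from SparkWrite.createWriterFactory
    FileIO ioForScan = sharedIo;  // e.g. from SparkBatch.planInputPartitions

    // Task A completes and its table is cleaned up, closing the FileIO
    // (and, for S3FileIO, the underlying S3 client):
    ioForWrite.close();

    // Task B still holds the same object, so any subsequent use fails,
    // typically with an "already closed" error from the client:
    ioForScan.newInputFile("s3://bucket/path/data.parquet");
  }
}
```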
I am not sure what a good solution is here, but I suspect that the `FileIO`
may need to be copied when creating the serializable table, instead of reusing
the original table's instance as is done now:
https://github.com/apache/iceberg/blob/6e2bc9ac4ef9ca9afeff66814de6567ae63da9da/core/src/main/java/org/apache/iceberg/SerializableTable.java#L123
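
As a strawman, copying might look roughly like the sketch below. This is only
an illustration: `copyOf` is a hypothetical helper, `FileIO.properties()` is
not implemented by every `FileIO`, and Hadoop-configurable IOs would need
additional handling.

```java
import java.util.Map;

import org.apache.iceberg.CatalogUtil;
import org.apache.iceberg.io.FileIO;

class FileIoCopySketch {
  // Hypothetical helper: build a fresh FileIO with the same implementation and
  // properties as the original, so each serialized table owns (and can safely
  // close) its own instance.
  static FileIO copyOf(FileIO original) {
    String impl = original.getClass().getName();
    // properties() is a default method that some FileIO implementations
    // (e.g. S3FileIO) expose; others may throw UnsupportedOperationException.
    Map<String, String> props = original.properties();
    // hadoopConf is omitted in this sketch; HadoopConfigurable IOs would need
    // it passed through, as SerializableTable already does for serialization.
    return CatalogUtil.loadFileIO(impl, props, null);
  }
}
```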
Would love to get some input here!