Chehai opened a new issue, #13674: URL: https://github.com/apache/iceberg/issues/13674
### Apache Iceberg version

1.7.1

### Query engine

Spark

### Please describe the bug 🐞

Hello, I'd like to get some help with a compaction OOM in Spark. Calling `rewrite_data_files` on a relatively large partition in an AWS Glue 5.0 (Spark 3.5.4, Iceberg 1.7.1) Spark job with 2 R.8X workers (256 GB each) always fails with `java.lang.OutOfMemoryError`. This issue is similar to the closed issue https://github.com/apache/iceberg/issues/10054.

```sql
CALL system.rewrite_data_files(
  table => 'some_db.some_table',
  where => "(partition_id = 'some_partition_id')",
  strategy => 'binpack',
  options => map(
    'partial-progress.enabled', 'true',
    'rewrite-job-order', 'bytes-asc',
    'target-file-size-bytes', '134217728',
    'partial-progress.max-commits', '50',
    'partial-progress.max-failed-commits', '1000',
    'max-file-group-size-bytes', '536870912',
    'max-concurrent-file-group-rewrites', '1',
    'min-input-files', '10'
  )
)
```

Partition:

```python
partition=Row(Row(partition_id='some_partition_id'), spec_id=0, record_count=45984621, file_count=589, position_delete_record_count=5, position_delete_file_count=271, equality_delete_record_count=17, equality_delete_file_count=585)
```

Stack trace:

```
WARN 2025-07-24T20:57:01,498 226477 org.apache.spark.scheduler.TaskSetManager [task-result-getter-1] 72 Lost task 2.0 in stage 1.0 (TID 3) (172.34.121.38 executor 1): java.lang.OutOfMemoryError: Java heap space
    at java.base/java.util.HashMap.resize(HashMap.java:702)
    at java.base/java.util.HashMap.putVal(HashMap.java:661)
    at java.base/java.util.HashMap.put(HashMap.java:610)
    at java.base/java.util.HashSet.add(HashSet.java:221)
    at org.apache.iceberg.util.StructLikeSet.add(StructLikeSet.java:102)
    at org.apache.iceberg.util.StructLikeSet.add(StructLikeSet.java:32)
    at org.apache.iceberg.relocated.com.google.common.collect.Iterators.addAll(Iterators.java:366)
    at org.apache.iceberg.relocated.com.google.common.collect.Iterables.addAll(Iterables.java:333)
    at org.apache.iceberg.data.BaseDeleteLoader.loadEqualityDeletes(BaseDeleteLoader.java:110)
    at org.apache.iceberg.data.DeleteFilter.applyEqDeletes(DeleteFilter.java:190)
    at org.apache.iceberg.data.DeleteFilter.eqDeletedRowFilter(DeleteFilter.java:220)
    at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader$ColumnBatchLoader.applyEqDelete(ColumnarBatchReader.java:230)
    at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader$ColumnBatchLoader.loadDataToColumnBatch(ColumnarBatchReader.java:104)
    at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:72)
    at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:44)
    at org.apache.iceberg.parquet.VectorizedParquetReader$CachedFileIterator.next(VectorizedParquetReader.java:272)
    at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:171)
    at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:120)
    at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:158)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1(DataSourceRDD.scala:63)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(DataSourceRDD.scala:63)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1$$Lambda$1336/0x00007f6230b92000.apply(Unknown Source)
    at scala.Option.exists(Option.scala:376)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:97)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hasNext(Unknown Source)
```
### Willingness to contribute

- [ ] I can contribute a fix for this bug independently
- [ ] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time
