mstebelev opened a new issue, #7325: URL: https://github.com/apache/iceberg/issues/7325
### Query engine

Spark

### Question

I noticed that at the end of its run, RewriteManifests copies each manifest sequentially in a single thread, which takes a lot of time. The stack in the UI looks like this:

```
[email protected]/java.util.zip.Inflater.inflateBytesBytes(Native Method)
[email protected]/java.util.zip.Inflater.inflate(Inflater.java:385) => holding Monitor(java.util.zip.Inflater$InflaterZStreamRef@1147147444})
[email protected]/java.util.zip.InflaterOutputStream.write(InflaterOutputStream.java:253)
app//org.apache.iceberg.shaded.org.apache.avro.file.DeflateCodec.decompress(DeflateCodec.java:83)
app//org.apache.iceberg.shaded.org.apache.avro.file.DataFileStream$DataBlock.decompressUsing(DataFileStream.java:392)
app//org.apache.iceberg.shaded.org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:226)
app//org.apache.iceberg.avro.AvroIterable$AvroReuseIterator.hasNext(AvroIterable.java:191)
app//org.apache.iceberg.io.CloseableIterable$7$1.hasNext(CloseableIterable.java:197)
app//org.apache.iceberg.ManifestFiles.copyManifestInternal(ManifestFiles.java:311)
app//org.apache.iceberg.ManifestFiles.copyRewriteManifest(ManifestFiles.java:288)
app//org.apache.iceberg.BaseRewriteManifests.copyManifest(BaseRewriteManifests.java:166)
app//org.apache.iceberg.BaseRewriteManifests.addManifest(BaseRewriteManifests.java:155)
app//org.apache.iceberg.spark.actions.RewriteManifestsSparkAction$$Lambda$3660/0x00007f746f6fe4b0.accept(Unknown Source)
[email protected]/java.util.Arrays$ArrayList.forEach(Arrays.java:4390)
app//org.apache.iceberg.spark.actions.RewriteManifestsSparkAction.replaceManifests(RewriteManifestsSparkAction.java:342)
app//org.apache.iceberg.spark.actions.RewriteManifestsSparkAction.doExecute(RewriteManifestsSparkAction.java:193)
app//org.apache.iceberg.spark.actions.RewriteManifestsSparkAction$$Lambda$2901/0x00007f74e63aec58.get(Unknown Source)
app//org.apache.iceberg.spark.actions.BaseSparkAction.withJobGroupInfo(BaseSparkAction.java:127)
app//org.apache.iceberg.spark.actions.RewriteManifestsSparkAction.execute(RewriteManifestsSparkAction.java:158)
app//org.apache.iceberg.spark.procedures.RewriteManifestsProcedure.lambda$call$0(RewriteManifestsProcedure.java:107)
app//org.apache.iceberg.spark.procedures.RewriteManifestsProcedure$$Lambda$2897/0x00007f74e636c508.apply(Unknown Source)
app//org.apache.iceberg.spark.procedures.BaseProcedure.execute(BaseProcedure.java:100)
app//org.apache.iceberg.spark.procedures.BaseProcedure.modifyIcebergTable(BaseProcedure.java:81)
app//org.apache.iceberg.spark.procedures.RewriteManifestsProcedure.call(RewriteManifestsProcedure.java:92)
app//org.apache.spark.sql.execution.datasources.v2.CallExec.run(CallExec.scala:34)
```

After looking at the code, I found that I can probably disable this copying by setting the property 'compatibility.snapshot-id-inheritance.enabled'='true', but it is poorly documented and I'm not sure whether it is safe to use. From the discussion in https://github.com/apache/iceberg/pull/675, it looks like a flag for writing manifests in a way that old versions of readers are able to read. Is that its only purpose?

Can somebody provide any insight into the consequences of setting that property, or advice on how to improve RewriteManifests performance in a different way?
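
For reference, a minimal sketch of how the property in question would be applied from Spark SQL before rerunning the procedure. The catalog and table names (`my_catalog`, `db.my_table`) are placeholders rather than names from my setup, and whether enabling the property is safe is exactly what I'm asking about:

```sql
-- Placeholder names: my_catalog, db.my_table.
-- Sets the table property that appears to skip the manifest copy step.
ALTER TABLE my_catalog.db.my_table
SET TBLPROPERTIES ('compatibility.snapshot-id-inheritance.enabled' = 'true');

-- Rerun the manifest rewrite through the Spark procedure seen in the stack trace.
CALL my_catalog.system.rewrite_manifests('db.my_table');
```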
