nastra opened a new pull request, #13222: URL: https://github.com/apache/iceberg/pull/13222
This keeps track of all data files to be removed/rewritten in **MergingSnapshotProducer** and passes them to the **ManifestFilterManager** for deletes. When the **ManifestFilterManager** goes through the delete manifests, it also checks whether a DV references any of the data files to be removed. This is needed in addition to https://github.com/apache/iceberg/pull/13245 so that we can properly remove orphaned DVs when e.g. a metadata-only delete is performed.

The benchmark below shows that tracking the data files to be removed, and then detecting orphaned DVs while the delete manifests are processed, adds only a small overhead to the time per operation (scores are in s/op, so lower is better).

without tracking data files to be removed
=========================================
```
Benchmark                                   (numFiles)  (percentDataFilesRewritten)  Mode  Cnt   Score   Error  Units
RewriteDataFilesBenchmark.rewriteDataFiles       50000                            5    ss    5   0.497 ± 0.061  s/op
RewriteDataFilesBenchmark.rewriteDataFiles       50000                           25    ss    5   0.649 ± 0.080  s/op
RewriteDataFilesBenchmark.rewriteDataFiles       50000                           50    ss    5   1.889 ± 0.096  s/op
RewriteDataFilesBenchmark.rewriteDataFiles       50000                          100    ss    5   2.093 ± 0.125  s/op
RewriteDataFilesBenchmark.rewriteDataFiles      100000                            5    ss    5   0.503 ± 0.040  s/op
RewriteDataFilesBenchmark.rewriteDataFiles      100000                           25    ss    5   1.941 ± 0.154  s/op
RewriteDataFilesBenchmark.rewriteDataFiles      100000                           50    ss    5   2.139 ± 0.165  s/op
RewriteDataFilesBenchmark.rewriteDataFiles      100000                          100    ss    5   2.474 ± 0.149  s/op
RewriteDataFilesBenchmark.rewriteDataFiles      500000                            5    ss    5   1.054 ± 0.067  s/op
RewriteDataFilesBenchmark.rewriteDataFiles      500000                           25    ss    5   2.577 ± 0.247  s/op
RewriteDataFilesBenchmark.rewriteDataFiles      500000                           50    ss    5   3.318 ± 1.121  s/op
RewriteDataFilesBenchmark.rewriteDataFiles      500000                          100    ss    5   5.792 ± 1.725  s/op
RewriteDataFilesBenchmark.rewriteDataFiles     1000000                            5    ss    5   1.352 ± 0.122  s/op
RewriteDataFilesBenchmark.rewriteDataFiles     1000000                           25    ss    5   3.252 ± 0.325  s/op
RewriteDataFilesBenchmark.rewriteDataFiles     1000000                           50    ss    5   4.887 ± 0.548  s/op
RewriteDataFilesBenchmark.rewriteDataFiles     1000000                          100    ss    5   8.297 ± 1.991  s/op
RewriteDataFilesBenchmark.rewriteDataFiles     2000000                            5    ss    5   2.536 ± 0.232  s/op
RewriteDataFilesBenchmark.rewriteDataFiles     2000000                           25    ss    5   5.227 ± 1.042  s/op
RewriteDataFilesBenchmark.rewriteDataFiles     2000000                           50    ss    5   7.545 ± 2.052  s/op
RewriteDataFilesBenchmark.rewriteDataFiles     2000000                          100    ss    5  18.058 ± 4.773  s/op
```

with tracking data files to be removed
======================================
```
RewriteDataFilesBenchmark.rewriteDataFiles       50000                            5    ss    5   0.626 ± 0.080  s/op
RewriteDataFilesBenchmark.rewriteDataFiles       50000                           25    ss    5   0.813 ± 0.081  s/op
RewriteDataFilesBenchmark.rewriteDataFiles       50000                           50    ss    5   2.013 ± 0.037  s/op
RewriteDataFilesBenchmark.rewriteDataFiles       50000                          100    ss    5   2.316 ± 0.137  s/op
RewriteDataFilesBenchmark.rewriteDataFiles      100000                            5    ss    5   0.676 ± 0.036  s/op
RewriteDataFilesBenchmark.rewriteDataFiles      100000                           25    ss    5   2.036 ± 0.105  s/op
RewriteDataFilesBenchmark.rewriteDataFiles      100000                           50    ss    5   2.125 ± 0.107  s/op
RewriteDataFilesBenchmark.rewriteDataFiles      100000                          100    ss    5   2.856 ± 0.138  s/op
RewriteDataFilesBenchmark.rewriteDataFiles      500000                            5    ss    5   1.319 ± 0.049  s/op
RewriteDataFilesBenchmark.rewriteDataFiles      500000                           25    ss    5   3.314 ± 0.265  s/op
RewriteDataFilesBenchmark.rewriteDataFiles      500000                           50    ss    5   4.219 ± 0.498  s/op
RewriteDataFilesBenchmark.rewriteDataFiles      500000                          100    ss    5   6.164 ± 0.736  s/op
RewriteDataFilesBenchmark.rewriteDataFiles     1000000                            5    ss    5   2.010 ± 0.224  s/op
RewriteDataFilesBenchmark.rewriteDataFiles     1000000                           25    ss    5   3.961 ± 0.137  s/op
RewriteDataFilesBenchmark.rewriteDataFiles     1000000                           50    ss    5   5.771 ± 0.839  s/op
RewriteDataFilesBenchmark.rewriteDataFiles     1000000                          100    ss    5   9.082 ± 0.869  s/op
RewriteDataFilesBenchmark.rewriteDataFiles     2000000                            5    ss    5   3.772 ± 0.602  s/op
RewriteDataFilesBenchmark.rewriteDataFiles     2000000                           25    ss    5   6.558 ± 0.939  s/op
RewriteDataFilesBenchmark.rewriteDataFiles     2000000                           50    ss    5   9.179 ± 1.330  s/op
RewriteDataFilesBenchmark.rewriteDataFiles     2000000                          100    ss    5  23.623 ± 8.387  s/op
```
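
The tracking described in this PR can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in the PR: `OrphanedDvSketch`, `DeleteEntry`, `removeDataFile` and `filterDeletes` are hypothetical names, and file paths are plain strings rather than Iceberg's own file types. The idea it shows is that the set of data files to be removed is collected first, and each DV seen while walking the delete manifests is dropped if it references a file in that set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class OrphanedDvSketch {

  /** Hypothetical stand-in for a delete file entry; not Iceberg's DeleteFile interface. */
  public static final class DeleteEntry {
    public final String path;
    public final boolean isDV;
    public final String referencedDataFile; // null when the delete is not tied to one data file

    public DeleteEntry(String path, boolean isDV, String referencedDataFile) {
      this.path = path;
      this.isDV = isDV;
      this.referencedDataFile = referencedDataFile;
    }
  }

  // Paths of data files scheduled for removal (assumed to be collected on the producer side).
  private final Set<String> dataFilesToRemove = new HashSet<>();

  /** Records a data file that the pending commit removes or rewrites. */
  public void removeDataFile(String path) {
    dataFilesToRemove.add(path);
  }

  /** Returns the entries to keep; a DV whose referenced data file is being removed is dropped. */
  public List<DeleteEntry> filterDeletes(List<DeleteEntry> entries) {
    List<DeleteEntry> kept = new ArrayList<>();
    for (DeleteEntry entry : entries) {
      boolean orphaned =
          entry.isDV
              && entry.referencedDataFile != null
              && dataFilesToRemove.contains(entry.referencedDataFile);
      if (!orphaned) {
        kept.add(entry);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    OrphanedDvSketch sketch = new OrphanedDvSketch();
    sketch.removeDataFile("data-1.parquet");

    List<DeleteEntry> kept =
        sketch.filterDeletes(
            Arrays.asList(
                new DeleteEntry("dv-1.puffin", true, "data-1.parquet"),
                new DeleteEntry("dv-2.puffin", true, "data-2.parquet")));

    // Print the delete files that survive the filter.
    for (DeleteEntry entry : kept) {
      System.out.println(entry.path);
    }
  }
}
```

Using a hash set keeps the per-DV check at a constant-time lookup, so the extra work scales with the number of delete entries visited plus the number of data files tracked, which is consistent with the additional cost showing up in the benchmark above.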
