RussellSpitzer commented on a change in pull request #3454:
URL: https://github.com/apache/iceberg/pull/3454#discussion_r741622389
##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -75,7 +75,20 @@
public static final String MAX_FILE_SIZE_BYTES = "max-file-size-bytes";
public static final double MAX_FILE_SIZE_DEFAULT_RATIO = 1.80d;
+ /**
+  * The minimum number of deletes that needs to be associated with a data file for it to be considered for rewriting.
+  * If a data file has this number of deletes or more, it will be rewritten regardless of its file size, as determined
+  * by {@link #MIN_FILE_SIZE_BYTES} and {@link #MAX_FILE_SIZE_BYTES}.
+  * If a file group contains a file that satisfies this condition, the file group will be rewritten regardless of
+  * the number of files in the file group, as determined by {@link #MIN_INPUT_FILES}.
+  * <p>
+  * Defaults to Integer.MAX_VALUE, which means this feature is not enabled by default.
+  */
+ public static final String MIN_DELETES_PER_FILE = "min-deletes-per-file";
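For context, a minimal usage sketch of how the proposed option could be set on a Spark rewrite action. The option key is the one proposed in this PR and may still change; the threshold values below are arbitrary examples:

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.spark.actions.SparkActions;

public class RewriteWithDeleteThreshold {
  public static void rewrite(Table table) {
    SparkActions.get()
        .rewriteDataFiles(table)
        .binPack()                               // BinPackStrategy is the default strategy
        .option("min-deletes-per-file", "2")     // proposed option: rewrite files with 2+ associated deletes
        .option("min-input-files", "5")          // existing bin-pack threshold still applies otherwise
        .execute();
  }
}
```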
Review comment:
I think it's fine to just do this based on the number of delete files, since the
read penalty is directly related to the number of files and less so to the
actual number of rows deleted.
No strong feeling on the default. Since we already have the number of delete
files in the task information, I don't think the check is very expensive.
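To illustrate the file-count-based approach, here is a rough, self-contained sketch of a check keyed off the number of delete files already attached to each FileScanTask. The class and field names (DeleteAwareSelection, minDeletesPerFile) are illustrative, not the PR's actual implementation:

```java
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;
import org.apache.iceberg.FileScanTask;

public class DeleteAwareSelection {
  // Illustrative fields; in BinPackStrategy these values come from options(Map).
  private final int minDeletesPerFile;
  private final long minFileSize;
  private final long maxFileSize;

  DeleteAwareSelection(int minDeletesPerFile, long minFileSize, long maxFileSize) {
    this.minDeletesPerFile = minDeletesPerFile;
    this.minFileSize = minFileSize;
    this.maxFileSize = maxFileSize;
  }

  // Select tasks whose files are outside the size range OR have too many associated delete files.
  public Iterable<FileScanTask> selectFilesToRewrite(Iterable<FileScanTask> dataFiles) {
    return StreamSupport.stream(dataFiles.spliterator(), false)
        .filter(task -> outsideSizeRange(task) || tooManyDeletes(task))
        .collect(Collectors.toList());
  }

  private boolean outsideSizeRange(FileScanTask task) {
    return task.length() < minFileSize || task.length() > maxFileSize;
  }

  // The delete files are already carried on the scan task, so this check only
  // counts them; it never reads delete contents or row counts.
  private boolean tooManyDeletes(FileScanTask task) {
    return task.deletes() != null && task.deletes().size() >= minDeletesPerFile;
  }
}
```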