jackye1995 commented on a change in pull request #3454:
URL: https://github.com/apache/iceberg/pull/3454#discussion_r741539664



##########
File path: 
spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/actions/TestNewRewriteDataFilesAction.java
##########
@@ -177,6 +187,59 @@ public void testBinPackWithFilter() {
     assertEquals("Rows must match", expectedRecords, actualRecords);
   }
 
+  @Test
+  public void testBinPackWithDeletes() throws Exception {
+    Table table = createTablePartitioned(4, 2);
+    table.updateProperties().set(TableProperties.FORMAT_VERSION, "2").commit();
+    shouldHaveFiles(table, 8);
+    table.refresh();
+
+    CloseableIterable<FileScanTask> tasks = table.newScan().planFiles();
+    List<DataFile> dataFiles = Lists.newArrayList(CloseableIterable.transform(tasks, FileScanTask::file));
+    GenericAppenderFactory appenderFactory = new GenericAppenderFactory(table.schema(), table.spec(), null, null, null);
+    int total = (int) dataFiles.stream().mapToLong(ContentFile::recordCount).sum();
+
+    RowDelta rowDelta = table.newRowDelta();
+    // remove 2 rows for odd files, 1 row for even files
+    for (int i = 0; i < dataFiles.size(); i++) {
+      DataFile dataFile = dataFiles.get(i);
+      EncryptedOutputFile outputFile = EncryptedFiles.encryptedOutput(
+          table.io().newOutputFile(table.locationProvider().newDataLocation(UUID.randomUUID().toString())),
+          EncryptionKeyMetadata.EMPTY);
+      PositionDeleteWriter<Record> posDeleteWriter = appenderFactory.newPosDeleteWriter(
+          outputFile, FileFormat.PARQUET, dataFile.partition());
+      posDeleteWriter.delete(dataFile.path(), 0);
+      posDeleteWriter.close();
+      rowDelta.addDeletes(posDeleteWriter.toDeleteFile());
+
+      if (i % 2 != 0) {
+        outputFile = EncryptedFiles.encryptedOutput(
+            table.io().newOutputFile(table.locationProvider().newDataLocation(UUID.randomUUID().toString())),
+            EncryptionKeyMetadata.EMPTY);
+        posDeleteWriter = appenderFactory.newPosDeleteWriter(outputFile, FileFormat.PARQUET, dataFile.partition());
+        posDeleteWriter.delete(dataFile.path(), 1);
+        posDeleteWriter.close();
+        rowDelta.addDeletes(posDeleteWriter.toDeleteFile());
+      }

Review comment:
       I know there is some repeated code here for generating deletes. So far 
I am still not sure what the correct boundary for util methods is. I am 
planning to refactor after I add more tests for the `RewriteDeleteStrategy`.
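
       A hypothetical extraction of the duplicated delete-writing block might look like the sketch below. This is only an illustration of one possible boundary, not the PR's actual refactor; the helper name `writePosDelete` and its signature are assumptions, while the Iceberg calls mirror the test code above.

```java
// Hypothetical helper, sketched from the duplicated block in the test above.
// Assumes the same imports as the surrounding test class; the real refactor
// may choose a different boundary or signature.
private DeleteFile writePosDelete(Table table, GenericAppenderFactory appenderFactory,
                                  DataFile dataFile, long position) throws IOException {
  // allocate a new randomly-named output file in the table's data location
  EncryptedOutputFile outputFile = EncryptedFiles.encryptedOutput(
      table.io().newOutputFile(table.locationProvider().newDataLocation(UUID.randomUUID().toString())),
      EncryptionKeyMetadata.EMPTY);
  // write a single position delete pointing at the given row of the data file
  PositionDeleteWriter<Record> posDeleteWriter = appenderFactory.newPosDeleteWriter(
      outputFile, FileFormat.PARQUET, dataFile.partition());
  posDeleteWriter.delete(dataFile.path(), position);
  posDeleteWriter.close();
  return posDeleteWriter.toDeleteFile();
}
```

       With a helper like this, each repeated block in the loop would collapse to something like `rowDelta.addDeletes(writePosDelete(table, appenderFactory, dataFile, 0));`.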

##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -75,7 +75,20 @@
   public static final String MAX_FILE_SIZE_BYTES = "max-file-size-bytes";
   public static final double MAX_FILE_SIZE_DEFAULT_RATIO = 1.80d;
 
+  /**
+   * The minimum number of deletes that need to be associated with a data file for it to be considered for rewriting.
+   * If a data file has more than this number of deletes, it will be rewritten regardless of its file size as determined
+   * by {@link #MIN_FILE_SIZE_BYTES} and {@link #MAX_FILE_SIZE_BYTES}.
+   * If a file group contains a file that satisfies this condition, the file group will be rewritten regardless of
+   * the number of files in the file group determined by {@link #MIN_INPUT_FILES}.
+   * <p>
+   * Defaults to Integer.MAX_VALUE, which means this feature is not enabled by default.
+   */
+  public static final String MIN_DELETES_PER_FILE = "min-deletes-per-file";

Review comment:
       I think that is equivalent to setting this value to 0?
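
       The selection rule the javadoc describes can be sketched in isolation. This is an illustrative stand-in, not the actual `BinPackStrategy` code: the class and method names, the map-based input, and the `>=` comparison are assumptions (the javadoc's "more than" could also mean a strict `>` in the real implementation).

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch of the min-deletes-per-file selection rule; names and
// the exact comparison are assumptions, not the PR's implementation.
class MinDeletesFilter {
  // Returns the files whose delete count reaches the threshold. With the
  // default threshold of Integer.MAX_VALUE no file ever qualifies, which is
  // why the feature is effectively disabled by default; a threshold of 0
  // would select every file.
  static List<String> filesToRewrite(Map<String, Long> deleteCountsByFile, int minDeletesPerFile) {
    return deleteCountsByFile.entrySet().stream()
        .filter(e -> e.getValue() >= minDeletesPerFile)
        .map(Map.Entry::getKey)
        .sorted()
        .collect(Collectors.toList());
  }
}
```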

##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -75,7 +75,20 @@
   public static final String MAX_FILE_SIZE_BYTES = "max-file-size-bytes";
   public static final double MAX_FILE_SIZE_DEFAULT_RATIO = 1.80d;
 
+  /**
+   * The minimum number of deletes that need to be associated with a data file for it to be considered for rewriting.
+   * If a data file has more than this number of deletes, it will be rewritten regardless of its file size as determined
+   * by {@link #MIN_FILE_SIZE_BYTES} and {@link #MAX_FILE_SIZE_BYTES}.
+   * If a file group contains a file that satisfies this condition, the file group will be rewritten regardless of
+   * the number of files in the file group determined by {@link #MIN_INPUT_FILES}.
+   * <p>
+   * Defaults to Integer.MAX_VALUE, which means this feature is not enabled by default.
+   */
+  public static final String MIN_DELETES_PER_FILE = "min-deletes-per-file";

Review comment:
       I think the exact value to set depends on the user's tolerance for read 
performance: more deletes mean worse read performance and can potentially 
cause out-of-memory errors, so users can tune this value based on their 
system requirements.

##########
File path: 
core/src/test/java/org/apache/iceberg/actions/TestBinPackStrategy.java
##########
@@ -33,6 +33,7 @@
 import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
 import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
 import org.apache.iceberg.relocated.com.google.common.collect.Iterables;
+import org.assertj.core.util.Lists;

Review comment:
       Oh nice catch, we should not use that. We need to add it to checkstyle.

##########
File path: 
spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/actions/TestNewRewriteDataFilesAction.java
##########
@@ -177,6 +187,59 @@ public void testBinPackWithFilter() {
     assertEquals("Rows must match", expectedRecords, actualRecords);
   }
 
+  @Test
+  public void testBinPackWithDeletes() throws Exception {
+    Table table = createTablePartitioned(4, 2);

Review comment:
       I think this is not needed, because you can see the meaning of the 
parameters in IntelliJ.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


