singhpk234 commented on code in PR #7011:
URL: https://github.com/apache/iceberg/pull/7011#discussion_r1129010316
##########
core/src/main/java/org/apache/iceberg/FileMetadata.java:
##########
@@ -192,6 +196,15 @@ public Builder withMetrics(Metrics metrics) {
return this;
}
+ public Builder withSplitOffsets(List<Long> offsets) {
+ if (offsets != null) {
+ this.splitOffsets = KryoUtil.copyList(offsets);
+ } else {
+ this.splitOffsets = null;
+ }
Review Comment:
Agree with you that it's not required at the moment; I mostly took this from the DataFiles
builder. Now that I think about it again, IMHO we should explicitly set the value to null
when null is passed to the builder, since the default could change or the field could be
updated by another builder method. Please let me know your thoughts; happy to make changes
accordingly.
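To make the point concrete, here is a minimal, self-contained sketch of the semantics being discussed (the class and method names are illustrative, not Iceberg's actual `FileMetadata.Builder`, and the copy helper stands in for whatever internal copy utility the real builder uses): a non-null argument is defensively copied, while an explicit null resets the field, so a later `withSplitOffsets(null)` call undoes an earlier non-null one.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical simplified builder illustrating the null-handling under discussion.
class SplitOffsetsBuilder {
  private List<Long> splitOffsets = null;

  SplitOffsetsBuilder withSplitOffsets(List<Long> offsets) {
    if (offsets != null) {
      // defensive, unmodifiable copy: later mutation of the caller's list has no effect
      this.splitOffsets = Collections.unmodifiableList(new ArrayList<>(offsets));
    } else {
      // explicit reset: null overrides any previously set value
      this.splitOffsets = null;
    }
    return this;
  }

  List<Long> splitOffsets() {
    return splitOffsets;
  }
}

public class SplitOffsetsBuilderDemo {
  public static void main(String[] args) {
    List<Long> offsets = new ArrayList<>(List.of(4L, 64L));
    SplitOffsetsBuilder builder = new SplitOffsetsBuilder().withSplitOffsets(offsets);
    offsets.add(999L); // caller-side mutation must not leak into the builder
    System.out.println(builder.splitOffsets()); // [4, 64]
    System.out.println(builder.withSplitOffsets(null).splitOffsets()); // null
  }
}
```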
##########
core/src/test/java/org/apache/iceberg/TestSplitPlanning.java:
##########
@@ -249,6 +249,23 @@ public void testBasicSplitPlanningDeleteFiles() {
Assert.assertEquals(8,
Iterables.size(posDeletesTable.newBatchScan().planTasks()));
}
+ @Test
+ public void testBasicSplitPlanningDeleteFilesWithSplitOffsets() {
+ table.updateProperties().set(TableProperties.FORMAT_VERSION, "2").commit();
+ List<DeleteFile> files128Mb = newDeleteFiles(4, 128 * 1024 * 1024, 8);
+ appendDeleteFiles(files128Mb);
+
+ PositionDeletesTable posDeletesTable = new PositionDeletesTable(table);
+ // we expect 8 bins since split size is 64MB
+ Assert.assertEquals(
Review Comment:
Sure thing. I checked, and as per my understanding, tasks that belong to the same file
and for which `canMerge` is true get merged into a new `SplitPositionDeletesScanTask`.
So if each group holds one file split into two pieces, we would still end up with
`no. of task groups = no. of tasks = no. of files`, and I was wondering whether there is
a way to test this, i.e. to detect that those tasks were actually merged into a new task.
Hence I added a UT where one complete file plus half of another file land in the same
group.
Please let me know your thoughts.
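The bin count asserted in the test above is just size arithmetic. A back-of-the-envelope sketch, using the numbers from the test (4 delete files of 128 MB, split size 64 MB; this is illustrative arithmetic, not Iceberg's actual planner): each file is cut at its split offsets into sub-file splits, and packing those splits into 64 MB bins yields `fileSize / splitSize` bins per file.

```java
// Sketch of the expected bin count: 4 files x 128 MB packed into 64 MB bins.
public class SplitBinMath {
  public static void main(String[] args) {
    long fileSize = 128L * 1024 * 1024;  // per-file size from the test
    long splitSize = 64L * 1024 * 1024;  // table's read split size
    int numFiles = 4;
    long binsPerFile = fileSize / splitSize; // 2 bins per file
    System.out.println(numFiles * binsPerFile); // 8 bins, matching the assertion
  }
}
```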
##########
spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/source/TestPositionDeletesTable.java:
##########
@@ -230,9 +230,17 @@ public void testSplitTasks() throws IOException {
Table deleteTable =
MetadataTableUtils.createMetadataTableInstance(tab,
MetadataTableType.POSITION_DELETES);
- Assert.assertTrue(
- "Position delete scan should produce more than one split",
- Iterables.size(deleteTable.newBatchScan().planTasks()) > 1);
+
+ if (format.equals(FileFormat.AVRO)) {
+ Assert.assertTrue(
+ "Position delete scan should produce more than one split",
+ Iterables.size(deleteTable.newBatchScan().planTasks()) > 1);
+ } else {
+ Assert.assertEquals(
+ "Position delete scan should produce one split",
+ 1,
+ Iterables.size(deleteTable.newBatchScan().planTasks()));
+ }
Review Comment:
At the moment, `AvroFileAppender` doesn't populate split offsets (I'm not sure how to
fetch them from Avro natively, though considering
https://github.com/apache/iceberg/blob/3b37f581371c1fd70d2a56f5bd900ec7a8436e7e/api/src/main/java/org/apache/iceberg/FileFormat.java#L27-L30,
it looks like Avro is splittable). In `OrcFileAppender` and `ParquetWriter` we fetch the
split offsets from the ORC stripes and the Parquet footer, respectively.
Hence the difference in the assertions at the moment.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]