singhpk234 commented on code in PR #7011:
URL: https://github.com/apache/iceberg/pull/7011#discussion_r1129010316
##########
core/src/main/java/org/apache/iceberg/FileMetadata.java:
##########
@@ -192,6 +196,15 @@ public Builder withMetrics(Metrics metrics) {
return this;
}
+ public Builder withSplitOffsets(List<Long> offsets) {
+ if (offsets != null) {
+ this.splitOffsets = KryoUtil.copyList(offsets);
+ } else {
+ this.splitOffsets = null;
+ }
Review Comment:
Agree with you that it's not required at the moment; I mostly took this from the DataFiles
builder. Now that I think about it again, IMHO we should explicitly set the value to null
when null is passed to the builder, since the default could change or the field could be
updated by another builder method. Please let me know your thoughts; happy to make changes
accordingly.
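To make the point concrete, here is a minimal, self-contained sketch of the semantics being discussed (the class and method names are illustrative, not Iceberg's actual `FileMetadata.Builder`, and the copy helper stands in for whatever internal copy utility the real builder uses): a non-null argument is defensively copied, while an explicit null resets the field, so a later `withSplitOffsets(null)` call undoes an earlier non-null one.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical simplified builder illustrating the null-handling under discussion.
class SplitOffsetsBuilder {
  private List<Long> splitOffsets = null;

  SplitOffsetsBuilder withSplitOffsets(List<Long> offsets) {
    if (offsets != null) {
      // defensive, unmodifiable copy: later mutation of the caller's list has no effect
      this.splitOffsets = Collections.unmodifiableList(new ArrayList<>(offsets));
    } else {
      // explicit reset: null overrides any previously set value
      this.splitOffsets = null;
    }
    return this;
  }

  List<Long> splitOffsets() {
    return splitOffsets;
  }
}

public class SplitOffsetsBuilderDemo {
  public static void main(String[] args) {
    List<Long> offsets = new ArrayList<>(List.of(4L, 64L));
    SplitOffsetsBuilder builder = new SplitOffsetsBuilder().withSplitOffsets(offsets);
    offsets.add(999L); // caller-side mutation must not leak into the builder
    System.out.println(builder.splitOffsets()); // [4, 64]
    System.out.println(builder.withSplitOffsets(null).splitOffsets()); // null
  }
}
```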
##########
core/src/test/java/org/apache/iceberg/TestSplitPlanning.java:
##########
@@ -249,6 +249,23 @@ public void testBasicSplitPlanningDeleteFiles() {
Assert.assertEquals(8,
Iterables.size(posDeletesTable.newBatchScan().planTasks()));
}
+ @Test
+ public void testBasicSplitPlanningDeleteFilesWithSplitOffsets() {
+ table.updateProperties().set(TableProperties.FORMAT_VERSION, "2").commit();
+ List<DeleteFile> files128Mb = newDeleteFiles(4, 128 * 1024 * 1024, 8);
+ appendDeleteFiles(files128Mb);
+
+ PositionDeletesTable posDeletesTable = new PositionDeletesTable(table);
+ // we expect 8 bins since split size is 64MB
+ Assert.assertEquals(
Review Comment:
Sure thing. I checked, and as per my understanding, tasks that belong to the same file
and for which `canMerge` is true get merged into a new `SplitPositionDeletesScanTask`.
So if each group holds one file split into two pieces, we would still end up with
`no. of task groups = no. of tasks = no. of files`, and I was wondering whether there is
a way to test this, i.e. to detect that those tasks were actually merged into a new task.
Hence I added a UT where one complete file plus half of another file land in the same
group.
Please let me know your thoughts.
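The bin count asserted in the test above is just size arithmetic. A back-of-the-envelope sketch, using the numbers from the test (4 delete files of 128 MB, split size 64 MB; this is illustrative arithmetic, not Iceberg's actual planner): each file is cut at its split offsets into sub-file splits, and packing those splits into 64 MB bins yields `fileSize / splitSize` bins per file.

```java
// Sketch of the expected bin count: 4 files x 128 MB packed into 64 MB bins.
public class SplitBinMath {
  public static void main(String[] args) {
    long fileSize = 128L * 1024 * 1024;  // per-file size from the test
    long splitSize = 64L * 1024 * 1024;  // table's read split size
    int numFiles = 4;
    long binsPerFile = fileSize / splitSize; // 2 bins per file
    System.out.println(numFiles * binsPerFile); // 8 bins, matching the assertion
  }
}
```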
##########
spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/source/TestPositionDeletesTable.java:
##########
@@ -230,9 +230,17 @@ public void testSplitTasks() throws IOException {
Table deleteTable =
MetadataTableUtils.createMetadataTableInstance(tab,
MetadataTableType.POSITION_DELETES);
- Assert.assertTrue(
- "Position delete scan should produce more than one split",
- Iterables.size(deleteTable.newBatchScan().planTasks()) > 1);
+
+ if (format.equals(FileFormat.AVRO)) {
+ Assert.assertTrue(
+ "Position delete scan should produce more than one split",
+ Iterables.size(deleteTable.newBatchScan().planTasks()) > 1);
+ } else {
+ Assert.assertEquals(
+ "Position delete scan should produce one split",
+ 1,
+ Iterables.size(deleteTable.newBatchScan().planTasks()));
+ }
Review Comment:
At the moment, `AvroFileAppender` doesn't populate split offsets (I'm not sure how to
fetch them from Avro natively, though considering
https://github.com/apache/iceberg/blob/3b37f581371c1fd70d2a56f5bd900ec7a8436e7e/api/src/main/java/org/apache/iceberg/FileFormat.java#L27-L30,
it looks like Avro is splittable). In `OrcFileAppender` and `ParquetWriter` we fetch the
split offsets from the ORC stripes and the Parquet footer, respectively.
Hence the difference in the assertions at the moment.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]