rbalamohan commented on code in PR #6432:
URL: https://github.com/apache/iceberg/pull/6432#discussion_r1051680991
##########
core/src/main/java/org/apache/iceberg/deletes/Deletes.java:
##########
@@ -144,7 +146,18 @@ public static <T extends StructLike> PositionDeleteIndex toPositionIndex(
             deletes ->
                 CloseableIterable.transform(
                     locationFilter.filter(deletes), row -> (Long) POSITION_ACCESSOR.get(row)));
-    return toPositionIndex(CloseableIterable.concat(positions));
+    return toPositionIndex(positions);
+  }
+
+  public static PositionDeleteIndex toPositionIndex(List<CloseableIterable<Long>> positions) {
Review Comment:
Thanks @rdblue. Yes, this happens when more than one positional delete file
qualifies for the same data file. For example, assume a trickle-feed job
ingests data into a partition. Because of late-arriving data, another job
updates certain rows in that partition and writes positional delete (POS)
files. When update jobs run with different criteria, the same data file can
qualify again and pick up additional POS files. So during scanning, a single
data file may have to read several POS files (e.g. 4 of them), which slows the
scan down. ParallelIterable helps in this case by reading them concurrently.
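For context, here is a minimal sketch of how such an overload could use
ParallelIterable, assuming Iceberg's existing `ParallelIterable` and
`ThreadPools` utilities and the public single-iterable
`Deletes.toPositionIndex(CloseableIterable<Long>)` overload. The wrapper
class name, the choice of worker pool, and the size-based fallback are
illustrative, not necessarily the exact diff in this PR:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;

import org.apache.iceberg.deletes.Deletes;
import org.apache.iceberg.deletes.PositionDeleteIndex;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.util.ParallelIterable;
import org.apache.iceberg.util.ThreadPools;

// Hypothetical helper, for illustration only.
public class ParallelPositionIndexSketch {

  // Build one PositionDeleteIndex from several position-delete streams.
  // With more than one stream, drain them concurrently on a shared worker
  // pool; with a single stream, a plain concat avoids the pool overhead.
  static PositionDeleteIndex toPositionIndex(List<CloseableIterable<Long>> positions) {
    if (positions.size() > 1) {
      ExecutorService workerPool = ThreadPools.getWorkerPool(); // assumed pool choice
      // ParallelIterable implements CloseableIterable and reads each
      // underlying iterable on the pool's threads, so the 4-POS-file
      // case described above is no longer scanned serially.
      return Deletes.toPositionIndex(new ParallelIterable<>(positions, workerPool));
    }

    return Deletes.toPositionIndex(CloseableIterable.concat(positions));
  }
}
```

Routing only the multi-file case through the pool keeps the common
single-POS-file path free of any thread-pool overhead.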