wypoon commented on PR #5742:
URL: https://github.com/apache/iceberg/pull/5742#issuecomment-1260027746
The code paths that goes through the streaming filters --
`Deletes.streamingMarker` and `Deletes.streamingFilter` --
```
return hasIsDeletedColumn
? Deletes.streamingMarker(
records, this::pos, Deletes.deletePositions(filePath, deletes),
this::markRowDeleted)
: Deletes.streamingFilter(
records, this::pos, Deletes.deletePositions(filePath, deletes),
counter);
```
are called in `DeleteFilter#applyPosDeletes`, which is called by
`DeleteFilter#filter`.
In the two streaming filters, there is a counter with state.
`this::markRowDeleted` ends up being
`BaseReader.SparkDeleteFilter#markRowDeleted`, which uses the counter in the
reader. In the other one, the counter is again the counter from the reader
which is passed into the constructor of the `DeleteFilter`.
In addition, `BaseReader.SparkDeleteFilter#markRowDeleted` has to account
for additional state:
```
if (!row.getBoolean(columnIsDeletedPosition())) {
row.setBoolean(columnIsDeletedPosition(), true);
counter().increment();
}
```
In order to avoid double-counting, it checks if the row has already been
marked deleted. This happens to come into play in `Deletes.markDeleted` which
is called by (among others) `DeleteFilter#applyEqDeletes(CloseableIterable<T>)`
via `DeleteFilter#createDeleteIterable`. In this case, the applyEqDeletes is
applied after the applyPosDeletes, but the point is that there is a lot of
interaction of different parts of **stateful** code. Unless we are very clear
that we know exactly what to test for, it is not obvious that just testing a
call to `Deletes.streamingMarker` or a call to `Deletes.streamingFilter` has
exercised the logic correctly.
That is why I think testing at the level of `TestSparkReaderDeletes` makes
it obvious what we've exercised.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]