Fokko commented on code in PR #1388:
URL: https://github.com/apache/iceberg-python/pull/1388#discussion_r1863537090
##########
pyiceberg/table/__init__.py:
##########
@@ -1493,6 +1496,13 @@ def to_ray(self) -> ray.data.dataset.Dataset:
return ray.data.from_arrow(self.to_arrow())
+ def count(self) -> int:
+ res = 0
+ tasks = self.plan_files()
Review Comment:
Yes, this can be widely off, not just because the merge-on-read deletes, but
because `plan_files` returns all the files that (might) contain relevant rows.
For example, if it cannot be determined if has relevant data, it will be
returned by `plan_files`.
I think there are two ways forward:
- One is similar to how we handle deletes. For deletes, we check if the
whole file matches, if this is the case, then we can simply drop the file from
the metadata. You can find the code
[here](https://github.com/apache/iceberg-python/blob/5bef1bfe677df7016dc37b8db87650c34faceb7a/pyiceberg/table/update/snapshot.py#L359-L427).
If a file fully matches, is is valid to use `task.file.record_count`. We would
need to extend this to also see if there are also merge-on-read deletes as
Kevin already mentioned, or just fail when there are positional deletes.
- A cleaner option, but this is a bit more work, but pretty exciting, would
be to include the `residual-predicate` in the `FileScanTask`. When we run a
query, like `day_added = 2024-12-01 and user_id = 10`, then the `day_added =
2024-12-01` might be satisfied with the partitioning already. This is the case
when the table is partitioned by day, and we know that all the data in the file
evaluates `true` for `day_added = 2024-12-01`, then we need to open the file,
and filter for `user_id = 10`. If we would leave out the `user_id = 10`, then
it would be `ALWAYS_TRUE`, and then we know that we can just use
`task.file.record_count`. This way we could very easily loop over the
`.plan_files()`:
```python
def count(self) -> int:
res = 0
tasks = self.plan_files()
for task in tasks:
if task.residual == ALWAYS_TRUE and len(task.delete_files):
res += task.file.record_count
else:
# pseudocode, open the table, and apply the filter and deletes
res += len(_table_from_scan_task(task))
return res
```
To get to the second step, we first have to port the `ResidualEvaluator`.
The java code can be found
[here](https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/ResidualEvaluator.java),
including some [excellent
tests](https://github.com/apache/iceberg/blob/main/api/src/test/java/org/apache/iceberg/transforms/TestResiduals.java).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]