Re: [PR] Count rows as a metadata only operation [iceberg-python]

via GitHub Fri, 29 Nov 2024 05:36:37 -0800


Fokko commented on code in PR #1388:
URL: https://github.com/apache/iceberg-python/pull/1388#discussion_r1863537090



##########
pyiceberg/table/__init__.py:
##########
@@ -1493,6 +1496,13 @@ def to_ray(self) -> ray.data.dataset.Dataset:
 
         return ray.data.from_arrow(self.to_arrow())
 
+    def count(self) -> int:
+        res = 0
+        tasks = self.plan_files()

Review Comment:
   Yes, this can be widely off, not just because the merge-on-read deletes, but 
because `plan_files` returns all the files that (might) contain relevant rows. 
For example, if it cannot be determined if has relevant data, it will be 
returned by `plan_files`.
   
   I think there are two ways forward:
   
   - One is similar to how we handle deletes. For deletes, we check if the 
whole file matches, if this is the case, then we can simply drop the file from 
the metadata. You can find the code 
[here](https://github.com/apache/iceberg-python/blob/5bef1bfe677df7016dc37b8db87650c34faceb7a/pyiceberg/table/update/snapshot.py#L359-L427).
 If a file fully matches, is is valid to use `task.file.record_count`. We would 
need to extend this to also see if there are also merge-on-read deletes as 
Kevin already mentioned, or just fail when there are positional deletes.
   - A cleaner option, but this is a bit more work, but pretty exciting, would 
be to include the `residual-predicate` in the `FileScanTask`. When we run a 
query, like `day_added = 2024-12-01 and user_id = 10`, then the `day_added = 
2024-12-01` might be satisfied with the partitioning already. This is the case 
when the table is partitioned by day, and we know that all the data in the file 
evaluates `true` for `day_added = 2024-12-01`, then we need to open the file, 
and filter for `user_id = 10`. If we would leave out the `user_id = 10`, then 
it would be `ALWAYS_TRUE`, and then we know that we can just use 
`task.file.record_count`. This way we could very easily loop over the 
`.plan_files()`:
   ```python
   def count(self) -> int:
       res = 0
       tasks = self.plan_files()
       for task in tasks:
           if task.residual == ALWAYS_TRUE and len(task.delete_files):
               res += task.file.record_count
               else:
               # pseudocode, open the table, and apply the filter and deletes
               res += len(_table_from_scan_task(task))
       return res
   ```
   
   To get to the second step, we first have to port the `ResidualEvaluator`. 
The java code can be found 
[here](https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/ResidualEvaluator.java),
 including some [excellent 
tests](https://github.com/apache/iceberg/blob/main/api/src/test/java/org/apache/iceberg/transforms/TestResiduals.java).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Count rows as a metadata only operation [iceberg-python]

Reply via email to