Fokko commented on code in PR #1026:
URL: https://github.com/apache/iceberg-python/pull/1026#discussion_r1711043760
##########
pyiceberg/io/pyarrow.py:
##########
@@ -1249,11 +1251,12 @@ def _task_to_record_batches(
# https://github.com/apache/arrow/issues/39220
arrow_table = pa.Table.from_batches([batch])
arrow_table = arrow_table.filter(pyarrow_filter)
+ if len(arrow_table) == 0:
+ continue
batch = arrow_table.to_batches()[0]
yield _to_requested_schema(
projected_schema, file_project_schema, batch,
downcast_ns_timestamp_to_us=True, use_large_types=use_large_types
)
- current_index += len(batch)
Review Comment:
Oof, that's a good find. Thanks @vhnguyenae for reporting this!
The order of applying filters also caught me out when implementing positional
deletes. In the long run, I think it would be good to push this down to Arrow.
I created an issue a while ago: https://github.com/apache/arrow/issues/35301
but that hasn't seen much traction.
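
To illustrate why the ordering matters, here is a minimal sketch in plain Python (lists of dicts standing in for Arrow record batches). It is not the pyiceberg implementation; `scan_batches`, `deleted_positions` and `row_filter` are hypothetical names used only for illustration. The point it tries to show: the running row index has to advance by the unfiltered batch length, positional deletes have to be applied against those file-level positions before the row filter, and a batch that ends up empty has to be skipped.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Set

Row = Dict[str, int]


def scan_batches(
    batches: Iterable[List[Row]],
    deleted_positions: Set[int],
    row_filter: Callable[[Row], bool],
) -> Iterator[List[Row]]:
    """Yield the surviving rows of each batch (hypothetical helper)."""
    current_index = 0
    for batch in batches:
        start = current_index
        # Advance by the ORIGINAL batch length, before any filtering,
        # so positions stay aligned with row positions in the file.
        current_index += len(batch)

        # 1. Positional deletes, using file-level positions.
        kept = [
            row
            for offset, row in enumerate(batch)
            if start + offset not in deleted_positions
        ]
        # 2. Row filter on what is left.
        kept = [row for row in kept if row_filter(row)]

        # 3. Nothing survived: skip instead of indexing into an empty result.
        if not kept:
            continue
        yield kept


# Three batches of two rows each; the row at file position 3 is deleted.
batches = [
    [{"id": 0}, {"id": 1}],
    [{"id": 2}, {"id": 3}],
    [{"id": 4}, {"id": 5}],
]
result = list(scan_batches(batches, {3}, lambda row: row["id"] >= 2))
```

In this sketch the first batch is filtered away entirely and skipped. If the index were instead advanced by the filtered length after the filter ran, the second batch would be treated as starting at position 0, and the delete at position 3 would land on the wrong row.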