Hi, right now I am using something like this:

ArrowScanner.from_batches(pa_table.to_batches(), filter=my_expression).

I was wondering if there is a more efficient way to do this filtering if I
have to exclude some of the rows.

For now, I am changing my expression to something like my_expression &
pc.field('row_id').isin(row_ids).

This filter might actually be doing a lot of extra work to match the row
ids against the in clause. Is there some way to have to_batches exclude
the rows ahead of time, based on a boolean mask? Something like:

ArrowScanner.from_batches(pa_table.to_batches(my_mask),
filter=my_expression).


Thanks
