[ https://issues.apache.org/jira/browse/ARROW-16616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557549#comment-17557549 ]
Neal Richardson commented on ARROW-16616:
-----------------------------------------

In the R package we handle this by building a separate query object that contains the Dataset (or RecordBatchReader, or Table, or whatever), and the query object provides the filtering etc. methods. So here, Dataset.filter would first create the query object from the Dataset and then call the filter method on it.

That said, going that route (or the FilteredDataset path, for that matter) feels like a path to essentially recreating ibis in pyarrow. I'd caution you to think twice before adding APIs like this, since they become sticky and hard to remove later (look at how long ParquetDataset has lived past the creation of the general Dataset class).

> [Python] Allow lazy evaluation of filters in Dataset and add Dataset.filter
> method
> ---------------------------------------------------------------------------------
>
>                 Key: ARROW-16616
>                 URL: https://issues.apache.org/jira/browse/ARROW-16616
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: Python
>            Reporter: Alessandro Molina
>            Assignee: Alessandro Molina
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 9.0.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> To keep the {{Dataset}} API compatible with the {{Table}} one in terms of
> analytics capabilities, we should add a {{Dataset.filter}} method. The
> initial POC was based on {{_table_filter}}, but that required materialising
> all the {{Dataset}} content after filtering, as it returned an
> {{InMemoryDataset}}.
> Given that {{Scanner}} can filter a dataset without actually materialising
> the data until a final step happens, it would be good to have
> {{Dataset.filter}} return some form of lazy dataset where the filter is only
> stored aside and the Scanner is created when data is actually retrieved.
> PS: Also update the {{test_dataset_filter}} test to use the {{Dataset.filter}}
> method