[ 
https://issues.apache.org/jira/browse/ARROW-16616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557549#comment-17557549
 ] 

Neal Richardson commented on ARROW-16616:
-----------------------------------------

In the R package we handle this by building a separate query object that 
contains the Dataset (or RecordBatchReader, or Table, or whatever), and the 
query object provides the filtering (and similar) methods. So here, 
Dataset.filter would first create the query object from the Dataset and then 
call the filter method on it. 
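
As a rough illustration of that pattern, here is a minimal pure-Python sketch. 
The names Query, collect and the in-memory Dataset stand-in are hypothetical 
and are not the pyarrow or R arrow API; the point is only that Dataset.filter 
builds a query object and delegates to it, and nothing is evaluated until the 
result is requested.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple

Row = dict
Predicate = Callable[[Row], bool]


@dataclass(frozen=True)
class Query:
    """Hypothetical query object: wraps a source and records pending filters."""

    source: Iterable[Row]
    predicates: Tuple[Predicate, ...] = ()

    def filter(self, predicate: Predicate) -> "Query":
        # Return a new Query with the predicate recorded; no rows are read yet.
        return Query(self.source, self.predicates + (predicate,))

    def collect(self) -> list:
        # Evaluation happens only here, applying every recorded predicate.
        return [
            row for row in self.source
            if all(pred(row) for pred in self.predicates)
        ]


class Dataset:
    """Stand-in for a real dataset; keeps rows in memory for this sketch."""

    def __init__(self, rows: Iterable[Row]):
        self._rows = list(rows)

    def __iter__(self):
        return iter(self._rows)

    def filter(self, predicate: Predicate) -> Query:
        # Create the query object from the Dataset, then call filter on it.
        return Query(self).filter(predicate)


dataset = Dataset([{"x": 1}, {"x": 5}, {"x": 10}])
query = dataset.filter(lambda r: r["x"] > 1).filter(lambda r: r["x"] < 10)
result = query.collect()
```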

That said, going that route (or the FilteredDataset path, for that matter) 
feels like a path to essentially creating ibis in pyarrow. I'd caution you to 
think twice before adding APIs like this since they become sticky and hard to 
remove later (look at how long ParquetDataset has lived past the creation of 
the general Dataset class).

> [Python] Allow lazy evaluation of filters in Dataset and add Dataset.filter 
> method
> ---------------------------------------------------------------------------------
>
>                 Key: ARROW-16616
>                 URL: https://issues.apache.org/jira/browse/ARROW-16616
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: Python
>            Reporter: Alessandro Molina
>            Assignee: Alessandro Molina
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 9.0.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> To keep the {{Dataset}} api compatible with the {{Table}} one in terms of 
> analytics capabilities, we should add a {{Dataset.filter}} method. The 
> initial POC was based on {{_table_filter}} but that required materialising 
> all the {{Dataset}} content after filtering as it returned an 
> {{InMemoryDataset}}. 
> Given that {{Scanner}} can filter a dataset without actually materialising 
> the data until a final step happens, it would be good to have 
> {{Dataset.filter}} return some form of lazy dataset where the filter is only 
> stored aside and the Scanner is created when data is actually retrieved.
> PS: Also update {{test_dataset_filter}} test to use the {{Dataset.filter}} 
> method



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
