[ https://issues.apache.org/jira/browse/ARROW-16616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557599#comment-17557599 ]
Alessandro Molina edited comment on ARROW-16616 at 6/22/22 5:52 PM:
--------------------------------------------------------------------

I don't see much of a conflict with Ibis in the long term. Ibis is an interface to multiple query engines, and the queries you build there can be run against different targets. It serves the purpose of providing a single interface to an infrastructure-agnostic environment: write your Ibis code locally and deploy it against a production system that might run something very different from in-memory Acero.

PyArrow instead only exposes the features that are already available in Arrow itself, which is something all the other bindings are doing too (R and Java do expose access to the compute engine). What you write in PyArrow won't be able to grow much in the direction of scaling, so it's better positioned for quick data discovery without external dependencies than for the actual final product, which will probably want to be based on Ibis.

> [Python] Allow lazy evaluation of filters in Dataset and add Dataset.filter method
> ---------------------------------------------------------------------------------
>
>                 Key: ARROW-16616
>                 URL: https://issues.apache.org/jira/browse/ARROW-16616
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: Python
>            Reporter: Alessandro Molina
>            Assignee: Alessandro Molina
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 9.0.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> To keep the {{Dataset}} API compatible with the {{Table}} one in terms of
> analytics capabilities, we should add a {{Dataset.filter}} method. The
> initial POC was based on {{_table_filter}}, but that required materialising
> all the {{Dataset}} content after filtering, as it returned an
> {{InMemoryDataset}}.
> Given that {{Scanner}} can filter a dataset without actually materialising
> the data until a final step happens, it would be good to have
> {{Dataset.filter}} return some form of lazy dataset in which the filter is
> only stored aside and the {{Scanner}} is created when data is actually
> retrieved.
> PS: Also update the {{test_dataset_filter}} test to use the
> {{Dataset.filter}} method.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
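
To make the "filter stored aside, scanner created on retrieval" idea from the issue description concrete, here is a minimal pure-Python sketch. It is not PyArrow code and not the implementation proposed in the pull request: LazyDataset, its source callable, scan() and to_rows() are hypothetical names invented for illustration, and rows are plain dicts rather than Arrow record batches.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Optional

Row = dict
Predicate = Callable[[Row], bool]


@dataclass(frozen=True)
class LazyDataset:
    """Hypothetical stand-in for a dataset; NOT a PyArrow class."""

    source: Callable[[], Iterable[Row]]    # called only when data is retrieved
    predicate: Optional[Predicate] = None  # filter stored aside, not yet applied

    def filter(self, predicate: Predicate) -> "LazyDataset":
        # Only record the filter (AND-ed with any earlier one); no data is read.
        previous = self.predicate
        if previous is None:
            combined = predicate
        else:
            combined = lambda row: previous(row) and predicate(row)
        return LazyDataset(self.source, combined)

    def scan(self) -> Iterator[Row]:
        # The "scanner" step: the source is read and the stored filter applied
        # row by row, so the unfiltered data is never collected in full.
        for row in self.source():
            if self.predicate is None or self.predicate(row):
                yield row

    def to_rows(self) -> List[Row]:
        # Final step: materialise only the rows that passed the filter.
        return list(self.scan())


reads = []  # records each time the source is actually read


def read_rows() -> List[Row]:
    reads.append(1)
    return [{"x": 1}, {"x": 2}, {"x": 3}, {"x": 4}]


dataset = LazyDataset(read_rows)
# Chaining filters only builds up the stored predicate.
filtered = dataset.filter(lambda r: r["x"] > 1).filter(lambda r: r["x"] % 2 == 0)
result = filtered.to_rows()  # the source is read here, once
```

The point of the design is that filter() returns another lazy object instead of an in-memory copy, so the cost of reading is paid only at the final retrieval step.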