[jira] [Commented] (ARROW-1796) [Python] RowGroup filtering on file level
[ https://issues.apache.org/jira/browse/ARROW-1796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080594#comment-17080594 ]

Wes McKinney commented on ARROW-1796:
-------------------------------------

Let's close as soon as it's documented

> [Python] RowGroup filtering on file level
> -----------------------------------------
>
> Key: ARROW-1796
> URL: https://issues.apache.org/jira/browse/ARROW-1796
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Uwe Korn
> Assignee: Uwe Korn
> Priority: Major
> Labels: dataset, dataset-parquet-read, parquet, pull-request-available
> Time Spent: 1.5h
> Remaining Estimate: 0h
>
> We can build upon the API defined in {{fastparquet}} for defining RowGroup
> filters:
> https://github.com/dask/fastparquet/blob/master/fastparquet/api.py#L296-L300
> and translate them into the C++ enums we will define in
> https://issues.apache.org/jira/browse/PARQUET-1158 . This should enable us to
> provide the user with a simple predicate pushdown API that we can extend in
> the background from RowGroup to Page level later on.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-1796) [Python] RowGroup filtering on file level
[ https://issues.apache.org/jira/browse/ARROW-1796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030696#comment-17030696 ]

Joris Van den Bossche commented on ARROW-1796:
----------------------------------------------

I think we can close this issue, since this is now possible with the dataset API? (We can have a separate one about actually using this in the {{pyarrow.parquet.read_table}} filter argument.)

> [Python] RowGroup filtering on file level
> -----------------------------------------
>
> Key: ARROW-1796
> URL: https://issues.apache.org/jira/browse/ARROW-1796
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Uwe Korn
> Assignee: Uwe Korn
> Priority: Major
> Labels: parquet, pull-request-available
> Time Spent: 1.5h
> Remaining Estimate: 0h
[jira] [Commented] (ARROW-1796) [Python] RowGroup filtering on file level
[ https://issues.apache.org/jira/browse/ARROW-1796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16616315#comment-16616315 ]

Wes McKinney commented on ARROW-1796:
-------------------------------------

Since we're on a critical path to get 0.11 out in the next week or two, I'm moving this to 0.12

> [Python] RowGroup filtering on file level
> -----------------------------------------
>
> Key: ARROW-1796
> URL: https://issues.apache.org/jira/browse/ARROW-1796
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: Uwe L. Korn
> Assignee: Uwe L. Korn
> Priority: Major
> Fix For: 0.12.0
[jira] [Commented] (ARROW-1796) [Python] RowGroup filtering on file level
[ https://issues.apache.org/jira/browse/ARROW-1796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604756#comment-16604756 ]

Robert Gruener commented on ARROW-1796:
---------------------------------------

That sounds good to me. I would like to point out that it would be nice if it were possible to apply it at the ParquetDataset level as well, extending the filter parameter that already exists to handle both hive partitions and row group level filtering [https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L777]. It could do this by using the summary _metadata file or by reading all footers.

> [Python] RowGroup filtering on file level
> -----------------------------------------
>
> Key: ARROW-1796
> URL: https://issues.apache.org/jira/browse/ARROW-1796
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: Uwe L. Korn
> Assignee: Uwe L. Korn
> Priority: Major
> Fix For: 0.11.0
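The partition-level half of the suggestion above can be sketched in pure Python. This is only an illustration, not pyarrow code: `partition_keys` and `prune_files` are hypothetical helpers, the file paths are made up, and only `==` and `!=` are handled because partition values parsed from a path are plain strings. Predicates on columns that are not partition keys are kept conservatively, on the assumption that a later row-group pass deals with them.

```python
import operator

# Only equality-style operators: partition values parsed from paths are
# strings, so ordered comparisons would need type information we lack here.
_OPS = {"==": operator.eq, "!=": operator.ne}


def partition_keys(path):
    """Parse 'key=value' directory segments from a hive-style path."""
    keys = {}
    for segment in path.split("/")[:-1]:  # skip the file name itself
        if "=" in segment:
            key, _, value = segment.partition("=")
            keys[key] = value
    return keys


def prune_files(paths, filters):
    """Keep files whose partition keys may satisfy the DNF `filters`."""

    def may_match(keys, predicate):
        column, op, value = predicate
        op_fn = _OPS.get(op)
        if column not in keys or op_fn is None:
            return True  # not decidable from the path: keep the file
        return op_fn(keys[column], str(value))

    return [
        path
        for path in paths
        if any(
            all(may_match(partition_keys(path), pred) for pred in conjunction)
            for conjunction in filters
        )
    ]


paths = ["year=2017/part-0.parquet", "year=2018/part-0.parquet"]
kept = prune_files(paths, [[("year", "==", 2018), ("name", "==", "John")]])
```

Here the `year` predicate prunes the 2017 file from the path alone, while the `name` predicate is left undecided for every file.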
[jira] [Commented] (ARROW-1796) [Python] RowGroup filtering on file level
[ https://issues.apache.org/jira/browse/ARROW-1796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568182#comment-16568182 ]

Uwe L. Korn commented on ARROW-1796:
------------------------------------

As an interface, I would add a new kwarg to {{read_table}} called {{filters}} that accepts a list of lists of tuples. This will be in disjunctive normal form (DNF). The innermost triples consist of {{(column_name, operation, value(s))}}, e.g. {{('name', '==', 'John')}}. These innermost triples are collected into a list, and all predicates in this list are combined with {{AND}}. The outer list is then an {{OR}} combination of the {{AND}}-combined triples.

> [Python] RowGroup filtering on file level
> -----------------------------------------
>
> Key: ARROW-1796
> URL: https://issues.apache.org/jira/browse/ARROW-1796
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: Uwe L. Korn
> Assignee: Uwe L. Korn
> Priority: Major
> Fix For: 0.11.0
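The list-of-lists layout proposed in the comment above can be illustrated with a small, self-contained evaluator. This is a sketch of the DNF semantics only, not the pyarrow implementation: `row_matches` and the operator table are hypothetical names, and the set of supported operators is an assumption.

```python
import operator

# Hypothetical operator table; the operators actually accepted by a
# `filters` kwarg are not specified in the comment.
_OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "in": lambda value, candidates: value in candidates,
    "not in": lambda value, candidates: value not in candidates,
}


def row_matches(row, filters):
    """Return True if `row` (a dict) satisfies `filters` given in DNF.

    Inner lists are AND-combined triples; the outer list ORs them together.
    """
    return any(
        all(_OPS[op](row[column], value) for column, op, value in conjunction)
        for conjunction in filters
    )


# (name == 'John' AND age >= 30) OR (name in {'Jane', 'Joe'})
filters = [
    [("name", "==", "John"), ("age", ">=", 30)],
    [("name", "in", {"Jane", "Joe"})],
]
```

With these filters, a row `{"name": "John", "age": 35}` matches through the first inner list, and `{"name": "Jane", "age": 5}` matches through the second. In this sketch an empty outer list matches nothing, which is a consequence of `any([])` rather than a documented rule.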
[jira] [Commented] (ARROW-1796) [Python] RowGroup filtering on file level
[ https://issues.apache.org/jira/browse/ARROW-1796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568181#comment-16568181 ]

Uwe L. Korn commented on ARROW-1796:
------------------------------------

I would start by contributing a pure Python implementation that already implements all necessary filters, and then we can move the predicate evaluation either to Gandiva or to pre-compiled C++. The pure Python pass is much simpler as a first step and already provides a working interface at acceptable performance.

> [Python] RowGroup filtering on file level
> -----------------------------------------
>
> Key: ARROW-1796
> URL: https://issues.apache.org/jira/browse/ARROW-1796
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: Uwe L. Korn
> Assignee: Uwe L. Korn
> Priority: Major
> Fix For: 0.11.0
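A pure Python pass of the kind described above might look roughly like the following. It is a sketch under several assumptions, none of which come from the thread: each row group is represented as a dict mapping a column name to a `(min, max)` tuple, null values are ignored, and `may_match` and `select_row_groups` are hypothetical names. The check is conservative: a row group is skipped only when its statistics prove that no row can match.

```python
def may_match(stats, predicate):
    """Return False only if no row in the row group can satisfy `predicate`."""
    column, op, value = predicate
    if column not in stats:
        return True  # no statistics for this column: cannot prune
    lo, hi = stats[column]
    if op == "==":
        return lo <= value <= hi
    if op == "!=":
        return not (lo == hi == value)  # prune only if every row equals value
    if op == "<":
        return lo < value
    if op == "<=":
        return lo <= value
    if op == ">":
        return hi > value
    if op == ">=":
        return hi >= value
    if op == "in":
        return any(lo <= v <= hi for v in value)
    return True  # unknown operator: stay conservative


def select_row_groups(row_group_stats, filters):
    """Return indices of row groups that may contain rows matching `filters`."""
    return [
        index
        for index, stats in enumerate(row_group_stats)
        if any(
            all(may_match(stats, pred) for pred in conjunction)
            for conjunction in filters
        )
    ]


# Made-up (min, max) statistics for three row groups.
stats = [{"age": (0, 17)}, {"age": (18, 40)}, {"age": (41, 90)}]
selected = select_row_groups(stats, [[("age", ">=", 18), ("age", "<", 41)]])
```

The surviving row groups still have to be filtered row by row after reading, since min/max statistics can only rule row groups out, never confirm that every row matches.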
[jira] [Commented] (ARROW-1796) [Python] RowGroup filtering on file level
[ https://issues.apache.org/jira/browse/ARROW-1796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16527852#comment-16527852 ]

Wes McKinney commented on ARROW-1796:
-------------------------------------

If Gandiva becomes a part of Apache Arrow, then we should look at compiling filters and pushing them down into parquet-cpp

> [Python] RowGroup filtering on file level
> -----------------------------------------
>
> Key: ARROW-1796
> URL: https://issues.apache.org/jira/browse/ARROW-1796
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: Uwe L. Korn
> Assignee: Uwe L. Korn
> Priority: Major
> Fix For: 0.11.0