[jira] [Commented] (ARROW-1796) [Python] RowGroup filtering on file level

2020-04-10 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080594#comment-17080594
 ] 

Wes McKinney commented on ARROW-1796:
-

Let's close this as soon as it's documented.

> [Python] RowGroup filtering on file level
> -
>
> Key: ARROW-1796
> URL: https://issues.apache.org/jira/browse/ARROW-1796
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Uwe Korn
> Assignee: Uwe Korn
> Priority: Major
> Labels: dataset, dataset-parquet-read, parquet, pull-request-available
> Time Spent: 1.5h
> Remaining Estimate: 0h
>
> We can build upon the API defined in {{fastparquet}} for defining RowGroup 
> filters: 
> https://github.com/dask/fastparquet/blob/master/fastparquet/api.py#L296-L300 
> and translate them into the C++ enums we will define in 
> https://issues.apache.org/jira/browse/PARQUET-1158 . This should enable us to 
> provide the user with a simple predicate pushdown API that we can extend in 
> the background from RowGroup to Page level later on.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1796) [Python] RowGroup filtering on file level

2020-02-05 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030696#comment-17030696
 ] 

Joris Van den Bossche commented on ARROW-1796:
--

I think we can close this issue, since this is now possible with the dataset 
API? 

(We can open a separate issue about actually using this in the 
{{pyarrow.parquet.read_table}} filter argument.)

> [Python] RowGroup filtering on file level
> -
>
> Key: ARROW-1796
> URL: https://issues.apache.org/jira/browse/ARROW-1796
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Uwe Korn
> Assignee: Uwe Korn
> Priority: Major
> Labels: parquet, pull-request-available
> Time Spent: 1.5h
> Remaining Estimate: 0h


[jira] [Commented] (ARROW-1796) [Python] RowGroup filtering on file level

2018-09-15 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16616315#comment-16616315
 ] 

Wes McKinney commented on ARROW-1796:
-

Since we're on a critical path to get 0.11 out in the next week or two, I'm 
moving this to 0.12.

> [Python] RowGroup filtering on file level
> -
>
> Key: ARROW-1796
> URL: https://issues.apache.org/jira/browse/ARROW-1796
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: Uwe L. Korn
> Assignee: Uwe L. Korn
> Priority: Major
> Fix For: 0.12.0


[jira] [Commented] (ARROW-1796) [Python] RowGroup filtering on file level

2018-09-05 Thread Robert Gruener (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604756#comment-16604756
 ] 

Robert Gruener commented on ARROW-1796:
---

That sounds good to me. It would also be nice if this could be applied at the 
ParquetDataset level, extending the filter parameter that already exists so 
that it handles both hive partitions and row-group-level filtering: 
[https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L777] 
It could do this by using the summary _metadata file or by reading all footers.

> [Python] RowGroup filtering on file level
> -
>
> Key: ARROW-1796
> URL: https://issues.apache.org/jira/browse/ARROW-1796
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: Uwe L. Korn
> Assignee: Uwe L. Korn
> Priority: Major
> Fix For: 0.11.0


[jira] [Commented] (ARROW-1796) [Python] RowGroup filtering on file level

2018-08-03 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568182#comment-16568182
 ] 

Uwe L. Korn commented on ARROW-1796:


As an interface, I would add a new kwarg to {{read_table}} called {{filters}} 
that accepts a list of lists of tuples, representing a predicate in 
disjunctive normal form. The innermost triples consist of {{(column_name, 
operation, value(s))}}, e.g. {{('name', '==', 'John')}}. These triples are 
collected into a list, and all predicates in that list are combined with 
{{AND}}. The outer list is then an {{OR}} combination of the {{AND}}-combined 
lists.
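
To make the proposed structure concrete, here is a minimal pure Python sketch of how such a list of lists of triples could be evaluated against a single row. It is only an illustration: the {{matches}} helper, the operator table, and the sample rows are hypothetical and are not part of pyarrow; only the {{==}} operation appears in the comment above, the other operations are assumptions.

```python
import operator

# Hypothetical operator table; only '==' is named in the proposal itself.
_OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "in": lambda value, candidates: value in candidates,
    "not in": lambda value, candidates: value not in candidates,
}


def matches(row, filters):
    """Return True if `row` (a dict) satisfies `filters` given in DNF.

    `filters` is a list of lists of (column_name, operation, value) triples.
    Triples inside one inner list are AND-ed; the inner lists are OR-ed.
    """
    return any(
        all(_OPS[op](row[column], value) for column, op, value in conjunction)
        for conjunction in filters
    )


# (name == 'John' AND age >= 30) OR (name in {'Jane', 'Joan'})
filters = [
    [("name", "==", "John"), ("age", ">=", 30)],
    [("name", "in", {"Jane", "Joan"})],
]
rows = [
    {"name": "John", "age": 42},
    {"name": "John", "age": 20},
    {"name": "Jane", "age": 25},
]
selected = [row for row in rows if matches(row, filters)]
```

With this sketch an empty outer list matches nothing, because an {{OR}} over zero alternatives is false; a real API might instead treat "no filters" as "keep everything".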

> [Python] RowGroup filtering on file level
> -
>
> Key: ARROW-1796
> URL: https://issues.apache.org/jira/browse/ARROW-1796
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: Uwe L. Korn
> Assignee: Uwe L. Korn
> Priority: Major
> Fix For: 0.11.0


[jira] [Commented] (ARROW-1796) [Python] RowGroup filtering on file level

2018-08-03 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568181#comment-16568181
 ] 

Uwe L. Korn commented on ARROW-1796:


I would start by contributing a pure Python implementation that already 
implements all necessary filters; afterwards we can move the predicate 
evaluation either to Gandiva or to pre-compiled C++. The pure Python pass is 
much simpler as a first step and already provides a working interface at 
acceptable performance.
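
As a rough illustration of what such a pure Python pass might look like, the sketch below prunes row groups using per-column minimum/maximum statistics. Everything here is an assumption made for the example: the {{may_match}} and {{select_row_groups}} names, the list-of-lists-of-triples filter shape, and the layout of the statistics (a dict mapping column name to a {{(min, max)}} pair per row group). It is not the implementation that was contributed. The check is conservative: a row group is skipped only when its statistics prove that no row can match, so surviving row groups would still need row-level filtering.

```python
def may_match(stats, column, op, value):
    """Conservatively test one triple against (min, max) statistics.

    Returns False only when the statistics prove no row can satisfy it.
    """
    if column not in stats:
        return True  # no statistics: cannot rule the row group out
    lo, hi = stats[column]
    if op == "==":
        return lo <= value <= hi
    if op == "!=":
        # Only excluded when every value in the row group equals `value`.
        return not (lo == hi == value)
    if op == "<":
        return lo < value
    if op == "<=":
        return lo <= value
    if op == ">":
        return hi > value
    if op == ">=":
        return hi >= value
    if op == "in":
        return any(lo <= candidate <= hi for candidate in value)
    return True  # unknown operation: keep the row group to stay safe


def select_row_groups(row_group_stats, filters):
    """Return indices of row groups that may contain matching rows.

    `filters` is a list (OR) of lists (AND) of (column, op, value) triples.
    """
    return [
        index
        for index, stats in enumerate(row_group_stats)
        if any(
            all(may_match(stats, column, op, value)
                for column, op, value in conjunction)
            for conjunction in filters
        )
    ]


# Hypothetical statistics for three row groups of an 'age' column.
row_group_stats = [{"age": (18, 29)}, {"age": (30, 45)}, {"age": (46, 80)}]
filters = [[("age", ">=", 30), ("age", "<", 40)], [("age", "==", 80)]]
kept = select_row_groups(row_group_stats, filters)
```

Only the row groups listed in {{kept}} would then need to be read and decoded; the first row group is skipped without touching its data pages.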

> [Python] RowGroup filtering on file level
> -
>
> Key: ARROW-1796
> URL: https://issues.apache.org/jira/browse/ARROW-1796
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: Uwe L. Korn
> Assignee: Uwe L. Korn
> Priority: Major
> Fix For: 0.11.0


[jira] [Commented] (ARROW-1796) [Python] RowGroup filtering on file level

2018-06-29 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16527852#comment-16527852
 ] 

Wes McKinney commented on ARROW-1796:
-

If Gandiva becomes a part of Apache Arrow, then we should look at compiling 
filters and pushing them down into parquet-cpp.

> [Python] RowGroup filtering on file level
> -
>
> Key: ARROW-1796
> URL: https://issues.apache.org/jira/browse/ARROW-1796
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: Uwe L. Korn
> Assignee: Uwe L. Korn
> Priority: Major
> Fix For: 0.11.0