[jira] [Commented] (ARROW-8208) [PYTHON] Row Group Filtering With ParquetDataset

Joris Van den Bossche (Jira) Wed, 25 Mar 2020 12:26:19 -0700


    [ https://issues.apache.org/jira/browse/ARROW-8208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17067018#comment-17067018 ]


Joris Van den Bossche commented on ARROW-8208:
----------------------------------------------

[~cclienti] feedback on those new functionalities is very welcome!

Since this is already possible, and using it in ParquetDataset is covered by 
other issues, I'm going to close this one.

> [PYTHON] Row Group Filtering With ParquetDataset
> ------------------------------------------------
>
>                 Key: ARROW-8208
>                 URL: https://issues.apache.org/jira/browse/ARROW-8208
>             Project: Apache Arrow
>          Issue Type: New Feature
>            Reporter: Christophe Clienti
>            Priority: Major
>              Labels: dataset, dataset-parquet-read
>
> Hello,
> I tried to use the row_group filtering at the file level with an instance of 
> ParquetDataset without success.
> I've tested the workaround proposed here:
>  [https://github.com/pandas-dev/pandas/issues/26551#issuecomment-497039883]
> But I wonder whether it can work on a single file, since I get an exception 
> with the following code:
> {code:python}
> ParquetDataset('data.parquet',
>                filters=[('ticker', '=', 'AAPL')]).read().to_pandas()
> {code}
> {noformat}
> AttributeError: 'NoneType' object has no attribute 'filter_accepts_partition'
> {noformat}
> I read the documentation, and the filtering seems to work only on partitioned 
> datasets. I also found some information in the following JIRA ticket: 
> ARROW-1796
> So I'm not sure whether a ParquetDataset can use row_group statistics to 
> filter specific row groups in a file (whether or not it is part of a dataset).
> As mentioned in ARROW-1796, I tried with fastparquet, and after fixing a bug 
> (statistics.min instead of statistics.min_value), I was able to apply the 
> row_group filtering.
> Today, with pyarrow, I'm forced to filter the row_groups manually in each 
> file, which prevents me from using the ParquetDataset partition filtering 
> functionality.
> Row groups are really useful because they avoid filling the filesystem 
> with small files...
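
For illustration, the manual row-group filtering described in the report can be
sketched in plain Python. This is only a conceptual sketch, not pyarrow's or
fastparquet's implementation: the RowGroupStats class, the function name, and
the sample min/max values are hypothetical stand-ins for the per-row-group
column statistics that would normally be read from the Parquet file metadata.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for one column's statistics in one row group."""
    index: int
    min_value: str
    max_value: str


def row_groups_matching_equals(stats: List[RowGroupStats], value: str) -> List[int]:
    """Return indices of row groups whose [min, max] range may contain `value`.

    A row group is skipped only when the statistics prove that it cannot
    contain the value; selected row groups still need a row-level filter.
    """
    return [s.index for s in stats if s.min_value <= value <= s.max_value]


# Made-up statistics for a 'ticker' column spread over three row groups.
stats = [
    RowGroupStats(0, "AAPL", "AMZN"),
    RowGroupStats(1, "GOOG", "MSFT"),
    RowGroupStats(2, "AAAA", "ZZZZ"),
]

selected = row_groups_matching_equals(stats, "AAPL")
print(selected)
```

Only the selected row groups would then be read and filtered row by row, which
is why a single file with several row groups can be pruned without splitting
it into many small files.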



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
