[ https://issues.apache.org/jira/browse/ARROW-8208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche reopened ARROW-8208:
------------------------------------------

> [PYTHON] Row Group Filtering With ParquetDataset
> ------------------------------------------------
>
>                 Key: ARROW-8208
>                 URL: https://issues.apache.org/jira/browse/ARROW-8208
>             Project: Apache Arrow
>          Issue Type: New Feature
>            Reporter: Christophe Clienti
>            Priority: Major
>              Labels: dataset, dataset-parquet-read
>
> Hello,
> I tried to use row-group filtering at the file level with an instance of
> ParquetDataset, without success.
> I've tested the workaround proposed here:
> [https://github.com/pandas-dev/pandas/issues/26551#issuecomment-497039883]
> But I wonder whether it can work on a single file, as I get an exception
> with the following code:
> {code:python}
> from pyarrow.parquet import ParquetDataset
>
> ParquetDataset('data.parquet',
>                filters=[('ticker', '=', 'AAPL')]).read().to_pandas()
> {code}
> {noformat}
> AttributeError: 'NoneType' object has no attribute 'filter_accepts_partition'
> {noformat}
> I read the documentation, and the filtering seems to work only on
> partitioned datasets. I also read some information in the following JIRA
> ticket: ARROW-1796
> So I'm not sure whether a ParquetDataset can use row-group statistics to
> filter specific row groups in a file (in a dataset or not).
> As mentioned in ARROW-1796, I tried with fastparquet, and after fixing a bug
> (statistics.min instead of statistics.min_value), I was able to apply the
> row-group filtering.
> Today, with pyarrow, I'm forced to filter the row groups manually in each
> file, which prevents me from using the ParquetDataset partition filtering
> functionality.
> Row groups are really useful because they avoid filling the filesystem
> with small files...

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
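
For illustration, the manual per-file row-group filtering that the report describes could look roughly like the sketch below. It is not taken from the issue: the helper names `row_group_may_match` and `read_matching_row_groups` are hypothetical, and it assumes a flat schema whose column chunks carry min/max statistics. It uses `pyarrow.parquet.ParquetFile` and its row-group metadata to skip row groups whose statistics rule out a match.

```python
def row_group_may_match(stat_min, stat_max, op, value):
    """Return False only when min/max statistics prove that no row matches."""
    if op in ("=", "=="):
        return stat_min <= value <= stat_max
    if op == "<":
        return stat_min < value
    if op == "<=":
        return stat_min <= value
    if op == ">":
        return stat_max > value
    if op == ">=":
        return stat_max >= value
    raise ValueError(f"unsupported operator: {op!r}")


def read_matching_row_groups(path, column, op, value):
    """Read only the row groups of one file that may contain matching rows.

    Hypothetical helper, not a pyarrow API. `column` is assumed to be a
    top-level column, so its name equals the chunk's path in the schema.
    """
    # Imported lazily so the pure helper above works without pyarrow.
    import pyarrow.parquet as pq

    pf = pq.ParquetFile(path)
    meta = pf.metadata
    keep = []
    for i in range(meta.num_row_groups):
        rg = meta.row_group(i)
        stats = None
        for j in range(rg.num_columns):
            chunk = rg.column(j)
            if chunk.path_in_schema == column:
                stats = chunk.statistics
                break
        # Without usable statistics the row group cannot be ruled out.
        # Note: depending on the pyarrow version and column type,
        # stats.min / stats.max may be bytes rather than str.
        if (stats is None or not stats.has_min_max
                or row_group_may_match(stats.min, stats.max, op, value)):
            keep.append(i)

    if not keep:
        return pf.schema_arrow.empty_table()
    return pf.read_row_groups(keep)
```

With the file from the report, this would be called as `read_matching_row_groups('data.parquet', 'ticker', '=', 'AAPL').to_pandas()`. Statistics only prune whole row groups, so the rows that survive still need an exact row-level filter afterwards.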