[ https://issues.apache.org/jira/browse/ARROW-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wes McKinney updated ARROW-3245:
--------------------------------
    Fix Version/s: 0.14.0

> [Python] Infer index and/or filtering from parquet column statistics
> --------------------------------------------------------------------
>
>                 Key: ARROW-3245
>                 URL: https://issues.apache.org/jira/browse/ARROW-3245
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Martin Durant
>            Priority: Major
>              Labels: parquet
>             Fix For: 0.14.0
>
>
> The metadata included in parquet generally gives the min/max of the data for
> each chunk of each column. This allows whole chunks to be filtered out early
> if they cannot meet some criterion, which can greatly reduce the reading
> burden in some circumstances. In Dask, we care about this for setting an
> index and its "divisions" (start/stop values for each data partition), and
> for directly excluding some chunks from the graph of tasks to be processed.
> Similarly, filtering may be applied to the values of fields defined by the
> directory partitioning.
>
> Currently, Dask, using the fastparquet backend, is able to infer possible
> columns to use as an index, perform filtering on that index, and do general
> filtering on any column that has statistical or partitioning information. It
> would be very helpful to have such facilities via pyarrow also.
>
> This is probably the most important of the requests from Dask.
>
> (Please forgive that some of this has already been mentioned elsewhere; this
> is one of the entries in the list at
> [https://github.com/dask/fastparquet/issues/374] as a feature that is useful
> in fastparquet.)

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)