[ https://issues.apache.org/jira/browse/ARROW-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17660268#comment-17660268 ]
Rok Mihevc commented on ARROW-3245:
-----------------------------------
This issue has been migrated to [issue
#19587|https://github.com/apache/arrow/issues/19587] on GitHub. Please see the
[migration documentation|https://github.com/apache/arrow/issues/14542] for
further details.
> [Python] Infer index and/or filtering from parquet column statistics
> --------------------------------------------------------------------
>
> Key: ARROW-3245
> URL: https://issues.apache.org/jira/browse/ARROW-3245
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Martin Durant
> Priority: Major
> Labels: dataset, dataset-parquet-read, parquet
>
> The metadata included in Parquet files generally gives the min/max of the
> data for each chunk of each column. This allows whole chunks to be filtered
> out early if they cannot meet some criterion, which can greatly reduce the
> amount of data read in some circumstances. In Dask, we care about this for
> setting an index and its "divisions" (start/stop values for each data
> partition) and for leaving some chunks out of the graph of tasks to be
> processed altogether. Similarly, filtering may be applied to the values of
> fields defined by the directory partitioning.
>
> Currently, Dask using the fastparquet backend is able to infer possible
> columns to use as an index, perform filtering on that index, and do general
> filtering on any column that has statistical or partitioning information. It
> would be very helpful to have such facilities via pyarrow as well.
>
> This is probably the most important of the requests from Dask.
>
> (Please forgive that some of this has already been mentioned elsewhere; this
> is one of the entries in the list at
> [https://github.com/dask/fastparquet/issues/374] as a feature that is useful
> in fastparquet.)
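
For readers unfamiliar with the two uses of min/max statistics described in the issue, here is a minimal sketch in plain Python. It is not pyarrow, fastparquet, or Dask code: `ChunkStats`, `chunks_to_read`, and `infer_divisions` are hypothetical names invented for illustration, and the statistics are assumed to have already been read from the file metadata. The overlap test is deliberately strict, so a column counts as a candidate index only when consecutive chunk ranges are sorted and do not touch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ChunkStats:
    """Hypothetical stand-in for the min/max statistics of one column chunk."""
    chunk_id: int
    min_value: int
    max_value: int


def chunks_to_read(stats: List[ChunkStats], lo: int, hi: int) -> List[int]:
    """Return ids of chunks whose [min, max] range overlaps the closed range [lo, hi].

    Chunks that cannot contain a matching value are skipped without being read.
    """
    return [s.chunk_id for s in stats if s.max_value >= lo and s.min_value <= hi]


def infer_divisions(stats: List[ChunkStats]) -> Optional[List[int]]:
    """Return start/stop values per partition, or None if the column is unsuitable.

    The column is treated as a candidate index only when each chunk starts
    strictly after the previous chunk ends (an assumption made for this sketch).
    """
    if not stats:
        return None
    for prev, cur in zip(stats, stats[1:]):
        if cur.min_value <= prev.max_value:
            return None  # ranges overlap or are out of order
    # One start value per chunk, plus the end of the last chunk.
    return [s.min_value for s in stats] + [stats[-1].max_value]


stats = [ChunkStats(0, 0, 99), ChunkStats(1, 100, 199), ChunkStats(2, 200, 299)]
selected = chunks_to_read(stats, 150, 220)   # chunk 0 is never read
divisions = infer_divisions(stats)
```

With the sample statistics above, a filter on the range 150 to 220 only needs chunks 1 and 2, and the non-overlapping ranges make the column usable as an index with one division boundary per chunk.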