[ 
https://issues.apache.org/jira/browse/ARROW-10131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-10131.
-------------------------------------------
    Resolution: Fixed

Issue resolved by pull request 8507
[https://github.com/apache/arrow/pull/8507]

> [C++][Dataset] Lazily parse parquet metadata / statistics in 
> ParquetDatasetFactory and ParquetFileFragment
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-10131
>                 URL: https://issues.apache.org/jira/browse/ARROW-10131
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset, dataset-dask-integration, pull-request-available
>             Fix For: 3.0.0
>
>          Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Related to ARROW-9730, parsing of the statistics in parquet metadata is 
> expensive, and therefore should be avoided when possible.
> For example, the {{ParquetDatasetFactory}} ({{ds.parquet_dataset()}} in 
> python) parses all statistics of all files and all columns. In contrast, a 
> filtered read might only need the statistics of certain files (e.g. if a 
> filter on a partition field already excluded many files) and certain columns 
> (e.g. only the columns on which you are actually filtering).
> The current API is a bit all-or-nothing: both {{ParquetDatasetFactory}} and a 
> later {{EnsureCompleteMetadata}} parse all statistics, and neither allows 
> parsing only a subset, or parsing only the other (non-statistics) metadata. 
> So I think we should try to come up with better abstractions.
> cc [~rjzamora] [~bkietz]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
