[ https://issues.apache.org/jira/browse/ARROW-10131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche resolved ARROW-10131.
-------------------------------------------
    Resolution: Fixed

Issue resolved by pull request 8507
[https://github.com/apache/arrow/pull/8507]

> [C++][Dataset] Lazily parse parquet metadata / statistics in ParquetDatasetFactory and ParquetFileFragment
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-10131
>                 URL: https://issues.apache.org/jira/browse/ARROW-10131
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset, dataset-dask-integration, pull-request-available
>             Fix For: 3.0.0
>
>          Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Related to ARROW-9730, parsing the statistics in parquet metadata is
> expensive and should therefore be avoided when possible.
> For example, {{ParquetDatasetFactory}} ({{ds.parquet_dataset()}} in
> Python) parses all statistics of all files and all columns, whereas a
> filtered read might only need the statistics of certain files (e.g. if a
> filter on a partition field has already excluded many files) and certain
> columns (e.g. only the columns on which you are actually filtering).
> The current API is a bit all-or-nothing (both ParquetDatasetFactory and a
> later EnsureCompleteMetadata parse all statistics, and neither allows
> parsing only a subset, or parsing only the other (non-statistics)
> metadata, ...), so I think we should try to think of better abstractions.
> cc [~rjzamora] [~bkietz]

--
This message was sent by Atlassian Jira
(v8.3.4#803005)