[ 
https://issues.apache.org/jira/browse/ARROW-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157240#comment-17157240
 ] 

Joris Van den Bossche commented on ARROW-9459:
----------------------------------------------

One question, if we do this, is where the option should live, i.e. on which 
object. Since we collect statistics in multiple places (during discovery in 
ParquetDatasetFactory, but also in SplitByRowGroups / EnsureMetadata of the 
materialized Fragments, and, I think, when actually reading as well), it could 
be a reader option of the format (ParquetFileFormat)?
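
To illustrate why a format-level option could cover all of those places, here is a rough toy model in Python. It is only a sketch, not Arrow code: `ToyReaderOptions`, `ToyParquetFormat`, `parse_statistics`, `discover` and `ensure_metadata` are hypothetical names invented for this example, and the row-group metadata is faked with plain dicts. The point is just that every code path that loads row-group metadata consults the same flag on the format.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ToyReaderOptions:
    # Hypothetical flag for illustration; not an existing Arrow option.
    parse_statistics: bool = True


@dataclass
class ToyRowGroup:
    num_rows: int
    # Stays None when statistics parsing is disabled.
    statistics: Optional[dict] = None


@dataclass
class ToyParquetFormat:
    """Toy stand-in for a file format that owns the reader options."""

    reader_options: ToyReaderOptions = field(default_factory=ToyReaderOptions)
    parse_calls: int = 0  # counts how often statistics were parsed

    def _load_row_group(self, raw: dict) -> ToyRowGroup:
        row_group = ToyRowGroup(num_rows=raw["num_rows"])
        # Single place where the option is checked, shared by all callers.
        if self.reader_options.parse_statistics:
            self.parse_calls += 1
            row_group.statistics = dict(raw["statistics"])
        return row_group

    def discover(self, raw_metadata: list) -> list:
        # Stand-in for discovery from a centralized _metadata file.
        return [self._load_row_group(raw) for raw in raw_metadata]

    def ensure_metadata(self, raw_metadata: list) -> list:
        # Stand-in for loading metadata later, on a materialized fragment.
        return [self._load_row_group(raw) for raw in raw_metadata]


# Fake metadata for two row groups, each with min/max statistics for column "x".
raw = [
    {"num_rows": 10, "statistics": {"x": (0, 9)}},
    {"num_rows": 5, "statistics": {"x": (10, 14)}},
]

fmt = ToyParquetFormat(ToyReaderOptions(parse_statistics=False))
row_groups = fmt.discover(raw) + fmt.ensure_metadata(raw)
```

With the flag disabled, neither `discover` nor `ensure_metadata` touches the statistics, so the option only has to be set once on the format rather than passed to each entry point separately.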

> [C++][Dataset] Make collecting/parsing statistics optional for ParquetFragment
> ------------------------------------------------------------------------------
>
>                 Key: ARROW-9459
>                 URL: https://issues.apache.org/jira/browse/ARROW-9459
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset, dataset-dask-integration
>
> See some timing checks here: 
> https://github.com/dask/dask/pull/6346#issuecomment-656548675
> Parsing all statistics, even from a centralized {{_metadata}} file, can be 
> quite expensive. If you know in advance that you are not going to use them 
> (e.g. you are only going to do filtering on the partition fields, and otherwise 
> read all data), it could be nice to have an option to disable parsing 
> statistics.
> cc [~rjzamora] [~bkietz] [~fsaintjacques]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
