[
https://issues.apache.org/jira/browse/ARROW-9321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wes McKinney resolved ARROW-9321.
---------------------------------
Resolution: Fixed
Issue resolved by pull request 7692
[https://github.com/apache/arrow/pull/7692]
> [C++][Dataset] Allow to "collect" statistics for ParquetFragment row groups
> if not constructed from _metadata
> -------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-9321
> URL: https://issues.apache.org/jira/browse/ARROW-9321
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Joris Van den Bossche
> Assignee: Ben Kietzman
> Priority: Major
> Labels: dataset, dataset-dask-integration, pull-request-available
> Fix For: 1.0.0
>
> Time Spent: 3h 40m
> Remaining Estimate: 0h
>
> Right now, the statistics of the {{RowGroupInfo}} of ParquetFileFragments are
> only available when the dataset was constructed from a {{_metadata}} file:
> {code:python}
> import pandas as pd
> df = pd.DataFrame({"part": ['A', 'A', 'B', 'B'], "col": range(4)})
>
>
> # use dask to write partitioned dataset *with* _metadata file
> import dask.dataframe as dd
>
>
> ddf = dd.from_pandas(df, npartitions=2)
> ddf.to_parquet("test_dataset", partition_on=["part"], engine="pyarrow")
>
>
> import pyarrow.dataset as ds
> dataset_no_metadata = ds.dataset("test_dataset/", format="parquet",
> partitioning="hive")
> dataset_from_metadata = ds.parquet_dataset("test_dataset/_metadata",
> partitioning="hive")
>
> {code}
> {code}
> In [28]: list(dataset_no_metadata.get_fragments())[0].row_groups
>
>
> In [30]: list(dataset_from_metadata.get_fragments())[0].row_groups
>
>
> Out[30]: [<pyarrow._dataset.RowGroupInfo at 0x7fd7882c0030>]
> In [32]:
> list(dataset_from_metadata.get_fragments())[0].row_groups[0].statistics
>
>
> Out[32]: {'col': {'min': 2, 'max': 3}, 'index': {'min': 2, 'max': 3}}
> {code}
> For some applications (eg dask), one could want access to those statistics,
> even if the original dataset / fragments were not created from a
> {{_metadata}} file. This should not happen automatically since it's costly,
> but a method to trigger collecting all metadata would be useful.
> cc [~rjzamora]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)