[ https://issues.apache.org/jira/browse/ARROW-9321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated ARROW-9321: ---------------------------------- Labels: dataset dataset-dask-integration pull-request-available (was: dataset dataset-dask-integration) > [C++][Dataset] Allow to "collect" statistics for ParquetFragment row groups > if not constructed from _metadata > ------------------------------------------------------------------------------------------------------------- > > Key: ARROW-9321 > URL: https://issues.apache.org/jira/browse/ARROW-9321 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ > Reporter: Joris Van den Bossche > Assignee: Ben Kietzman > Priority: Major > Labels: dataset, dataset-dask-integration, pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Right now, the statistics of the {{RowGroupInfo}} of ParquetFileFragments are > only available when the dataset was constructed from a {{_metadata}} file: > {code:python} > import pandas as pd > df = pd.DataFrame({"part": ['A', 'A', 'B', 'B'], "col": range(4)}) > > > # use dask to write partitioned dataset *with* _metadata file > import dask.dataframe as dd > > > ddf = dd.from_pandas(df, npartitions=2) > ddf.to_parquet("test_dataset", partition_on=["part"], engine="pyarrow") > > > import pyarrow.dataset as ds > dataset_no_metadata = ds.dataset("test_dataset/", format="parquet", > partitioning="hive") > dataset_from_metadata = ds.parquet_dataset("test_dataset/_metadata", > partitioning="hive") > > {code} > {code} > In [28]: list(dataset_no_metadata.get_fragments())[0].row_groups > > > In [30]: list(dataset_from_metadata.get_fragments())[0].row_groups > > > Out[30]: [<pyarrow._dataset.RowGroupInfo at 0x7fd7882c0030>] > In [32]: > list(dataset_from_metadata.get_fragments())[0].row_groups[0].statistics > > > Out[32]: {'col': {'min': 2, 'max': 3}, 'index': {'min': 2, 'max': 3}} > {code} > For some applications (eg dask), one could want access to those statistics, > even if the original dataset / fragments were not created from a > {{_metadata}} file. This should not happen automatically since it's costly, > but a method to trigger collecting all metadata would be useful. > cc [~rjzamora] -- This message was sent by Atlassian Jira (v8.3.4#803005)