[ 
https://issues.apache.org/jira/browse/ARROW-9321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9321:
----------------------------------
    Labels: dataset dataset-dask-integration pull-request-available  (was: 
dataset dataset-dask-integration)

> [C++][Dataset] Allow to "collect" statistics for ParquetFragment row groups 
> if not constructed from _metadata
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-9321
>                 URL: https://issues.apache.org/jira/browse/ARROW-9321
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Assignee: Ben Kietzman
>            Priority: Major
>              Labels: dataset, dataset-dask-integration, pull-request-available
>             Fix For: 1.0.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Right now, the statistics of the {{RowGroupInfo}} of ParquetFileFragments are 
> only available when the dataset was constructed from a {{_metadata}} file:
> {code:python}
> import pandas as pd
> df = pd.DataFrame({"part": ['A', 'A', 'B', 'B'], "col": range(4)})            
>                                                                               
>                                               
> # use dask to write partitioned dataset *with* _metadata file
> import dask.dataframe as dd                                                   
>                                                                               
>                                               
> ddf = dd.from_pandas(df, npartitions=2) 
> ddf.to_parquet("test_dataset", partition_on=["part"], engine="pyarrow")       
>                                                                               
>                                 
> import pyarrow.dataset as ds
> dataset_no_metadata = ds.dataset("test_dataset/", format="parquet", 
> partitioning="hive")
> dataset_from_metadata = ds.parquet_dataset("test_dataset/_metadata", 
> partitioning="hive")                                                          
>                                                        
> {code}
> {code}
> In [28]: list(dataset_no_metadata.get_fragments())[0].row_groups              
>                                                                               
>                                                        
> In [30]: list(dataset_from_metadata.get_fragments())[0].row_groups            
>                                                                               
>                                                        
> Out[30]: [<pyarrow._dataset.RowGroupInfo at 0x7fd7882c0030>]
> In [32]: 
> list(dataset_from_metadata.get_fragments())[0].row_groups[0].statistics       
>                                                                               
>                                               
> Out[32]: {'col': {'min': 2, 'max': 3}, 'index': {'min': 2, 'max': 3}}
> {code}
> For some applications (eg dask), one could want access to those statistics, 
> even if the original dataset / fragments were not created from a 
> {{_metadata}} file. This should not happen automatically since it's costly, 
> but a method to trigger collecting all metadata would be useful.
> cc [~rjzamora] 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to