[ https://issues.apache.org/jira/browse/PARQUET-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Uwe Korn moved ARROW-7732 to PARQUET-1783:
------------------------------------------

      Component/s:     (was: C++)
                   parquet-cpp
              Key: PARQUET-1783  (was: ARROW-7732)
Affects Version/s:     (was: 0.15.1)
                       (was: 0.16.0)
                   cpp-1.6.0
         Workflow: patch-available, re-open possible  (was: jira)
          Project: Parquet  (was: Apache Arrow)

> [C++] Parquet statistics wrong for dictionary type
> --------------------------------------------------
>
>                 Key: PARQUET-1783
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1783
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>    Affects Versions: cpp-1.6.0
>            Reporter: Florian Jetter
>            Priority: Major
>
> h3. Observed behaviour
> Statistics for categorical data are equivalent for all row groups and refer
> to the entire {{CategoricalDtype}} instead of the data included in the row
> group.
> h3. Expected behaviour
> The row group statistics should only include data which is part of the actual
> row group, not the entire {{CategoricalDtype}}
> h3. Minimal example
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> test_df = pd.DataFrame({"categorical": pd.Categorical(["1", "42"])})
> table = pa.Table.from_pandas(test_df)
> pq.write_table(
>     table,
>     "test_parquet",
>     chunk_size=1,
> )
> test_parquet = pq.ParquetFile("test_parquet")
> test_parquet.metadata.row_group(0).column(0).statistics
> {code}
> {code:java}
> Out[1]:
> <pyarrow._parquet.Statistics object at 0x1163b5280>
>   has_min_max: True
>   min: 1
>   max: 42
>   null_count: 0
>   distinct_count: 0
>   num_values: 1
>   physical_type: BYTE_ARRAY
>   logical_type: String
>   converted_type (legacy): UTF8
> {code}
> Expected would be
> {{min:1}} {{max:1}} instead of {{max: 42}} for the first row group
>
> Tested with
>  pandas==1.0.0
>  pyarrow==bd08d0ecbe355b9e0de7d07e8b9ff6ccdb150e73 (current master /
> essentially 0.16.0)

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
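
The gap between the observed and expected statistics in the report above can be illustrated with a small pure-Python sketch. It is not the parquet-cpp implementation. It only models a dictionary-encoded column as a list of dictionary values plus per-row-group index lists, and the function names `dictionary_min_max` and `row_group_min_max` are hypothetical, introduced for illustration.

```python
def dictionary_min_max(dictionary):
    """Min/max over every dictionary entry (the behaviour reported as a bug)."""
    return min(dictionary), max(dictionary)


def row_group_min_max(dictionary, indices):
    """Min/max over only the values that the row group's indices reference."""
    present = [dictionary[i] for i in indices if i is not None]
    if not present:
        return None  # all-null row group: no min/max to report
    return min(present), max(present)


# Same data as the minimal example: categories "1" and "42",
# written with one value per row group.
dictionary = ["1", "42"]
row_groups = [[0], [1]]

# Observed in the report: every row group carries the whole dictionary's range.
observed = [dictionary_min_max(dictionary) for _ in row_groups]

# Expected in the report: each row group reflects only its own values.
expected = [row_group_min_max(dictionary, idx) for idx in row_groups]

print(observed)  # [('1', '42'), ('1', '42')]
print(expected)  # [('1', '1'), ('42', '42')]
```

The values are compared as strings here, matching the BYTE_ARRAY / String types shown in the reported statistics output.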