[jira] [Moved] (PARQUET-1783) [C++] Parquet statistics wrong for dictionary type

Uwe Korn (Jira) Tue, 04 Feb 2020 04:39:48 -0800


     [ https://issues.apache.org/jira/browse/PARQUET-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Uwe Korn moved ARROW-7732 to PARQUET-1783:
------------------------------------------

          Component/s:     (was: C++)
                       parquet-cpp
                  Key: PARQUET-1783  (was: ARROW-7732)
    Affects Version/s:     (was: 0.15.1)
                           (was: 0.16.0)
                       cpp-1.6.0
             Workflow: patch-available, re-open possible  (was: jira)
              Project: Parquet  (was: Apache Arrow)

> [C++] Parquet statistics wrong for dictionary type
> --------------------------------------------------
>
>                 Key: PARQUET-1783
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1783
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>    Affects Versions: cpp-1.6.0
>            Reporter: Florian Jetter
>            Priority: Major
>
> h3. Observed behaviour
> Statistics for categorical data are identical across all row groups and 
> reflect the entire {{CategoricalDtype}} rather than only the data contained 
> in each row group.
> h3. Expected behaviour
> The statistics of a row group should cover only the data that is part of 
> that row group, not the entire {{CategoricalDtype}}.
> h3. Minimal example
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
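> # Two rows with distinct categories, so the dictionary holds both "1" and "42".
> test_df = pd.DataFrame({"categorical": pd.Categorical(["1", "42"])})
> table = pa.Table.from_pandas(test_df)
> # chunk_size=1 puts each row into its own row group.
> pq.write_table(
>     table,
>     "test_parquet",
>     chunk_size=1,
> )
> test_parquet = pq.ParquetFile("test_parquet")
> # Statistics of the first row group, which contains only the value "1".
> test_parquet.metadata.row_group(0).column(0).statistics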
> {code}
> {code:java}
> Out[1]:
> <pyarrow._parquet.Statistics object at 0x1163b5280>
>   has_min_max: True
>   min: 1
>   max: 42
>   null_count: 0
>   distinct_count: 0
>   num_values: 1
>   physical_type: BYTE_ARRAY
>   logical_type: String
>   converted_type (legacy): UTF8
> {code}
> Expected for the first row group: {{min: 1}} and {{max: 1}}, instead of 
> {{max: 42}}.
>  
> Tested with 
>  pandas==1.0.0
>  pyarrow==bd08d0ecbe355b9e0de7d07e8b9ff6ccdb150e73 (current master / 
> essentially 0.16.0)
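
To make the expected behaviour concrete, here is a minimal pure-Python sketch of per-row-group statistics for dictionary-encoded data. It is not the parquet-cpp implementation; the helper name `row_group_min_max` and the plain-list representation of the dictionary and indices are assumptions made for illustration only. The point is that min/max are computed from the values each row group actually references, not from the whole dictionary.

```python
def row_group_min_max(dictionary, indices, row_group_size):
    """Return one (min, max) pair per row group.

    Hypothetical helper: `dictionary` is the list of category values,
    `indices` the per-row positions into it (None marks a null).
    """
    stats = []
    for start in range(0, len(indices), row_group_size):
        chunk = indices[start:start + row_group_size]
        # Decode only the dictionary entries referenced in this row group.
        values = [dictionary[i] for i in chunk if i is not None]
        stats.append((min(values), max(values)) if values else (None, None))
    return stats


# Same data as the minimal example: two categories, one row per row group.
dictionary = ["1", "42"]
indices = [0, 1]

# Expected behaviour: statistics reflect each row group's own values.
expected = row_group_min_max(dictionary, indices, row_group_size=1)

# Reported behaviour: every row group carries the whole dictionary's min/max.
reported = [(min(dictionary), max(dictionary))] * len(indices)

print("expected:", expected)
print("reported:", reported)
```

With the data from the report, the first row group should end up with min and max both equal to "1", whereas the reported output carries "42" as the max for every row group.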



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
