[jira] [Comment Edited] (PARQUET-1783) [C++] Parquet statistics wrong for dictionary type

Francois Saint-Jacques (Jira) Tue, 04 Feb 2020 04:53:00 -0800


    [ https://issues.apache.org/jira/browse/PARQUET-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17029822#comment-17029822 ]


Francois Saint-Jacques edited comment on PARQUET-1783 at 2/4/20 12:51 PM:
--------------------------------------------------------------------------

There's a 
[TODO|https://github.com/apache/arrow/blob/0326ea34b63ae399582a99d60f0d23cc03aaa628/cpp/src/parquet/column_writer.cc#L1179-L1183]
 about it. I would say this is important: without accurate per-row-group 
statistics, predicate pushdown is useless for Arrow-written files with 
dictionary column types.
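
To illustrate why pushdown stops working, here is a minimal sketch in plain Python. It is not Arrow or Parquet reader code: `RowGroupStats` and `row_groups_to_scan` are hypothetical names, and the statistics are hard-coded from the values in the report. It only shows that a min/max filter cannot skip any row group when every row group advertises the whole dictionary's range.

```python
from typing import List, NamedTuple


class RowGroupStats(NamedTuple):
    """Hypothetical stand-in for one row group's min/max statistics."""
    min: str
    max: str


def row_groups_to_scan(stats: List[RowGroupStats], value: str) -> List[int]:
    """Return indices of row groups whose [min, max] range could contain value."""
    return [i for i, s in enumerate(stats) if s.min <= value <= s.max]


# Two row groups holding "1" and "42" respectively, as in the reported example.
accurate = [RowGroupStats("1", "1"), RowGroupStats("42", "42")]
# What the issue reports: every row group carries the whole dictionary's range.
reported = [RowGroupStats("1", "42"), RowGroupStats("1", "42")]

print(row_groups_to_scan(accurate, "42"))  # only the second row group
print(row_groups_to_scan(reported, "42"))  # both row groups, nothing is pruned
```

With accurate statistics, a filter for "42" reads one row group; with the reported statistics it has to read both.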


was (Author: fsaintjacques):
There's a 
[TODO|https://github.com/apache/arrow/blob/0326ea34b63ae399582a99d60f0d23cc03aaa628/cpp/src/parquet/column_writer.cc#L1179-L1183]
 about it.

> [C++] Parquet statistics wrong for dictionary type
> --------------------------------------------------
>
>                 Key: PARQUET-1783
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1783
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>    Affects Versions: cpp-1.6.0
>            Reporter: Florian Jetter
>            Priority: Major
>
> h3. Observed behaviour
> Statistics for categorical data are identical for all row groups and 
> describe the entire {{CategoricalDtype}} instead of only the data in that 
> row group.
> h3. Expected behaviour
> The row group statistics should cover only the data in that row group, not 
> the entire {{CategoricalDtype}}.
> h3. Minimal example
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> test_df = pd.DataFrame({"categorical": pd.Categorical(["1", "42"])})
> table = pa.Table.from_pandas(test_df)
> pq.write_table(
>     table,
>     "test_parquet",
>     chunk_size=1,
> )
> test_parquet = pq.ParquetFile("test_parquet")
> test_parquet.metadata.row_group(0).column(0).statistics
> {code}
> {code:java}
> Out[1]:
> <pyarrow._parquet.Statistics object at 0x1163b5280>
>   has_min_max: True
>   min: 1
>   max: 42
>   null_count: 0
>   distinct_count: 0
>   num_values: 1
>   physical_type: BYTE_ARRAY
>   logical_type: String
>   converted_type (legacy): UTF8
> {code}
> Expected for the first row group: {{min: 1}} and {{max: 1}}, instead of 
> {{max: 42}}.
>  
> Tested with 
>  pandas==1.0.0
>  pyarrow==bd08d0ecbe355b9e0de7d07e8b9ff6ccdb150e73 (current master / 
> essentially 0.16.0)
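
The gap between observed and expected behaviour in the report can be modelled in a few lines of plain Python. This is a minimal sketch, not the parquet-cpp implementation: the function names are hypothetical, and it assumes the writer derives min/max from the dictionary rather than from the values each row group references, which is what the reported output suggests.

```python
from typing import Sequence, Tuple


def stats_over_dictionary(dictionary: Sequence[str], indices: Sequence[int]) -> Tuple[str, str]:
    """Min/max over every dictionary entry, whether or not the chunk uses it."""
    return min(dictionary), max(dictionary)


def stats_over_observed(dictionary: Sequence[str], indices: Sequence[int]) -> Tuple[str, str]:
    """Min/max over only the entries that the chunk's indices reference.

    Nulls and empty chunks are not handled in this sketch.
    """
    observed = [dictionary[i] for i in indices]
    return min(observed), max(observed)


# Categories "1" and "42", written with one value per row group.
dictionary = ["1", "42"]
row_groups = [[0], [1]]

for indices in row_groups:
    print(stats_over_dictionary(dictionary, indices),
          stats_over_observed(dictionary, indices))
```

For the first row group, the first function yields the reported `min: 1`, `max: 42`, while the second yields the expected `min: 1`, `max: 1`.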



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
