[ 
https://issues.apache.org/jira/browse/ARROW-12513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kirill Lykov updated ARROW-12513:
---------------------------------
    Comment: was deleted

(was: Most probably something goes wrong here -- 
https://github.com/apache/arrow/blob/14b75ee71d770ba86999e0e7a0e0b94629b91968/cpp/src/parquet/column_writer.cc#L1456
The null_count is computed, I believe, here -- 
https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/parquet/level_conversion_inc.h#L351

Looking at this code, I don't yet see why it fails for strings but works for 
other types.
Since [~emkornfield] was the last to modify this code, a question for him -- do 
you think I'm looking in the right place? And if so, any hint as to what could 
be wrong for strings?)

> [C++][Parquet] Parquet Writer always puts null_count=0 in Parquet statistics 
> for dictionary-encoded array with nulls
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-12513
>                 URL: https://issues.apache.org/jira/browse/ARROW-12513
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Parquet, Python
>    Affects Versions: 1.0.1, 2.0.0, 3.0.0
>         Environment: RHEL6
>            Reporter: David Beach
>            Assignee: Kirill Lykov
>            Priority: Critical
>              Labels: parquet-statistics
>
> When a Table containing dictionary-encoded array columns is written as 
> Parquet, those columns show an incorrect null_count of 0 in the Parquet 
> metadata.  If the same data is saved without dictionary-encoding the array, 
> then the null_count is correct.
> Confirmed bug with PyArrow 1.0.1, 2.0.0, and 3.0.0.
> NOTE: I'm a PyArrow user, but I believe this bug is actually in the C++ 
> implementation of the Arrow/Parquet writer.
> h3. Setup
> {code:python}
> import pyarrow as pa
> from pyarrow import parquet{code}
> h3. Bug
> (writes a dictionary-encoded Arrow array to Parquet)
> {code:python}
> array1 = pa.array([None, 'foo', 'bar'] * 5, type=pa.string())
> assert array1.null_count == 5
> array1dict = array1.dictionary_encode()
> assert array1dict.null_count == 5
> table = pa.Table.from_arrays([array1dict], ["mycol"])
> parquet.write_table(table, "testtable.parquet")
> meta = parquet.read_metadata("testtable.parquet")
> meta.row_group(0).column(0).statistics.null_count  # RESULT: 0 (WRONG!){code}
> h3. Correct
> (writes same data without dictionary encoding the Arrow array)
> {code:python}
> array1 = pa.array([None, 'foo', 'bar'] * 5, type=pa.string())
> assert array1.null_count == 5
> table = pa.Table.from_arrays([array1], ["mycol"])
> parquet.write_table(table, "testtable.parquet")
> meta = parquet.read_metadata("testtable.parquet")
> meta.row_group(0).column(0).statistics.null_count  # RESULT: 5 (CORRECT)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
