[ https://issues.apache.org/jira/browse/ARROW-12513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kirill Lykov updated ARROW-12513:
---------------------------------
Comment: was deleted

(was: Most probably something goes wrong here -- https://github.com/apache/arrow/blob/14b75ee71d770ba86999e0e7a0e0b94629b91968/cpp/src/parquet/column_writer.cc#L1456
The null_count is computed, I believe, here -- https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/parquet/level_conversion_inc.h#L351
Looking into this code, I don't yet see why it fails for strings but works for other types. Since this code was last modified by [~emkornfield], a question for him -- do you think I'm looking in the right place? And if so, any hint as to what could be wrong for strings?)

> [C++][Parquet] Parquet Writer always puts null_count=0 in Parquet statistics for dictionary-encoded array with nulls
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-12513
>                 URL: https://issues.apache.org/jira/browse/ARROW-12513
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Parquet, Python
>    Affects Versions: 1.0.1, 2.0.0, 3.0.0
>         Environment: RHEL6
>            Reporter: David Beach
>            Assignee: Kirill Lykov
>            Priority: Critical
>              Labels: parquet-statistics
>
> When writing a Table as Parquet, columns represented as dictionary-encoded arrays show an incorrect null_count of 0 in the Parquet metadata. If the same data is saved without dictionary-encoding the array, the null_count is correct.
> Confirmed bug with PyArrow 1.0.1, 2.0.0, and 3.0.0.
> NOTE: I'm a PyArrow user, but I believe this bug is actually in the C++ implementation of the Arrow/Parquet writer.
> h3. Setup
> {code:python}
> import pyarrow as pa
> from pyarrow import parquet
> {code}
> h3. Bug
> (writes a dictionary-encoded Arrow array to parquet)
> {code:python}
> array1 = pa.array([None, 'foo', 'bar'] * 5, type=pa.string())
> assert array1.null_count == 5
> array1dict = array1.dictionary_encode()
> assert array1dict.null_count == 5
> table = pa.Table.from_arrays([array1dict], ["mycol"])
> parquet.write_table(table, "testtable.parquet")
> meta = parquet.read_metadata("testtable.parquet")
> meta.row_group(0).column(0).statistics.null_count  # RESULT: 0 (WRONG!)
> {code}
> h3. Correct
> (writes the same data without dictionary-encoding the Arrow array)
> {code:python}
> array1 = pa.array([None, 'foo', 'bar'] * 5, type=pa.string())
> assert array1.null_count == 5
> table = pa.Table.from_arrays([array1], ["mycol"])
> parquet.write_table(table, "testtable.parquet")
> meta = parquet.read_metadata("testtable.parquet")
> meta.row_group(0).column(0).statistics.null_count  # RESULT: 5 (CORRECT)
> {code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)