[
https://issues.apache.org/jira/browse/ARROW-12513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377460#comment-17377460
]
Kirill Lykov commented on ARROW-12513:
--------------------------------------
Looks like there are different call stacks for int32 and strings.
For int32:
column_writer.cc:1438
->WriteArrowDense->WriteArrowZeroCopy->WriteBatchSpaced->WriteValuesSpaced
There we compute num_nulls = num_spaced_values - num_values
and pass the value down (see the pyarrow sanity check below):
->UpdateSpaced->IncrementNullCount
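As a sanity check of this dense path (a minimal pyarrow sketch, assuming the same semantics as the reporter's repro; the file name is arbitrary), a plain int32 array with nulls keeps the correct null_count:
{code:python}
import pyarrow as pa
from pyarrow import parquet

# Dense (non-dictionary) int32 column: goes through WriteArrowDense,
# so num_nulls = num_spaced_values - num_values reaches IncrementNullCount.
arr = pa.array([None, 1, 2] * 5, type=pa.int32())
table = pa.Table.from_arrays([arr], ["mycol"])
parquet.write_table(table, "int32_dense.parquet")
meta = parquet.read_metadata("int32_dense.parquet")
assert meta.row_group(0).column(0).statistics.null_count == 5  # correct
{code}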
--------------
while for string:
->WriteArrow->WriteArrowDictionary
there is a suspicious TODO comment by wesm ("If some dictionary values are
unobserved...").
Here we receive an array as input and build a dictionary out of it.
Note that while the input Array may have null_count != 0, the dictionary will
have null_count == 0 (see the pyarrow sketch below):
https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L1455
This happens because in
https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/array_dict.cc#L111
we create the array without null_count information.
Later this dictionary is passed down and we set null_count to 0:
->Update
-- the dictionary values Array has null_count == 0
->IncrementNullCount
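This split is visible from Python too: in a DictionaryArray the nulls are tracked on the indices, while the dictionary values array itself reports null_count == 0, which is what Update ends up seeing (an illustrative pyarrow sketch):
{code:python}
import pyarrow as pa

arr = pa.array([None, 'foo', 'bar'] * 5, type=pa.string())
dict_arr = arr.dictionary_encode()

assert dict_arr.null_count == 5             # nulls on the DictionaryArray...
assert dict_arr.indices.null_count == 5     # ...live in its indices
assert dict_arr.dictionary.null_count == 0  # values array carries no null info
{code}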
But this is not the only difference between the two call stacks.
For strings we return here:
https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L1303
so the value stored in new_null_count is lost.
> [C++][Parquet] Parquet Writer always puts null_count=0 in Parquet statistics
> for dictionary-encoded array with nulls
> --------------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-12513
> URL: https://issues.apache.org/jira/browse/ARROW-12513
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Parquet, Python
> Affects Versions: 1.0.1, 2.0.0, 3.0.0
> Environment: RHEL6
> Reporter: David Beach
> Assignee: Kirill Lykov
> Priority: Critical
> Labels: parquet-statistics
>
> When writing a Table to Parquet, columns represented as dictionary-encoded
> arrays show an incorrect null_count of 0 in the Parquet metadata. If the same
> data is saved without dictionary-encoding the array, the null_count is
> correct.
> Confirmed bug with PyArrow 1.0.1, 2.0.0, and 3.0.0.
> NOTE: I'm a PyArrow user, but I believe this bug is actually in the C++
> implementation of the Arrow/Parquet writer.
> h3. Setup
> {code:python}
> import pyarrow as pa
> from pyarrow import parquet{code}
> h3. Bug
> (writes a dictionary-encoded Arrow array to Parquet)
> {code:python}
> array1 = pa.array([None, 'foo', 'bar'] * 5, type=pa.string())
> assert array1.null_count == 5
> array1dict = array1.dictionary_encode()
> assert array1dict.null_count == 5
> table = pa.Table.from_arrays([array1dict], ["mycol"])
> parquet.write_table(table, "testtable.parquet")
> meta = parquet.read_metadata("testtable.parquet")
> meta.row_group(0).column(0).statistics.null_count # RESULT: 0 (WRONG!){code}
> h3. Correct
> (writes the same data without dictionary-encoding the Arrow array)
> {code:python}
> array1 = pa.array([None, 'foo', 'bar'] * 5, type=pa.string())
> assert array1.null_count == 5
> table = pa.Table.from_arrays([array1], ["mycol"])
> parquet.write_table(table, "testtable.parquet")
> meta = parquet.read_metadata("testtable.parquet")
> meta.row_group(0).column(0).statistics.null_count # RESULT: 5 (CORRECT)
> {code}
>