[https://issues.apache.org/jira/browse/ARROW-12513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377579#comment-17377579]
Weston Pace commented on ARROW-12513:
-------------------------------------
The comment is in the right spot but it's a little unclear what it is saying.
At the risk of being redundant, I'll add some explanation. There are two ways
that a null can be stored in a dictionary-encoded array. Keep in mind that
every dictionary-encoded array contains two arrays: a long array of indices
and a short array of values (usually called the "dictionary", but I will call
it "values" to avoid confusion).
The first method, called "MASK" in arrow::compute::DictionaryEncodeOptions, is
done by storing a 0 in the validity bitmap of the long indices array (so the
slot reads as null):
{code:java}
auto indices = ArrayFromJSON(int32(), "[3, 0, 0, 0, null, null, 3, 0, null, 3,
0, null]");
auto values = ArrayFromJSON(int64(), "[10, 20, 30, 40]");
{code}
The second method, called "ENCODE" in arrow::compute::DictionaryEncodeOptions
is done by storing a 0 in the values array and then referencing the index of
that null value.
{code:java}
auto indices = ArrayFromJSON(int32(), "[3, 0, 0, 0, 1, 1, 3, 0, 1, 3, 0, 1]");
auto values = ArrayFromJSON(int64(), "[10, null, 30, 40]");
{code}
Both of these represent the same (plain-encoded) array:
{code:java}
ArrayFromJSON(int64(), "[40, 10, 10, 10, null, null, 40, 10, null, 40, 10,
null]");
{code}
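As an aside, here is a minimal sketch (assuming the Arrow C++ factory and
compute APIs; error handling omitted) of how either indices/values pair above
combines into a dictionary array, and how casting back to the value type
reverses the encoding:
{code:java}
// Sketch only: assemble either indices/values pair from above into a
// DictionaryArray, then cast back to the value type to decode it.
auto dict_type = dictionary(int32(), int64());
std::shared_ptr<Array> dict_array =
    std::make_shared<DictionaryArray>(dict_type, indices, values);
// arrow::compute::Cast returns a Result<Datum>; ValueOrDie is fine here.
auto plain =
    arrow::compute::Cast(dict_array, int64()).ValueOrDie().make_array();
{code}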
The two approaches can also be mixed. At one point in time (and I'm pretty
sure it is still true today) an array created with ENCODE will not report a
correct null count, because the null count only considers how many bits are
masked in the indices' validity bitmap and ignores nulls stored in the values
array.
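For example, in a hypothetical mixed array (same style as above):
{code:java}
// Position 2 of the indices is validity-masked (MASK-style), while
// indices equal to 1 point at a null in the values array (ENCODE-style).
auto indices = ArrayFromJSON(int32(), "[3, 0, null, 1, 1, 3]");
auto values = ArrayFromJSON(int64(), "[10, null, 30, 40]");
// This decodes to [40, 10, null, null, null, 40], i.e. three nulls, but
// a null count computed from masked bits alone would report only 1.
{code}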
Currently Arrow will prevent you from trying to write Parquet data unless the
dictionary is fully MASKed (that is, there are no nulls in the values array):
[https://github.com/apache/arrow/blob/81ff679c47754692224f655dab32cc0936bb5f55/cpp/src/parquet/arrow/path_internal.cc#L753]
So, when actually writing an Arrow dictionary array to Parquet, the writer
first splits the array into its two components (the indices array and the
values array). It then uses the values array to update statistics. This makes
sense for min/max. However, for the null count, we have already ensured that
the values array does not have a null in it.
So we should be using the indices array for the null count. Given that we
know the values array contains no nulls, we can safely assume that all nulls
are stored in the indices array, which should always give us the correct
value.
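In other words (a minimal sketch of the idea, not the actual writer code;
dict_array here is a stand-in for the arrow::DictionaryArray being written):
{code:java}
// The writer has already rejected dictionaries with nulls in the values
// array, so every null must live in the indices' validity bitmap and the
// indices' null count is the whole array's null count.
int64_t stats_null_count = dict_array.indices()->null_count();
{code}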
I can take a stab at a fix if no one else is planning on it.
> [C++][Parquet] Parquet Writer always puts null_count=0 in Parquet statistics
> for dictionary-encoded array with nulls
> --------------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-12513
> URL: https://issues.apache.org/jira/browse/ARROW-12513
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Parquet, Python
> Affects Versions: 1.0.1, 2.0.0, 3.0.0
> Environment: RHEL6
> Reporter: David Beach
> Assignee: Kirill Lykov
> Priority: Critical
> Labels: parquet-statistics
>
> When writing a Table as Parquet, when the table contains columns represented
> as dictionary-encoded arrays, those columns show an incorrect null_count of 0
> in the Parquet metadata. If the same data is saved without
> dictionary-encoding the array, then the null_count is correct.
> Confirmed bug with PyArrow 1.0.1, 2.0.0, and 3.0.0.
> NOTE: I'm a PyArrow user, but I believe this bug is actually in the C++
> implementation of the Arrow/Parquet writer.
> h3. Setup
> {code:python}
> import pyarrow as pa
> from pyarrow import parquet
> {code}
> h3. Bug
> (writes a dictionary encoded Arrow array to parquet)
> {code:python}
> array1 = pa.array([None, 'foo', 'bar'] * 5, type=pa.string())
> assert array1.null_count == 5
> array1dict = array1.dictionary_encode()
> assert array1dict.null_count == 5
> table = pa.Table.from_arrays([array1dict], ["mycol"])
> parquet.write_table(table, "testtable.parquet")
> meta = parquet.read_metadata("testtable.parquet")
> meta.row_group(0).column(0).statistics.null_count  # RESULT: 0 (WRONG!)
> {code}
> h3. Correct
> (writes same data without dictionary encoding the Arrow array)
> {code:python}
> array1 = pa.array([None, 'foo', 'bar'] * 5, type=pa.string())
> assert array1.null_count == 5
> table = pa.Table.from_arrays([array1], ["mycol"])
> parquet.write_table(table, "testtable.parquet")
> meta = parquet.read_metadata("testtable.parquet")
> meta.row_group(0).column(0).statistics.null_count # RESULT: 5 (CORRECT)
> {code}
>