[
https://issues.apache.org/jira/browse/ARROW-12513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17378140#comment-17378140
]
Micah Kornfield commented on ARROW-12513:
-----------------------------------------
[~westonpace] my thinking here is that accurate statistics are worth more
than any performance gain. If we ever want to support Parquet indexes, we
will need to visit each value per page anyway.
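For illustration, a minimal PyArrow sketch of that point (assuming the usual
dictionary representation, where the nulls travel with the indices): an
accurate null count only needs the index values the writer already visits.
{code:python}
import pyarrow as pa

# Nulls in a dictionary-encoded array show up on the index array, so counting
# them amounts to scanning the indices that get written out per page anyway.
array1 = pa.array([None, 'foo', 'bar'] * 5, type=pa.string())
array1dict = array1.dictionary_encode()

assert array1dict.null_count == 5
assert array1dict.indices.null_count == 5  # same nulls, visible per value
{code}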
> [C++][Parquet] Parquet Writer always puts null_count=0 in Parquet statistics
> for dictionary-encoded array with nulls
> --------------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-12513
> URL: https://issues.apache.org/jira/browse/ARROW-12513
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Parquet, Python
> Affects Versions: 1.0.1, 2.0.0, 3.0.0
> Environment: RHEL6
> Reporter: David Beach
> Priority: Critical
> Labels: parquet-statistics
>
> When writing a Table as Parquet, columns backed by dictionary-encoded arrays
> show an incorrect null_count of 0 in the Parquet metadata. If the same data
> is written without dictionary-encoding the array, the null_count is correct.
> Confirmed with PyArrow 1.0.1, 2.0.0, and 3.0.0.
> NOTE: I'm a PyArrow user, but I believe this bug is actually in the C++
> implementation of the Arrow/Parquet writer.
> h3. Setup
> {code:python}
> import pyarrow as pa
> from pyarrow import parquet{code}
> h3. Bug
> (writes a dictionary-encoded Arrow array to Parquet)
> {code:python}
> array1 = pa.array([None, 'foo', 'bar'] * 5, type=pa.string())
> assert array1.null_count == 5
> array1dict = array1.dictionary_encode()
> assert array1dict.null_count == 5
> table = pa.Table.from_arrays([array1dict], ["mycol"])
> parquet.write_table(table, "testtable.parquet")
> meta = parquet.read_metadata("testtable.parquet")
> meta.row_group(0).column(0).statistics.null_count # RESULT: 0 (WRONG!){code}
> h3. Correct
> (writes the same data without dictionary-encoding the Arrow array)
> {code:python}
> array1 = pa.array([None, 'foo', 'bar'] * 5, type=pa.string())
> assert array1.null_count == 5
> table = pa.Table.from_arrays([array1], ["mycol"])
> parquet.write_table(table, "testtable.parquet")
> meta = parquet.read_metadata("testtable.parquet")
> meta.row_group(0).column(0).statistics.null_count # RESULT: 5 (CORRECT)
> {code}
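> h3. Possible workaround (sketch)
> If a table already holds dictionary columns, casting them back to their value
> type before writing routes the data through the non-dictionary path shown
> above. Sketch only; it costs an in-memory decode of the column.
> {code:python}
> array1 = pa.array([None, 'foo', 'bar'] * 5, type=pa.string())
> table = pa.Table.from_arrays([array1.dictionary_encode()], ["mycol"])
> # Decode the dictionary column by casting it back to plain strings.
> decoded = table.column("mycol").cast(pa.string())
> table2 = pa.Table.from_arrays([decoded], ["mycol"])
> parquet.write_table(table2, "testtable_decoded.parquet")
> meta = parquet.read_metadata("testtable_decoded.parquet")
> meta.row_group(0).column(0).statistics.null_count  # EXPECTED: 5
> {code}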
>