Re: [DISCUSS] Clarify num_nulls(null_counts) and distinct_counts in Parquet statistics

Andrew Lamb Fri, 16 Aug 2024 02:44:12 -0700

Thank you Xuwei,

Regarding parquet-rs and writing null_counts, I believe the current
behavior (not writing null_count where there are exactly 0 nulls) is a bug
and I made a PR to align the behavior with what the C/C++ and Java writers
do in [1].


The current behavior I think means that parquet files written by parquet-rs
won't have null_count set and thus reading them from other systems may not
be as fast as if those systems knew there were no nulls.

Applications that use parquet-rs to read parquet_files and interpret the
null_count will need to be changed after the upgrade to explicitly continue
the old behavior of "treat no null_count as 0" which is also documented
now.

The (current) behavior has the potential for "wrong results" for systems
that used parquet-rs to read parquet if for files where there are actually
nulls but the null_count field is not set (parquet-rs will report there are
0 nulls). However we haven't had any bug reports and none of the open
source writers write the statistics struct without also writing null_counts.

Hope that help,
andrew




[1]: https://github.com/apache/arrow-rs/pull/6257

On Fri, Aug 16, 2024 at 1:47 AM wish maple <[email protected]> wrote:

> Currently in our Parquet format, we have multiple null_count and
> distinct_count:
>
> 1. Statistics::null_count, which is an optional null-count
> 2. ColumnIndex::null_counts, which is similar to Statistics::null_count,
> but storing
>     in page index
> 3. DataPageHeaderV2::num_nulls, which means "null values count" in a data
> page
> 4. Statistics::distinct_count, which is an optional distinct-count
>
> I've checked the implementation in Parquet-C++, Parquet-Java and
> parquet-rs, for
> null-count:
> On writer side:
> * Parquet-Java and Parquet-C++ would always write null_count, even
>    if the null_count is 0 or the column is a non-nullable column
> * Parquet-rs would not write null_count if null count is 0 previously. This
> is likely to be
>    fixed in [1]
>
> The column-index would be similar.
>
> On reader side:
> * Parquet-java requires `null_count` to be set, otherwise it would regard
> the statistics as
>    "might contains null or not" [2]
> * Parquet-rs regard `num_nulls > 0` as has_nulls, and don't check the
> existence of null
>   [3]. The same properties is `num_nulls >= 0` in parquet-java [4]
>
> For num_nulls, I suggest:
> 1. Writer side should better write num_nulls / null_count even when
> num_nulls is
>     0 or column is not nullable
> 2. Reader should distinguish whether the null-count is set or not. When
> reading a
>     file from parquet-rs. We can convert num-nulls = 0 when it's not set?
>
> distinct_count is more weird in this. I‘ve checked this and find there're
> merely
> implementations that use this. So I wonder
> 1. Would this be exact?
> 2. Is there any use-cases for this?
>
> [1] https://github.com/apache/arrow-rs/issues/6256
> [2]
>
> https://github.com/apache/parquet-java/blob/d4384d3f2e7703fab6363fd9cd80a001e9561db2/parquet-hadoop/src/main/java/org/apache/parquet/filter2/statisticslevel/StatisticsFilter.java#L93
> [3]
>
> https://github.com/apache/arrow-rs/blob/042d725888358c73cd2a0d58868ea5c4bad778f7/parquet/src/file/statistics.rs#L401
> [4]
>
> https://github.com/apache/parquet-java/blob/master/parquet-column/src/main/java/org/apache/parquet/column/statistics/Statistics.java#L532
>
> Best,
> Xuwei Fu
>

Re: [DISCUSS] Clarify num_nulls(null_counts) and distinct_counts in Parquet statistics

Reply via email to