Got it -- makes sense -- thank you

On Sat, Aug 17, 2024 at 6:11 AM wish maple <[email protected]> wrote:

> > If it becomes a practical concern, we could look at the writer metadata,
> perhaps, to detect files written by older versions of parquet-rs
>
> Yes, as the code I listed currently, parquet-java will check the writer
> version
> and do some checking when the previous implementation has a "bug" or is
> inconsistent
> in some cases.
>
> If the user thinks it's important it's also a way for the parquet-rs.
>
> Best,
> Xuwei Fu
>
> Andrew Lamb <[email protected]> 于2024年8月17日周六 18:06写道:
>
> > > From my perspective, if arrow-rs can guarantee that "when statistics
> > exist
> > but the num_nulls not exist, it means null_count == 0", we can detect
> > the writer like [1].
> >
> > I agree with this conclusion
> >
> > However, I don't know how we could make that guarantee, given the parquet
> > file could come from other writers which could (in theory) produce
> > statistics exist but null_count not exist
> >
> > If it becomes a practical concern, we could look at the writer metadata,
> > perhaps, to detect files written by older versions of parquet-rs
> >
> > Andrew
> >
> >
> > On Fri, Aug 16, 2024 at 5:57 AM wish maple <[email protected]>
> wrote:
> >
> > > > The (current) behavior has the potential for "wrong results" for
> > systems
> > > > that used parquet-rs to read parquet if for files where there are
> > > actually
> > > > nulls but the null_count field is not set (parquet-rs will report
> there
> > > are
> > > > 0 nulls).
> > >
> > > From my perspective, if arrow-rs can guarantee that "when statistics
> > exist
> > > but the num_nulls not exist, it means null_count == 0", we can detect
> > > the writer like [1]. It's a bit hacking but it works
> > >
> > >
> > > [1]
> > >
> > >
> >
> https://github.com/apache/parquet-java/blob/d4384d3f2e7703fab6363fd9cd80a001e9561db2/parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java#L66
> > >
> > > Best,
> > > Xuwei
> > >
> > > Andrew Lamb <[email protected]> 于2024年8月16日周五 17:42写道:
> > >
> > > > Thank you Xuwei,
> > > >
> > > > Regarding parquet-rs and writing null_counts, I believe the current
> > > > behavior (not writing null_count where there are exactly 0 nulls) is
> a
> > > bug
> > > > and I made a PR to align the behavior with what the C/C++ and Java
> > > writers
> > > > do in [1].
> > > >
> > > > The current behavior I think means that parquet files written by
> > > parquet-rs
> > > > won't have null_count set and thus reading them from other systems
> may
> > > not
> > > > be as fast as if those systems knew there were no nulls.
> > > >
> > > > Applications that use parquet-rs to read parquet_files and interpret
> > the
> > > > null_count will need to be changed after the upgrade to explicitly
> > > continue
> > > > the old behavior of "treat no null_count as 0" which is also
> documented
> > > > now.
> > > >
> > > > The (current) behavior has the potential for "wrong results" for
> > systems
> > > > that used parquet-rs to read parquet if for files where there are
> > > actually
> > > > nulls but the null_count field is not set (parquet-rs will report
> there
> > > are
> > > > 0 nulls). However we haven't had any bug reports and none of the open
> > > > source writers write the statistics struct without also writing
> > > > null_counts.
> > > >
> > > > Hope that help,
> > > > andrew
> > > >
> > > >
> > > >
> > > >
> > > > [1]: https://github.com/apache/arrow-rs/pull/6257
> > > >
> > > > On Fri, Aug 16, 2024 at 1:47 AM wish maple <[email protected]>
> > > wrote:
> > > >
> > > > > Currently in our Parquet format, we have multiple null_count and
> > > > > distinct_count:
> > > > >
> > > > > 1. Statistics::null_count, which is an optional null-count
> > > > > 2. ColumnIndex::null_counts, which is similar to
> > > Statistics::null_count,
> > > > > but storing
> > > > >     in page index
> > > > > 3. DataPageHeaderV2::num_nulls, which means "null values count" in
> a
> > > data
> > > > > page
> > > > > 4. Statistics::distinct_count, which is an optional distinct-count
> > > > >
> > > > > I've checked the implementation in Parquet-C++, Parquet-Java and
> > > > > parquet-rs, for
> > > > > null-count:
> > > > > On writer side:
> > > > > * Parquet-Java and Parquet-C++ would always write null_count, even
> > > > >    if the null_count is 0 or the column is a non-nullable column
> > > > > * Parquet-rs would not write null_count if null count is 0
> > previously.
> > > > This
> > > > > is likely to be
> > > > >    fixed in [1]
> > > > >
> > > > > The column-index would be similar.
> > > > >
> > > > > On reader side:
> > > > > * Parquet-java requires `null_count` to be set, otherwise it would
> > > regard
> > > > > the statistics as
> > > > >    "might contains null or not" [2]
> > > > > * Parquet-rs regard `num_nulls > 0` as has_nulls, and don't check
> the
> > > > > existence of null
> > > > >   [3]. The same properties is `num_nulls >= 0` in parquet-java [4]
> > > > >
> > > > > For num_nulls, I suggest:
> > > > > 1. Writer side should better write num_nulls / null_count even when
> > > > > num_nulls is
> > > > >     0 or column is not nullable
> > > > > 2. Reader should distinguish whether the null-count is set or not.
> > When
> > > > > reading a
> > > > >     file from parquet-rs. We can convert num-nulls = 0 when it's
> not
> > > set?
> > > > >
> > > > > distinct_count is more weird in this. I‘ve checked this and find
> > > there're
> > > > > merely
> > > > > implementations that use this. So I wonder
> > > > > 1. Would this be exact?
> > > > > 2. Is there any use-cases for this?
> > > > >
> > > > > [1] https://github.com/apache/arrow-rs/issues/6256
> > > > > [2]
> > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-java/blob/d4384d3f2e7703fab6363fd9cd80a001e9561db2/parquet-hadoop/src/main/java/org/apache/parquet/filter2/statisticslevel/StatisticsFilter.java#L93
> > > > > [3]
> > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/arrow-rs/blob/042d725888358c73cd2a0d58868ea5c4bad778f7/parquet/src/file/statistics.rs#L401
> > > > > [4]
> > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-java/blob/master/parquet-column/src/main/java/org/apache/parquet/column/statistics/Statistics.java#L532
> > > > >
> > > > > Best,
> > > > > Xuwei Fu
> > > > >
> > > >
> > >
> >
>

Reply via email to