Re: [DISCUSS](PARQUET-2249) Add nan_count to handle NaNs in statistics

Gang Wu Fri, 01 Aug 2025 08:57:51 -0700

Thanks Jan for your endless effort on this!

I'm in favor of simplicity and generality. We have already debated
`nan_count` at length in [1], and [2] reflects those discussions.
Therefore I am inclined to start a vote for [2] unless there is a
significantly better proposal.


I suggest that everyone interested in this discussion attend the scheduled
sync on Aug 6th (details below) to spread the word to the broader community.
If we can reach consensus on [2], I can help start the vote and move forward.

*Apache Parquet Community Sync: Wednesday, August 6 · 10:00 – 11:00am*
*Time zone: America/Los_Angeles*
*Google Meet video call link:* https://meet.google.com/bhe-rvan-qjk

[1] https://github.com/apache/parquet-format/pull/196
[2] https://github.com/apache/parquet-format/pull/221

Best,
Gang


On Fri, Aug 1, 2025 at 6:16 PM Jan Finis <[email protected]> wrote:

> Hi Gijs,
>
> Thank you for bringing up concrete points, I'm happy to discuss them in
> detail.
>
> > NaNs are less common in the SQL world than in the DataFrame world where
> > NaNs were used for a long time to represent missing values.
>
>
> You could transcode between NULL and NaN before reading and writing to
> Parquet. You basically mention yourself that NaNs were used for missing
> values, i.e., what is commonly a NULL, which wasn't available. So,
> semantically, transcoding to NULL would even be the sane thing to do. Yes,
> that will cost you some cycles, but should be a rather lightweight
> operation in comparison to most other operations, so I would argue that it
> won't totally ruin your performance. Similarly, why should Parquet play
> along with a "hack" that was done in other frameworks due to shortcomings
> of those frameworks? So from a philosophical point of view, I think
> supporting NaNs better is the wrong thing to do. Rather, we should be a
> forcing function to align others to better behavior, so applying a bit of
> force might in the long run make people use NULLs also in DataFrames.
>
> Of course, your argument also goes into the direction of pragmatism: If a
> large part of the data science world uses NaNs to encode missing values,
> then maybe Parquet should accept this de-facto standard rather than
> fighting it. That is indeed a valid point. The weight of it is debatable
> and my personal conclusion is that it's still not worth it, as you can
> transcode between NULLs and NaNs, but I do agree with its validity.
>
>
> > Since the proposal phrases it as a goal to work "regardless of how they
> > order NaN w.r.t. other values" this statement feels out-of-place to me.
> > Most hardware and most people don't care about total ordering and needing
> > to take it into account while filtering using statistics seems like
> > preferring the special case instead of the common case. Almost no one
> > filters for specific NaN value bit-patterns. SQL engines that don't have
> > IEEE total ordering as their default ordering for floats will also need to
> > do more special handling for this.
>
>
> I disagree with the conclusion this statement draws. The current behavior,
> and nan_counts without total ordering, pose a real problem here, even for
> engines that don't care about bit patterns. I do agree that most database
> engines, including the one I'm working on, do not care about bit patterns
> and/or sign bits. However, how can our database engine know whether the
> writer of a Parquet file saw it the same way? It can't. Therefore, it
> cannot know whether a writer, for example, ordered NaNs before or after all
> other numbers, or maybe ordered them by sign bit. So, if our database
> engine now sees a float column in sorting columns, it cannot apply any
> optimization without a lot of special casing, as it doesn't know whether
> NaNs will be before all other values, after all other values, or maybe
> both, depending on sign bit. It could apply contrived logic that tries to
> infer where NaNs were placed from the NaN counts of the first and last
> page, but doing so would take a lot of ugly code that also feels like it is
> in the wrong place. That is, I don't want to have to load pages or the page
> index just to reason about a sort order.
>
> > SQL engines that don't have
> > IEEE total ordering as their default ordering for floats will also need to
> > do more special handling for this.
>
>
> This code, which I would indeed need to write for our engine, is
> comparatively trivial. Simply choose the largest possible NaN bit pattern
> as the comparison value for upper-bound filtering on NaN, and the smallest
> possible NaN bit pattern for lower bounds. It's not more than a few lines of
> code that check whether a filter value is NaN and then replace it with the
> highest/lowest NaN bit pattern. It is about as trivial as the special casing
> I would need with nan_counts, and far more trivial than the extra code I
> would need to write for sorting columns, as described above.
>
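
A minimal sketch of the bound replacement described above, in Python for
brevity. The helper names (`key_from_bits`, `total_order_key`,
`filter_bound_key`) are hypothetical and not taken from any Parquet
implementation. The sketch assumes 64-bit floats and maps each bit pattern
to an integer whose natural order matches IEEE 754 totalOrder.

```python
import math
import struct

MAGNITUDE_MASK = 0x7FFFFFFFFFFFFFFF  # all bits except the sign bit


def key_from_bits(bits: int) -> int:
    """Map a raw 64-bit float pattern to an int that sorts in totalOrder."""
    if bits >= 1 << 63:
        # Sign bit set: reinterpret as a negative two's-complement integer,
        # then flip the magnitude bits so larger magnitudes sort lower.
        return (bits - (1 << 64)) ^ MAGNITUDE_MASK
    return bits


def total_order_key(x: float) -> int:
    """totalOrder sort key for a Python float (an IEEE 754 double)."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    return key_from_bits(bits)


# Smallest and largest values in totalOrder: negative and positive NaN
# with every payload bit set.
LOWEST_NAN_KEY = key_from_bits(0xFFFFFFFFFFFFFFFF)
HIGHEST_NAN_KEY = key_from_bits(0x7FFFFFFFFFFFFFFF)


def filter_bound_key(value: float, upper: bool) -> int:
    """Key to compare against min/max statistics for a filter literal.

    A NaN literal is replaced by the highest NaN pattern for upper bounds
    and by the lowest NaN pattern for lower bounds.
    """
    if math.isnan(value):
        return HIGHEST_NAN_KEY if upper else LOWEST_NAN_KEY
    return total_order_key(value)
```

Under these assumptions, an engine that does not distinguish NaN bit
patterns only needs the `math.isnan` branch as its special case; every
other comparison goes through the same integer key.
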
> > From a Polars perspective, having a `nan_count` and defining what
> > happens to the `min` and `max` statistics when a page contains only NaNs is
> > enough to allow for all predicate filtering. I think, but correct me if I
> > am wrong, this is also enough for all SQL engines that don't use total
> > ordering.
>
>
> It's not fully enough, as described above. Sorting columns would still not
> work properly.
>
> > As for ways forward, I propose merging the `nan_count` and `sort ordering`
> > proposals into one to make one proposal
>
>
> Note that the initial reason for proposing IEEE total order was that people
> in the discussion threads found nan_counts to be too complex and too much
> of an undeserving special case (re-read the discussion in the initial PR
> <https://github.com/apache/parquet-format/pull/196> to see the rationales).
> So merging both together would go totally against the spirit of why IEEE
> total order was proposed. While it has further upsides, the main reason was
> indeed to *not have* nan_counts. If the proposal now went even further, to
> positive and negative NaN counts (i.e., even more complexity), it would run
> directly counter to why people wanted total order in the first place.
>
> Cheers,
> Jan
>
> On Thu, Jul 31, 2025 at 11:23 PM Gijs Burghoorn
> <[email protected]> wrote:
>
> > Hello Jan and others,
> >
> > First, let me preface by saying I am quite new here. So I apologize if
> > there is some other better way to bring up these concerns. I understand it
> > is very annoying to come in at the 11th hour and start bringing up a bunch
> > of concerns, but I would also like this to be done right. A colleague of
> > mine brought up some concerns and alternative approaches in the GitHub
> > thread; I will file some of the concerns here as a response.
> >
> > > Treating NaNs so specially is giving them attention they don't deserve.
> > > Most data sets do not contain NaNs. If a use case really requires them and
> > > needs filtering to ignore them, they can store NULL instead, or encode them
> > > differently. I would prefer the average case over the special case here.
> >
> > NaNs are less common in the SQL world than in the DataFrame world where
> > NaNs were used for a long time to represent missing values. They still
> > exist with different canonical representations and different sign bits. I
> > agree it might not be correct semantically, but sadly that is the world we
> > deal with. NumPy and Numba do not have missing data functionality, people
> > use NaNs there, and people definitely use that in their analytical
> > dataflows. Another point that was brought up in the GH discussion was "what
> > about infinity? You could argue that having infinity in statistics is
> > similarly unuseful as it's too wide of a bound". I would argue that
> > infinity is very different as there is no discussion on what the ordering
> > or pattern of infinity is. Everyone agrees that `min(1.0, inf, -inf) ==
> > -inf` and each infinity only has a single bit pattern.
> >
> > > It gives a defined order to every bit pattern and thus yields a total
> > > order, mathematically speaking, which has value by itself. With NaN counts,
> > > it was still undefined how different bit patterns of NaNs were supposed to
> > > be ordered, whether NaN was allowed to have a sign bit, etc., risking that
> > > different engines could come to different results while filtering or
> > > sorting values within a file.
> >
> > Since the proposal phrases it as a goal to work "regardless of how they
> > order NaN w.r.t. other values" this statement feels out-of-place to me.
> > Most hardware and most people don't care about total ordering and needing
> > to take it into account while filtering using statistics seems like
> > preferring the special case instead of the common case. Almost no one
> > filters for specific NaN value bit-patterns. SQL engines that don't have
> > IEEE total ordering as their default ordering for floats will also need to
> > do more special handling for this.
> >
> > I also agree with my colleague that doing an approach that is 50% of the
> > way there will make the barrier to improving it to what it actually should
> > be later on much higher.
> >
> > As for ways forward, I propose merging the `nan_count` and `sort ordering`
> > proposals into one to make one proposal, as they are linked together, and
> > moving forward with one without knowing what will happen to the other seems
> > unwise. From a Polars perspective, having a `nan_count` and defining what
> > happens to the `min` and `max` statistics when a page contains only NaNs is
> > enough to allow for all predicate filtering. I think, but correct me if I
> > am wrong, this is also enough for all SQL engines that don't use total
> > ordering. But if you want to be impartial to the engine's floating-point
> > ordering and allow engines with total ordering to do inequality filters
> > when `nan_count > 0` you would need a `positive_nan_count` and a
> > `negative_nan_count`. I understand the downside with Thrift complexity, but
> > introducing another sort order is also adding complexity just in a
> > different place.
> >
> > I would really like to see this move forward, so I hope these concerns help
> > move it forward towards a solution that works for everyone.
> >
> > Kind regards,
> > Gijs
> >
> >
> > On Thu, Jul 31, 2025 at 7:46 PM Andrew Lamb <[email protected]>
> > wrote:
> >
> > > I would also be in favor of starting a vote
> > >
> > > On Thu, Jul 31, 2025 at 11:23 AM Jan Finis <[email protected]> wrote:
> > >
> > > > As the author of both the IEEE754 total order
> > > > <https://github.com/apache/parquet-format/pull/221> PR and the earlier PR
> > > > that basically proposed `nan_count`
> > > > <https://github.com/apache/parquet-format/pull/196>, my current vote would
> > > > be for IEEE754 total order.
> > > > Consequently, I would like to request a formal vote for the PR introducing
> > > > IEEE754 total order (https://github.com/apache/parquet-format/pull/221),
> > > > if that is possible.
> > > >
> > > > My Rationales:
> > > >
> > > >    - It's conceptually simpler. It's easier to explain. It's based on an
> > > >    IEEE-standardized order predicate.
> > > >    - There are already multiple implementations showing feasibility. This
> > > >    will likely make the adoption quicker.
> > > >    - It gives a defined order to every bit pattern and thus yields a total
> > > >    order, mathematically speaking, which has value by itself. With NaN
> > > >    counts, it was still undefined how different bit patterns of NaNs were
> > > >    supposed to be ordered, whether NaN was allowed to have a sign bit, etc.,
> > > >    risking that different engines could come to different results while
> > > >    filtering or sorting values within a file.
> > > >    - It also solves sort order completely. With nan_counts only, it is
> > > >    still undefined whether NaNs should be sorted before or after all values
> > > >    (or both, depending on sign bit), so any file including NaNs could not
> > > >    really leverage sort order without being ambiguous.
> > > >    - It's less complex in Thrift. Having fields that only apply to a
> > > >    handful of data types is somewhat weird. If every type did this, we would
> > > >    have a plethora of non-generic fields in Thrift.
> > > >    - Treating NaNs so specially is giving them attention they don't
> > > >    deserve. Most data sets do not contain NaNs. If a use case really requires
> > > >    them and needs filtering to ignore them, they can store NULL instead,
> > > >    or encode them differently. I would prefer the average case over the
> > > >    special case here.
> > > >    - The majority of the people discussing this so far seem to favor total
> > > >    order.
> > > >
> > > > Cheers,
> > > > Jan
> > > >
> > > > On Sat, Jul 26, 2025 at 5:38 PM Gang Wu <[email protected]> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > As this discussion has been open for more than two years, I’d like to
> > > > > bump this thread again to update the progress and collect feedback.
> > > > >
> > > > > *Background*
> > > > > • Today Parquet’s min/max stats and page index omit NaNs entirely.
> > > > > • Engines can’t safely prune floating-point values because they know
> > > > > nothing about NaNs.
> > > > > • Column index is disabled if any page contains only NaNs.
> > > > >
> > > > > There are two active proposals as below:
> > > > >
> > > > > *Proposal A - IEEE754TotalOrder* (from the PR [1])
> > > > > • Define a new ColumnOrder to include +0, –0 and all NaN bit‐patterns.
> > > > > • Stats and column index store NaNs if they appear.
> > > > > • Three PoC impls are ready: arrow-rs [2], duckdb [3] and parquet-java [4].
> > > > > • For more context of this approach, please refer to discussion in [5].
> > > > >
> > > > > *Proposal B - add nan_count* (from a comment [6] to [1])
> > > > > • Add `nan_count` to stats and a `nan_counts` list to column index.
> > > > > • For all‐NaNs cases, write NaN to min/max and use nan_count to
> > > > > distinguish.
> > > > >
> > > > > Both solutions have pros and cons but are way better than the status quo
> > > > > today.
> > > > > Please share your thoughts on the two proposals above, or maybe come up
> > > > > with better alternatives. We need consensus on one proposal to move
> > > > > forward.
> > > > >
> > > > > [1] https://github.com/apache/parquet-format/pull/221
> > > > > [2] https://github.com/apache/arrow-rs/pull/7408
> > > > > [3] https://github.com/duckdb/duckdb/compare/main...Mytherin:duckdb:ieeeorder
> > > > > [4] https://github.com/apache/parquet-java/pull/3191
> > > > > [5] https://github.com/apache/parquet-format/pull/196
> > > > > [6] https://github.com/apache/parquet-format/pull/221#issuecomment-2931376077
> > > > >
> > > > > Best,
> > > > > Gang
> > > > >
> > > > > On Tue, Mar 28, 2023 at 4:22 PM Jan Finis <[email protected]> wrote:
> > > > >
> > > > > > Dear contributors,
> > > > > >
> > > > > > My PR has now gathered comments for a week and the gist of all open
> > > > > > issues is the question of how to encode pages/column chunks that contain
> > > > > > only NaNs. There are different suggestions and I don't see one common
> > > > > > favorite yet.
> > > > > >
> > > > > > I have outlined three alternatives of how we can handle these and I want
> > > > > > us to reach a conclusion here, so I can update my PR accordingly and move
> > > > > > on with it. As this is my first contribution to parquet, I don't know the
> > > > > > decision processes here. Do we vote? Is there a single decision maker or
> > > > > > a group of decision makers? *Please let me know how to come to a
> > > > > > conclusion here; what are the next steps?*
> > > > > >
> > > > > > For reference, here are the three alternatives I pointed out. You can
> > > > > > find a detailed description of their pros and cons in my comment:
> > > > > > https://github.com/apache/parquet-format/pull/196#issuecomment-1486416762
> > > > > >
> > > > > > 1. My initial proposal, i.e., encoding only-NaN pages by min=max=NaN.
> > > > > > 2. Adding `num_values` to the ColumnIndex, to make it symmetric with
> > > > > > Statistics in pages & `ColumnMetaData` and to enable the computation
> > > > > > `num_values - null_count - nan_count == 0`
> > > > > > 3. Adding a `nan_pages` bool list to the column index, which indicates
> > > > > > whether a page contains only NaNs
> > > > > >
> > > > > >
> > > > > > Cheers
> > > > > > Jan Finis
> > > > > >
> > > > >
> > > >
> > >
> >
>
