Re: [DISCUSS](PARQUET-2249) Add nan_count to handle NaNs in statistics

Gang Wu Sat, 14 Mar 2026 08:57:42 -0700

Update:

Java POC is ready for IEEE 754 column order combined with nan_count:
https://github.com/apache/parquet-java/pull/3393


The spec PR was updated earlier to address all comments:
https://github.com/apache/parquet-format/pull/514

Really appreciate any review and feedback!

Best,
Gang




On Wed, Feb 11, 2026 at 4:24 PM Gang Wu <[email protected]> wrote:

> Hello all,
>
> I'm reaching out to help drive this long-running discussion—nearly
> three years now—towards a final resolution. With Jan's authorization,
> and my sincere thanks for his sustained effort, I want to help push
> this issue to the finish line.
>
> To recap, we have two primary proposals on how to handle NaNs in
> statistics and column indexes:
>
>   * IEEE 754 Total Order [1]: Proposes adding a new column order
> IEEE754TotalOrder for FLOAT/DOUBLE/FLOAT16. This provides a defined
> ordering for every float bit pattern, including NaNs and -0/+0,
> allowing writers to include NaNs in min/max and removing ambiguity for
> only-NaN pages.
>   * Combined Approach [2]: Proposes adopting the IEEE 754 total order
> alongside explicit nan_count(s) fields. This approach mandates the
> nan_count(s) when the new order is used and clarifies how to handle
> edge cases from legacy writers.
>
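A minimal sketch of the ordering the first bullet refers to, assuming Java and raw `long` bit patterns as input. The class and method names are hypothetical and not taken from the parquet-java PoC; the bit trick is a common way to implement the IEEE 754 totalOrder predicate and may differ from what either PR does.

```java
public final class TotalOrderSketch {
    private TotalOrderSketch() {}

    /** Maps raw IEEE 754 double bits to a key whose signed order matches totalOrder. */
    public static long sortKey(long rawBits) {
        // For values with the sign bit set, flip every non-sign bit so that
        // larger magnitudes (including negative NaNs) sort lower.
        return rawBits ^ ((rawBits >> 63) >>> 1);
    }

    /** Compares two raw bit patterns under the total order. */
    public static int compareBits(long a, long b) {
        return Long.compare(sortKey(a), sortKey(b));
    }

    /** Convenience overload; uses the raw bits so NaN payloads are kept. */
    public static int compare(double a, double b) {
        return compareBits(Double.doubleToRawLongBits(a), Double.doubleToRawLongBits(b));
    }

    public static void main(String[] args) {
        long negativeNaN = 0xFFF8000000000000L; // quiet NaN with the sign bit set
        long negativeInf = Double.doubleToRawLongBits(Double.NEGATIVE_INFINITY);
        System.out.println(compare(-0.0, 0.0) < 0);
        System.out.println(compare(Double.POSITIVE_INFINITY, Double.NaN) < 0);
        System.out.println(compareBits(negativeNaN, negativeInf) < 0);
    }
}
```

Unlike `Double.compare`, which treats every NaN as equal to itself and greater than +Infinity, this ordering distinguishes NaN bit patterns and places NaNs with the sign bit set below -Infinity.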
> Based on the recent comments, it appears the combined approach [2] is
> gaining consensus, although the IEEE 754 total order [1] still has
> strong advocates.
>
> I agree with the sentiment that technical direction should be made by
> consensus, not a vote. To that end, I'd like to solicit further
> feedback specifically on the combined approach [2] to see if we can
> achieve the necessary consensus to move forward now.
>
> I recall that the total order proposal [1] already has three PoC
> implementations. For the combined approach [2], I can draft a PoC in
> parquet-java, but to meet the two-implementation requirement, we would
> need one more contributor to step up.
>
> [1] https://github.com/apache/parquet-format/pull/221
> [2] https://github.com/apache/parquet-format/pull/514
>
> Best,
> Gang
>
>
> On Sat, Aug 16, 2025 at 1:59 AM Gijs Burghoorn <[email protected]>
> wrote:
> >
> > Hello Jan,
> >
> > Thank you for pushing this through. Apart from some smaller nits, we also
> > really like the current proposal.
> >
> > Thanks,
> > Gijs
> >
> > On Fri, Aug 15, 2025 at 3:33 PM Andrew Lamb <[email protected]>
> wrote:
> >
> > > I have started organizing a project [1] in arrow-rs's Parquet reader
> to try
> > > and implement this proposal.
> > >
> > > Hopefully that can be one of the two open source implementations needed.
> > >
> > > Thanks again for helping drive this along,
> > > Andrew
> > >
> > > [1] https://github.com/apache/arrow-rs/issues/8156
> > >
> > > On Wed, Aug 13, 2025 at 5:39 AM Jan Finis <[email protected]> wrote:
> > >
> > > > > I have now tagged
> > > > > <https://github.com/apache/parquet-format/pull/514#issuecomment-3182978173>
> > > > > the people that argued for total order in the initial PR. Let's see
> > > > > their response.
> > > >
> > > > If I understand the adoption process correctly, the next hurdle to
> > > getting
> > > > this adopted is two open
> > > > source (!) implementations proving its feasibility. We already had
> that
> > > for
> > > > IEEE total order. If we
> > > > prefer the solution with nan counts, we'll need it there as well. I
> > > myself
> > > > work on a proprietary
> > > > implementation, so I'm counting on others here :). Be prepared
> though,
> > > this
> > > > will likely take months
> > > > unless the interest in this topic has risen to a point where people
> are
> > > > eager to jump on the implementation
> > > > right away.
> > > >
> > > > So, I guess it will take some months of soaking time before any
> formal
> > > vote
> > > > can be done
> > > > (given that we reach consensus that this is what we want and we find
> > > people
> > > > for the implementations).
> > > >
> > > > Cheers,
> > > > Jan
> > > >
> > > > > On Wed, Aug 13, 2025 at 1:18 AM Ryan Blue <[email protected]> wrote:
> > > >
> > > > > Thanks, Jan. I also went through the combined proposal and it looks
> > > > mostly
> > > > > good to me.
> > > > >
> > > > > > First of all, to make it quick: Yes, the solution of having
> > > nan_counts
> > > > > *and* total order, which was brought up multiple times, does work
> and
> > > > > > solves more cases than either one alone.
> > > > >
> > > > > Great, then we have a solution for both filtering use cases and for
> > > > moving
> > > > > ahead with total order. And thanks to Andrew for suggesting this as
> > > well
> > > > on
> > > > > > the second PR. It also looks like this is something that
> > > Orson
> > > > is
> > > > > okay with given his comments on the latest PR.
> > > > >
> > > > > Is there anyone against the combined approach? I don't see a big
> > > downside
> > > > > for anyone. It is compatible with previous stats rules, has a NaN
> > > count,
> > > > > and allows using either type-specific order or total order.
> > > > >
> > > > > Assuming that this satisfies the big objections, I think we should
> wait
> > > > for
> > > > > a few days to make sure everyone has time to check out the new PR
> and
> > > > then
> > > > > vote to adopt it.
> > > > >
> > > > > Ryan
> > > > >
> > > > > On Mon, Aug 11, 2025 at 6:03 AM Andrew Lamb <
> [email protected]>
> > > > > wrote:
> > > > >
> > > > > > Thank you Jan -- I read through the new combined proposal, and I
> > > > thought
> > > > > it
> > > > > > looks good and addresses the feedback so far. I left some small
> style
> > > > > > suggestions, but nothing that is required from my perspective
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Sat, Aug 9, 2025 at 9:07 AM Jan Finis <[email protected]>
> wrote:
> > > > > >
> > > > > > > Hey Ryan,
> > > > > > >
> > > > > > > Thanks for chiming in. First of all, to make it quick: Yes, the
> > > > > solution
> > > > > > of
> > > > > > > having nan_counts *and* total order, which was brought up
> multiple
> > > > > times,
> > > > > > > > does work and solves more cases than either one alone.
> > > > > > >
> > > > > > > I strongly prefer continuing to discuss the merits of these
> > > > approaches
> > > > > > > > rather than trying to decide with a vote.
> > > > > > >
> > > > > > >
> > > > > > > In theory, I agree that it isn't good to silence a discussion
> by
> > > just
> > > > > > > voting for one possible solution and technical issues should be
> > > > > > discussed.
> > > > > > > However, please note that we have been circling on this for
> over
> > > two
> > > > > > years
> > > > > > > now, including an extended discussion that brought up all
> arguments
> > > > > > > multiple times. This is in stark contrast to the
> > > > > > > speed with which you guys work on the Iceberg spec, for
> example.
> > > > There,
> > > > > > you
> > > > > > > also do not discuss the merits of various solutions for
> multiple
> > > > years.
> > > > > > You
> > > > > > > just pick one and merge it after a *reasonable* time of
> discussion.
> > > > > > > If you had the speed we currently have here, nothing would get
> > > done.
> > > > > > Thus,
> > > > > > > I see this as a clear case of *"the perfect is the enemy of the
> > > > good"*.
> > > > > > > Yes, we can continue looking for the perfect solution,
> > > > > > > but that will likely lead to keeping us at the status quo,
> which is
> > > > the
> > > > > > > worst of them all.
> > > > > > >
> > > > > > > That being said, I'm also happy to create a PR which does both
> > > total
> > > > > > order
> > > > > > > and NaN counts; after all, I just want the issue solved and all
> > > these
> > > > > > > solutions are better than the status quo.
> > > > > > >
> > > > > > > > *As this was now suggested by at least three people, I guess it's
> > > > > > > > worth doing, so here you go:
> > > > > > > > https://github.com/apache/parquet-format/pull/514*
> > > > > > >
> > > > > > > With this, we should have PRs covering most of the solution
> space.
> > > > > > > (I'm refusing to create a PR with negative and positive
> nan_counts;
> > > > > > > nan_counts + total order has to suffice; the complexity
> madness has
> > > > to
> > > > > > stop
> > > > > > > somewhere)
> > > > > > > > I still believe that there were a number of people who already
> > > found
> > > > > > > nan_counts too complex and therefore wanted IEEE total order,
> and
> > > > these
> > > > > > > people may not like putting on extra complexity,
> > > > > > > but let's see, maybe some have also changed their opinion in
> the
> > > > > > meantime.
> > > > > > >
> > > > > > >
> > > > > > > *Given all this, we can also first do an informal vote where
> > > everyone
> > > > > can
> > > > > > > > vote for which of the three their favorite would be. Maybe a
> clear
> > > > > > favorite
> > > > > > > will emerge and then we can vote on this one.*
> > > > > > >
> > > > > > > But of course, we can also take some weeks to discuss the three
> > > > > > solutions,
> > > > > > > now that we have PRs for all of them. I just hope this won't
> make
> > > us
> > > > > > > continue for another 2 years, or an
> > > > > > > infinite stalemate where each solution is vetoed by a PMC
> member.
> > > > > > > (Sorry for becoming a bit cynical here; I have just spent way
> too
> > > > much
> > > > > > time
> > > > > > > of my life with double statistics at this point ;) ...)
> > > > > > >
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Jan
> > > > > > >
> > > > > > > > On Fri, Aug 8, 2025 at 11:38 PM Ryan Blue <[email protected]> wrote:
> > > > > > >
> > > > > > > > Regarding the process for this, I strongly prefer continuing
> to
> > > > > discuss
> > > > > > > the
> > > > > > > > merits of these approaches rather than trying to decide with
> a
> > > > vote.
> > > > > I
> > > > > > > > don't think it is a good practice to use a vote to decide on
> a
> > > > > > technical
> > > > > > > > direction. There are very few situations that warrant it and
> I
> > > > don't
> > > > > > > think
> > > > > > > > that this is one of them. While this issue has been open for
> a
> > > long
> > > > > > time,
> > > > > > > > that appears to be the result of it not being anyone's top
> > > priority
> > > > > > > rather
> > > > > > > > than indecision.
> > > > > > > >
> > > > > > > > For the technical merits of these approaches, I think that
> we can
> > > > > find
> > > > > > a
> > > > > > > > middle ground. I agree with Jan that when working with sorted
> > > > values,
> > > > > > we
> > > > > > > > need to know how NaN values were handled and that requires
> using
> > > a
> > > > > > > > well-defined order that includes NaN and its variations
> (because
> > > we
> > > > > > > should
> > > > > > > > not normalize). Using NaN count is not sufficient for
> ordering
> > > > rows.
> > > > > > > >
> > > > > > > > Gijs also brings up good points about how NaN values show up
> in
> > > > > actual
> > > > > > > > datasets: not just when used in place of null, but also as
> the
> > > > result
> > > > > > of
> > > > > > > > normal calculations on abnormal data, like `sqrt(-4.0)` or
> > > > > `log(-1.0)`.
> > > > > > > > Both of those present problems when mixed with valid data
> because
> > > > of
> > > > > > the
> > > > > > > > stats "poisoning" problem, where the range of valid data is
> > > usable
> > > > > > until
> > > > > > > a
> > > > > > > > single NaN is mixed in.
> > > > > > > >
> > > > > > > > Another issue is that NaN is error-prone because "regular"
> > > > comparison
> > > > > > is
> > > > > > > > always false:
> > > > > > > > ```
> > > > > > > > Math.log(-1.0) >= 2 => FALSE
> > > > > > > > Math.log(-1.0) < 2 => FALSE
> > > > > > > > 2 > Math.log(-1.0) => FALSE
> > > > > > > > ```
> > > > > > > >
> > > > > > > > As a result, Iceberg doesn't trust NaN values as either
> lower or
> > > > > upper
> > > > > > > > bounds because we don't want to go back to the code that
> produced
> > > > the
> > > > > > > value
> > > > > > > > to see what the comparison order was to determine whether NaN
> > > > values
> > > > > go
> > > > > > > > before or after others.
> > > > > > > >
> > > > > > > > Total order solves the second issue in theory, but regular
> > > > comparison
> > > > > > is
> > > > > > > > prevalent and not obvious to developers. And it also doesn't
> help
> > > > > when
> > > > > > > NaN
> > > > > > > > is used instead of null. So using total order is not
> sufficient
> > > for
> > > > > > data
> > > > > > > > skipping.
> > > > > > > >
> > > > > > > > I think the right compromise is to use `min`, `max`, and
> > > > `nan_count`
> > > > > > for
> > > > > > > > data skipping stats (where min and max cannot be NaN) and
> total
> > > > > > ordering
> > > > > > > > for sorting values. That satisfies the data skipping use
> cases
> > > and
> > > > > also
> > > > > > > > gives us an ordering of unaltered values that we can reason
> > > about.
> > > > > > > >
> > > > > > > > Does anyone think that doesn't work?
> > > > > > > >
> > > > > > > > Ryan
> > > > > > > >
> > > > > > > > On Fri, Aug 1, 2025 at 8:57 AM Gang Wu <[email protected]>
> wrote:
> > > > > > > >
> > > > > > > > > Thanks Jan for your endless effort on this!
> > > > > > > > >
> > > > > > > > > I'm in favor of simplicity and generalism. I think we have
> > > > already
> > > > > > > > debated
> > > > > > > > > a lot
> > > > > > > > > for `nan_count` in [1] and [2] is the reflection of those
> > > > > > discussions.
> > > > > > > > > Therefore
> > > > > > > > > I am inclined to start a vote for [2] unless there is a
> > > > > significantly
> > > > > > > > > better
> > > > > > > > > proposal.
> > > > > > > > >
> > > > > > > > > > I would encourage everyone interested in this discussion to
> > > attend
> > > > > the
> > > > > > > > > scheduled
> > > > > > > > > sync on Aug 6th (detailed below) to spread the word to the
> > > > broader
> > > > > > > > > community.
> > > > > > > > > If we can get a consensus on [2], I can help start the
> vote and
> > > > > move
> > > > > > > > > forward.
> > > > > > > > >
> > > > > > > > > > *Apache Parquet Community Sync Wednesday, August 6 · 10:00 – 11:00am*
> > > > > > > > > > *Time zone: America/Los_Angeles*
> > > > > > > > > > *Google Meet joining info Video call link:
> > > > > > > > > > https://meet.google.com/bhe-rvan-qjk*
> > > > > > > > >
> > > > > > > > > [1] https://github.com/apache/parquet-format/pull/196
> > > > > > > > > [2] https://github.com/apache/parquet-format/pull/221
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Gang
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Fri, Aug 1, 2025 at 6:16 PM Jan Finis <
> [email protected]>
> > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Gijs,
> > > > > > > > > >
> > > > > > > > > > Thank you for bringing up concrete points, I'm happy to
> > > discuss
> > > > > > them
> > > > > > > in
> > > > > > > > > > detail.
> > > > > > > > > >
> > > > > > > > > > NaNs are less common in the SQL world than in the
> DataFrame
> > > > world
> > > > > > > where
> > > > > > > > > > > NaNs were used for a long time to represent missing
> values.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > You could transcode between NULL and NaN before reading
> and
> > > > > writing
> > > > > > to
> > > > > > > > > > Parquet. You basically mention yourself that NaNs were
> used
> > > for
> > > > > > > missing
> > > > > > > > > > values, i.e., what is commonly a NULL, which wasn't
> > > available.
> > > > > So,
> > > > > > > > > > semantically, transcoding to NULL would even be the sane
> > > thing
> > > > to
> > > > > > do.
> > > > > > > > > Yes,
> > > > > > > > > > that will cost you some cycles, but should be a rather
> > > > > lightweight
> > > > > > > > > > operation in comparison to most other operations, so I
> would
> > > > > argue
> > > > > > > that
> > > > > > > > > it
> > > > > > > > > > won't totally ruin your performance. Similarly, why
> should
> > > > > Parquet
> > > > > > > play
> > > > > > > > > > along with a "hack" that was done in other frameworks
> due to
> > > > > > > > shortcomings
> > > > > > > > > > of those frameworks? So from a philosophical point of
> view, I
> > > > > think
> > > > > > > > > > supporting NaNs better is the wrong thing to do. Rather,
> we
> > > > > should
> > > > > > > be a
> > > > > > > > > > forcing function to align others to better behavior, so
> > > > applying a
> > > > > > bit
> > > > > > > > of
> > > > > > > > > > force might in the long run make people use NULLs also in
> > > > > > DataFrames.
> > > > > > > > > >
> > > > > > > > > > Of course, your argument also goes into the direction of
> > > > > > pragmatism:
> > > > > > > > If a
> > > > > > > > > > large part of the data science world uses NaNs to encode
> > > > missing
> > > > > > > > values,
> > > > > > > > > > then maybe Parquet should accept this de-facto standard
> > > rather
> > > > > than
> > > > > > > > > > fighting it. That is indeed a valid point. The weight of
> it
> > > is
> > > > > > > > debatable
> > > > > > > > > > and my personal conclusion is that it's still not worth
> it,
> > > as
> > > > > you
> > > > > > > can
> > > > > > > > > > transcode between NULLs and NaNs, but I do agree with its
> > > > > validity.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Since the proposal phrases it as a goal to work
> "regardless
> > > of
> > > > > how
> > > > > > > they
> > > > > > > > > > > order NaN w.r.t. other values" this statement feels
> > > > > out-of-place
> > > > > > to
> > > > > > > > me.
> > > > > > > > > > > Most hardware and most people don't care about total
> > > ordering
> > > > > and
> > > > > > > > > needing
> > > > > > > > > > > to take it into account while filtering using
> statistics
> > > > seems
> > > > > > like
> > > > > > > > > > > preferring the special case instead of the common case.
> > > > Almost
> > > > > > > no one
> > > > > > > > > > > filters for specific NaN value bit-patterns. SQL
> engines
> > > that
> > > > > > don't
> > > > > > > > > have
> > > > > > > > > > > IEEE total ordering as their default ordering for
> floats
> > > will
> > > > > > also
> > > > > > > > need
> > > > > > > > > > to
> > > > > > > > > > > do more special handling for this.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I disagree with the conclusion this statement draws. The
> > > > current
> > > > > > > > > behavior,
> > > > > > > > > > and nan_counts without total ordering, pose a real
> problem
> > > > here,
> > > > > > even
> > > > > > > > for
> > > > > > > > > > engines that don't care about bit patterns. I do agree
> that
> > > > most
> > > > > > > > database
> > > > > > > > > > engines, including the one I'm working on, do not care
> about
> > > > bit
> > > > > > > > patterns
> > > > > > > > > > and/or sign bits. However, how can our database engine
> know
> > > > > whether
> > > > > > > the
> > > > > > > > > > writer of a Parquet file saw it the same way? It can't.
> > > > > Therefore,
> > > > > > it
> > > > > > > > > > cannot know whether a writer, for example, ordered NaNs
> > > before
> > > > or
> > > > > > > after
> > > > > > > > > all
> > > > > > > > > > other numbers, or maybe ordered them by sign bit. So, if
> our
> > > > > > database
> > > > > > > > > > engine now sees a float column in sorting columns, it
> cannot
> > > > > apply
> > > > > > > any
> > > > > > > > > > optimization without a lot of special casing, as it
> doesn't
> > > > know
> > > > > > > > whether
> > > > > > > > > > NaNs will be before all other values, after all other
> values,
> > > > or
> > > > > > > maybe
> > > > > > > > > > both, depending on sign bit. It could apply contrived
> logic
> > > > that
> > > > > > > tries
> > > > > > > > to
> > > > > > > > > > infer where NaNs were placed from the NaN counts of the
> first
> > > > and
> > > > > > > last
> > > > > > > > > > page, but doing so will be a lot of ugly code that also
> feels
> > > > to
> > > > > be
> > > > > > > in
> > > > > > > > > the
> > > > > > > > > > wrong place. I.e., I don't want to need to load pages or
> the
> > > > page
> > > > > > > > index,
> > > > > > > > > > just to reason about a sort order.
> > > > > > > > > >
> > > > > > > > > > SQL engines that don't have
> > > > > > > > > > > IEEE total ordering as their default ordering for
> floats
> > > will
> > > > > > also
> > > > > > > > need
> > > > > > > > > > to
> > > > > > > > > > > do more special handling for this.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > This code, which I would indeed need to write for our
> engine,
> > > > is
> > > > > > > > > comparably
> > > > > > > > > > trivial. Simply choose the largest possible bit pattern
> as
> > > > > > comparison
> > > > > > > > for
> > > > > > > > > > upper bounds filtering for NaN, and the smallest
> possible bit
> > > > > > pattern
> > > > > > > > for
> > > > > > > > > > lower bounds. It's not more than a few lines of code that
> > > check
> > > > > > > > whether a
> > > > > > > > > > filter is NaN and then replace its value with the
> > > > highest/lowest
> > > > > > NaN
> > > > > > > > bit
> > > > > > > > > > pattern. It is similarly trivial to the special casing I
> need
> > > > to
> > > > > do
> > > > > > > > with
> > > > > > > > > > nan_counts, and it is way more trivial than the extra
> code I
> > > > > would
> > > > > > > need
> > > > > > > > > to
> > > > > > > > > > write for sorting columns, as depicted above.
> > > > > > > > > >
> > > > > > > > > > From a Polars perspective, having a `nan_count` and
> defining
> > > > what
> > > > > > > > > > > happens to the `min` and `max` statistics when a page
> > > > contains
> > > > > > only
> > > > > > > > > NaNs
> > > > > > > > > > is
> > > > > > > > > > > enough to allow for all predicate filtering. I think,
> but
> > > > > correct
> > > > > > > me
> > > > > > > > > if I
> > > > > > > > > > > am wrong, this is also enough for all SQL engines that
> > > don't
> > > > > use
> > > > > > > > total
> > > > > > > > > > > ordering.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > It's not fully enough, as depicted above. Sorting columns
> > > would
> > > > > > still
> > > > > > > > not
> > > > > > > > > > work properly.
> > > > > > > > > >
> > > > > > > > > > As for ways forward, I propose merging the `nan_count`
> and
> > > > `sort
> > > > > > > > > ordering`
> > > > > > > > > > > proposals into one to make one proposal
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Note that the initial reason for proposing IEEE total
> order
> > > was
> > > > > > that
> > > > > > > > > people
> > > > > > > > > > in the discussion threads found nan_counts to be too
> complex
> > > > and
> > > > > > too
> > > > > > > > much
> > > > > > > > > > > of an undeserving special case (re-read the discussion in the initial PR
> > > > > > > > > > > <https://github.com/apache/parquet-format/pull/196> to see the
> > > > > > > > > > > rationales).
> > > > > > > > > > So merging both together would go totally against the
> spirit
> > > of
> > > > > why
> > > > > > > > IEEE
> > > > > > > > > > total order was proposed. While it has further upsides,
> the
> > > > main
> > > > > > > reason
> > > > > > > > > was
> > > > > > > > > > indeed to *not have* nan_counts. If now the proposal
> would
> > > even
> > > > > go
> > > > > > to
> > > > > > > > > > positive and negative nan counts (i.e., even more
> > > complexity),
> > > > > this
> > > > > > > > would
> > > > > > > > > > go 180 degrees into the opposite direction of why people
> > > wanted
> > > > > > total
> > > > > > > > > order
> > > > > > > > > > in the first place.
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > > Jan
> > > > > > > > > >
> > > > > > > > > > > On Thu, Jul 31, 2025 at 11:23 PM Gijs Burghoorn <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hello Jan and others,
> > > > > > > > > > >
> > > > > > > > > > > First, let me preface by saying I am quite new here.
> So I
> > > > > > apologize
> > > > > > > > if
> > > > > > > > > > > there is some other better way to bring up these
> concerns.
> > > I
> > > > > > > > understand
> > > > > > > > > > it
> > > > > > > > > > > is very annoying to come in at the 11th hour and start
> > > > bringing
> > > > > > up
> > > > > > > a
> > > > > > > > > > bunch
> > > > > > > > > > > of concerns, but I would also like this to be done
> right. A
> > > > > > > colleague
> > > > > > > > > of
> > > > > > > > > > > mine brought up some concerns and alternative
> approaches in
> > > > the
> > > > > > > > GitHub
> > > > > > > > > > > thread; I will file some of the concerns here as a
> > > response.
> > > > > > > > > > >
> > > > > > > > > > > > Treating NaNs so specially is giving them attention
> they
> > > > > don't
> > > > > > > > > deserve.
> > > > > > > > > > > Most data sets do not contain NaNs. If a use case
> really
> > > > > requires
> > > > > > > > them
> > > > > > > > > > and
> > > > > > > > > > > needs filtering to ignore them, they can store NULL
> > > instead,
> > > > or
> > > > > > > > encode
> > > > > > > > > > them
> > > > > > > > > > > differently. I would prefer the average case over the
> > > special
> > > > > > case
> > > > > > > > > here.
> > > > > > > > > > >
> > > > > > > > > > > NaNs are less common in the SQL world than in the
> DataFrame
> > > > > world
> > > > > > > > where
> > > > > > > > > > > NaNs were used for a long time to represent missing
> values.
> > > > > They
> > > > > > > > still
> > > > > > > > > > > exist with different canonical representations and
> > > different
> > > > > sign
> > > > > > > > > bits. I
> > > > > > > > > > > agree it might not be correct semantically, but sadly
> that
> > > is
> > > > > the
> > > > > > > > world
> > > > > > > > > > we
> > > > > > > > > > > deal with. NumPy and Numba do not have missing data
> > > > > > functionality,
> > > > > > > > > people
> > > > > > > > > > > use NaNs there, and people definitely use that in their
> > > > > > analytical
> > > > > > > > > > > dataflows. Another point that was brought up in the GH
> > > > > discussion
> > > > > > > was
> > > > > > > > > > "what
> > > > > > > > > > > about infinity? You could argue that having infinity in
> > > > > > statistics
> > > > > > > is
> > > > > > > > > > > similarly unuseful as it's too wide of a bound". I
> would
> > > > argue
> > > > > > that
> > > > > > > > > > > infinity is very different as there is no discussion on
> > > what
> > > > > the
> > > > > > > > > ordering
> > > > > > > > > > > or pattern of infinity is. Everyone agrees that
> `min(1.0,
> > > > inf,
> > > > > > > -inf)
> > > > > > > > ==
> > > > > > > > > > > -inf` and each infinity only has a single bit pattern.
> > > > > > > > > > >
> > > > > > > > > > > > It gives a defined order to every bit pattern and
> thus
> > > > > yields a
> > > > > > > > total
> > > > > > > > > > > order, mathematically speaking, which has value by
> itself.
> > > > With
> > > > > > NaN
> > > > > > > > > > counts,
> > > > > > > > > > > it was still undefined how different bit patterns of
> NaNs
> > > > were
> > > > > > > > supposed
> > > > > > > > > > to
> > > > > > > > > > > be ordered, whether NaN was allowed to have a sign bit,
> > > etc.,
> > > > > > > risking
> > > > > > > > > > that
> > > > > > > > > > > different engines could come to different results while
> > > > > filtering
> > > > > > > or
> > > > > > > > > > > sorting values within a file.
> > > > > > > > > > >
> > > > > > > > > > > Since the proposal phrases it as a goal to work
> "regardless
> > > > of
> > > > > > how
> > > > > > > > they
> > > > > > > > > > > order NaN w.r.t. other values" this statement feels
> > > > > out-of-place
> > > > > > to
> > > > > > > > me.
> > > > > > > > > > > Most hardware and most people don't care about total
> > > ordering
> > > > > and
> > > > > > > > > needing
> > > > > > > > > > > to take it into account while filtering using
> statistics
> > > > seems
> > > > > > like
> > > > > > > > > > > preferring the special case instead of the common case.
> > > > Almost
> > > > > > > no one
> > > > > > > > > > > filters for specific NaN value bit-patterns. SQL
> engines
> > > that
> > > > > > don't
> > > > > > > > > have
> > > > > > > > > > > IEEE total ordering as their default ordering for
> floats
> > > will
> > > > > > also
> > > > > > > > need
> > > > > > > > > > to
> > > > > > > > > > > do more special handling for this.
> > > > > > > > > > >
> > > > > > > > > > > I also agree with my colleague that doing an approach
> that
> > > is
> > > > > 50%
> > > > > > > of
> > > > > > > > > the
> > > > > > > > > > > way there will make the barrier to improving it to
> what it
> > > > > > actually
> > > > > > > > > > should
> > > > > > > > > > > be later on much higher.
> > > > > > > > > > >
> > > > > > > > > > > As for ways forward, I propose merging the `nan_count`
> and
> > > > > `sort
> > > > > > > > > > ordering`
> > > > > > > > > > > proposals into one to make one proposal, as they are
> linked
> > > > > > > together,
> > > > > > > > > and
> > > > > > > > > > > moving forward with one without knowing what will
> happen to
> > > > the
> > > > > > > other
> > > > > > > > > > seems
> > > > > > > > > > > unwise. From a Polars perspective, having a
> `nan_count` and
> > > > > > > defining
> > > > > > > > > what
> > > > > > > > > > > happens to the `min` and `max` statistics when a page
> > > > contains
> > > > > > only
> > > > > > > > > NaNs
> > > > > > > > > > is
> > > > > > > > > > > enough to allow for all predicate filtering. I think,
> but
> > > > > correct
> > > > > > > me
> > > > > > > > > if I
> > > > > > > > > > > am wrong, this is also enough for all SQL engines that
> > > don't
> > > > > use
> > > > > > > > total
> > > > > > > > > > > ordering. But if you want to be impartial to the
> engine's
> > > > > > > > > floating-point
> > > > > > > > > > > ordering and allow engines with total ordering to do
> > > > inequality
> > > > > > > > filters
> > > > > > > > > > > when `nan_count > 0` you would need a
> `positive_nan_count`
> > > > and
> > > > > a
> > > > > > > > > > > `negative_nan_count`. I understand the downside with
> Thrift
> > > > > > > > complexity,
> > > > > > > > > > but
> > > > > > > > > > > introducing another sort order is also adding
> complexity
> > > just
> > > > > in
> > > > > > a
> > > > > > > > > > > different place.
> > > > > > > > > > >
> > > > > > > > > > > I would really like to see this move forward, so I hope
> > > these
> > > > > > > > concerns
> > > > > > > > > > help
> > > > > > > > > > > move it forward towards a solution that works for
> everyone.
> > > > > > > > > > >
> > > > > > > > > > > Kind regards,
> > > > > > > > > > > Gijs
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Jul 31, 2025 at 7:46 PM Andrew Lamb <
> > > > > > > [email protected]>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > I would also be in favor of starting a vote
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Jul 31, 2025 at 11:23 AM Jan Finis <
> > > > > [email protected]>
> > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > > As the author of both the IEEE754 total order
> > > > > > > > > > > > > > <https://github.com/apache/parquet-format/pull/221> PR and the earlier PR
> > > > > > > > > > > > > > that basically proposed `nan_count`
> > > > > > > > > > > > > > <https://github.com/apache/parquet-format/pull/196>, my current vote would
> > > > > > > > > > > > > > be for IEEE754 total order.
> > > > > > > > > > > > > > Consequently, I would like to request a formal vote for the PR introducing
> > > > > > > > > > > > > > IEEE754 total order (https://github.com/apache/parquet-format/pull/221),
> > > > > > > > > > > > > > if that is possible.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > My rationale:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >    - It's conceptually simpler. It's easier to explain. It's based on an
> > > > > > > > > > > > > >    IEEE-standardized order predicate.
> > > > > > > > > > > > > >    - There are already multiple implementations showing feasibility. This
> > > > > > > > > > > > > >    will likely make the adoption quicker.
> > > > > > > > > > > > > >    - It gives a defined order to every bit pattern and thus yields a total
> > > > > > > > > > > > > >    order, mathematically speaking, which has value by itself. With NaN
> > > > > > > > > > > > > >    counts, it was still undefined how different bit patterns of NaNs were
> > > > > > > > > > > > > >    supposed to be ordered, whether NaN was allowed to have a sign bit,
> > > > > > > > > > > > > >    etc., risking that different engines could come to different results
> > > > > > > > > > > > > >    while filtering or sorting values within a file.
> > > > > > > > > > > > > >    - It also solves sort order completely. With nan_counts only, it is
> > > > > > > > > > > > > >    still undefined whether NaNs should be sorted before or after all
> > > > > > > > > > > > > >    values (or both, depending on sign bit), so any file including NaNs
> > > > > > > > > > > > > >    could not really leverage sort order without being ambiguous.
> > > > > > > > > > > > > >    - It's less complex in Thrift. Having fields that only apply to a
> > > > > > > > > > > > > >    handful of data types is somewhat odd. If every type did this, we would
> > > > > > > > > > > > > >    have a plethora of non-generic fields in Thrift.
> > > > > > > > > > > > > >    - Treating NaNs so specially is giving them attention they don't
> > > > > > > > > > > > > >    deserve. Most data sets do not contain NaNs. If a use case really
> > > > > > > > > > > > > >    requires them and needs filtering to ignore them, they can store NULL
> > > > > > > > > > > > > >    instead, or encode them differently. I would prefer the average case
> > > > > > > > > > > > > >    over the special case here.
> > > > > > > > > > > > > >    - The majority of the people discussing this so far seem to favor total
> > > > > > > > > > > > > >    order.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > Jan
> > > > > > > > > > > > >
> > > > > > > > > > > > > > On Sat, Jul 26, 2025 at 5:38 PM Gang Wu <[email protected]> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > As this discussion has been open for more than two years, I’d like to
> > > > > > > > > > > > > > > bump up this thread again to update the progress and collect feedback.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > *Background*
> > > > > > > > > > > > > > > • Today Parquet’s min/max stats and page index omit NaNs entirely.
> > > > > > > > > > > > > > > • Engines can’t safely prune floating values because they know nothing
> > > > > > > > > > > > > > > about NaNs.
> > > > > > > > > > > > > > > • Column index is disabled if any page contains only NaNs.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > There are two active proposals, as below:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > *Proposal A - IEEE754TotalOrder* (from the PR [1])
> > > > > > > > > > > > > > > • Define a new ColumnOrder to include +0, -0 and all NaN bit-patterns.
> > > > > > > > > > > > > > > • Stats and column index store NaNs if they appear.
> > > > > > > > > > > > > > > • Three PoC impls are ready: arrow-rs [2], duckdb [3] and parquet-java [4].
> > > > > > > > > > > > > > > • For more context on this approach, please refer to the discussion in [5].
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > *Proposal B - add nan_count* (from a comment [6] to [1])
> > > > > > > > > > > > > > > • Add `nan_count` to stats and a `nan_counts` list to column index.
> > > > > > > > > > > > > > > • For all-NaNs cases, write NaN to min/max and use nan_count to
> > > > > > > > > > > > > > > distinguish.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Both solutions have pros and cons but are way better than the status quo
> > > > > > > > > > > > > > > today.
> > > > > > > > > > > > > > > Please share your thoughts on the two proposals above, or maybe come up
> > > > > > > > > > > > > > > with better alternatives. We need to reach consensus on one proposal and
> > > > > > > > > > > > > > > move forward.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > [1]
> > > https://github.com/apache/parquet-format/pull/221
> > > > > > > > > > > > > > [2] https://github.com/apache/arrow-rs/pull/7408
> > > > > > > > > > > > > > [3]
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://github.com/duckdb/duckdb/compare/main...Mytherin:duckdb:ieeeorder
> > > > > > > > > > > > > > [4]
> https://github.com/apache/parquet-java/pull/3191
> > > > > > > > > > > > > > [5]
> > > https://github.com/apache/parquet-format/pull/196
> > > > > > > > > > > > > > [6]
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://github.com/apache/parquet-format/pull/221#issuecomment-2931376077
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > Gang
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Tue, Mar 28, 2023 at 4:22 PM Jan Finis <[email protected]> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Dear contributors,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > My PR has now gathered comments for a week and the gist of all open
> > > > > > > > > > > > > > > > issues is the question of how to encode pages/column chunks that contain
> > > > > > > > > > > > > > > > only NaNs. There are different suggestions and I don't see one common
> > > > > > > > > > > > > > > > favorite yet.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I have outlined three alternatives of how we can handle these and I want
> > > > > > > > > > > > > > > > us to reach a conclusion here, so I can update my PR accordingly and move
> > > > > > > > > > > > > > > > on with it. As this is my first contribution to Parquet, I don't know the
> > > > > > > > > > > > > > > > decision processes here. Do we vote? Is there a single decision maker or
> > > > > > > > > > > > > > > > a group of decision makers? *Please let me know how to come to a
> > > > > > > > > > > > > > > > conclusion here; what are the next steps?*
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > For reference, here are the three alternatives I pointed out. You can
> > > > > > > > > > > > > > > > find a detailed description of their pros and cons in my comment:
> > > > > > > > > > > > > > > > https://github.com/apache/parquet-format/pull/196#issuecomment-1486416762
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 1. My initial proposal, i.e., encoding only-NaN pages by min=max=NaN.
> > > > > > > > > > > > > > > > 2. Adding `num_values` to the ColumnIndex, to make it symmetric with
> > > > > > > > > > > > > > > > Statistics in pages & `ColumnMetaData` and to enable the computation
> > > > > > > > > > > > > > > > `num_values - null_count - nan_count == 0`
> > > > > > > > > > > > > > > > 3. Adding a `nan_pages` bool list to the column index, which indicates
> > > > > > > > > > > > > > > > whether a page contains only NaNs
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Cheers
> > > > > > > > > > > > > > > Jan Finis
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
>
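
For readers following the quoted thread, here is a minimal sketch of the two
ideas discussed above: ordering doubles by IEEE 754 total order (so that -0
sorts before +0 and NaNs get a defined position depending on their sign bit),
and the `num_values - null_count - nan_count == 0` check from alternative 2.
This is only an illustration under stated assumptions: the function names
`from_bits`, `total_order_key` and `has_only_nans_or_nulls` are hypothetical
and are not taken from the spec PRs or from any of the PoC implementations.

```python
import struct


def from_bits(bits: int) -> float:
    """Build a double from its raw 64-bit pattern (hypothetical helper)."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def total_order_key(x: float) -> int:
    """Integer key whose natural order follows IEEE 754 total order for doubles."""
    bits = struct.unpack("<q", struct.pack("<d", x))[0]  # signed 64-bit view
    if bits < 0:
        # Sign bit set: flip the other 63 bits so larger magnitudes sort lower.
        bits ^= 0x7FFFFFFFFFFFFFFF
    return bits


def has_only_nans_or_nulls(num_values: int, null_count: int, nan_count: int) -> bool:
    """The check quoted in the thread: num_values - null_count - nan_count == 0."""
    return num_values - null_count - nan_count == 0


NEG_NAN = from_bits(0xFFF8000000000000)  # quiet NaN with the sign bit set
POS_NAN = from_bits(0x7FF8000000000000)  # quiet NaN with the sign bit clear

values = [1.0, POS_NAN, -0.0, float("-inf"), NEG_NAN, 0.0, float("inf"), -1.0]
# Sorting by the key places negative NaNs first and positive NaNs last.
ordered = sorted(values, key=total_order_key)
```

Under this key, min/max statistics are well defined even when NaNs are
present, which is the property Proposal A relies on; the counting check is
what an explicit `nan_count` would let a reader evaluate per page.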
