Re: [DISCUSS](PARQUET-2249) Add nan_count to handle NaNs in statistics

Gang Wu Fri, 24 Apr 2026 08:50:16 -0700

Update on the progress of PARQUET-2249.

We now have two complete PoC implementations for the combined IEEE 754
total order and nan_count approach:
- Java: https://github.com/apache/parquet-java/pull/3393
- Rust: https://github.com/apache/arrow-rs/pull/9619 (Thanks Ed!)

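For readers joining the thread here, the ordering half of the combined
approach can be sketched roughly as follows. This is an illustrative
sketch only, not code from either PoC; the class and method names
(TotalOrderSketch, sortKey, compare) are hypothetical. It assumes the
usual bit trick for comparing doubles in IEEE 754 total order.

```java
final class TotalOrderSketch {
    /** Maps raw IEEE 754 double bits to a long whose signed order matches the total order. */
    static long sortKey(long rawBits) {
        // When the sign bit is set, flip the 63 magnitude bits so that larger
        // magnitudes sort lower; non-negative values are left unchanged.
        return rawBits ^ ((rawBits >> 63) & Long.MAX_VALUE);
    }

    /** Orders by bit pattern: -NaN < -Inf < ... < -0.0 < +0.0 < ... < +Inf < +NaN. */
    static int compare(double a, double b) {
        return Long.compare(
                sortKey(Double.doubleToRawLongBits(a)),
                sortKey(Double.doubleToRawLongBits(b)));
    }

    public static void main(String[] args) {
        System.out.println(compare(-0.0, 0.0) < 0);                            // true
        System.out.println(compare(Double.POSITIVE_INFINITY, Double.NaN) < 0); // true
    }
}
```

Unlike Double.compare, which treats every NaN as one value larger than
everything else, this keeps distinct NaN bit patterns and their sign bits
apart, which is the property the total order proposal relies on.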

The spec PR is available here:
https://github.com/apache/parquet-format/pull/514

We have also added a test file to parquet-testing for interoperability
tests, which has been verified by both parquet-java and arrow-rs:
https://github.com/apache/parquet-testing/pull/104

I'd like to encourage everyone to take another look at the current proposal
and implementation. Any feedback or suggestions are welcome. If there are
no further objections, I will move forward with a formal vote.

Best regards,
Gang

On Mon, Mar 16, 2026 at 11:30 AM Gang Wu <[email protected]> wrote:

> Thanks Zehua! Really appreciate it!
>
> On Mon, Mar 16, 2026 at 10:40 AM Zehua Zou <[email protected]> wrote:
>
>> Hello Gang and others,
>>
>> I am willing to implement the C++ POC.
>>
>>
>>
>> > On Mar 14, 2026, at 23:56, Gang Wu <[email protected]> wrote:
>> >
>> > Update:
>> >
>> > Java POC is ready for IEEE 754 column order combined with nan_count:
>> > https://github.com/apache/parquet-java/pull/3393
>> >
>> > The spec PR has been updated earlier to address all comments:
>> > https://github.com/apache/parquet-format/pull/514
>> >
>> > Really appreciate any review and feedback!
>> >
>> > Best,
>> > Gang
>> >
>> >
>> >
>> >
>> > On Wed, Feb 11, 2026 at 4:24 PM Gang Wu <[email protected]> wrote:
>> >
>> >> Hello all,
>> >>
>> >> I'm reaching out to help drive this long-running discussion—nearly
>> >> three years now—towards a final resolution. With Jan's authorization,
>> >> and my sincere thanks for his sustained effort, I want to help push
>> >> this issue to the finish line.
>> >>
>> >> To recap, we have two primary proposals on how to handle NaNs in
>> >> statistics and column indexes:
>> >>
>> >> * IEEE 754 Total Order [1]: Proposes adding a new column order
>> >> IEEE754TotalOrder for FLOAT/DOUBLE/FLOAT16. This provides a defined
>> >> ordering for every float bit pattern, including NaNs and -0/+0,
>> >> allowing writers to include NaNs in min/max and removing ambiguity for
>> >> only-NaN pages.
>> >> * Combined Approach [2]: Proposes adopting the IEEE 754 total order
>> >> alongside explicit nan_count(s) fields. This approach mandates the
>> >> nan_count(s) when the new order is used and clarifies how to handle
>> >> edge cases from legacy writers.
>> >>
>> >> Based on the recent comments, it appears the combined approach [2] is
>> >> gaining consensus, although the IEEE 754 total order [1] still has
>> >> strong advocates.
>> >>
>> >> I agree with the sentiment that technical direction should be made by
>> >> consensus, not a vote. To that end, I'd like to solicit further
>> >> feedback specifically on the combined approach [2] to see if we can
>> >> achieve the necessary consensus to move forward now.
>> >>
>> >> I recall that the total order proposal [1] already has three PoC
>> >> implementations. For the combined approach [2], I can draft a PoC in
>> >> parquet-java, but to meet the two-implementation requirement, we would
>> >> need one more contributor to step up.
>> >>
>> >> [1] https://github.com/apache/parquet-format/pull/221
>> >> [2] https://github.com/apache/parquet-format/pull/514
>> >>
>> >> Best,
>> >> Gang
>> >>
>> >>
>> >> On Sat, Aug 16, 2025 at 1:59 AM Gijs Burghoorn <[email protected]> wrote:
>> >>>
>> >>> Hello Jan,
>> >>>
>> >>> Thank you for pushing this through. Apart from some smaller nits, we
>> >>> also really like the current proposal.
>> >>>
>> >>> Thanks,
>> >>> Gijs
>> >>>
>> >>> On Fri, Aug 15, 2025 at 3:33 PM Andrew Lamb <[email protected]>
>> >> wrote:
>> >>>
>> >>>> I have started organizing a project [1] in arrow-rs's Parquet reader
>> >>>> to try and implement this proposal.
>> >>>>
>> >>>> Hopefully that can be one of the two open source implementations needed.
>> >>>>
>> >>>> Thanks again for helping drive this along,
>> >>>> Andrew
>> >>>>
>> >>>> [1] https://github.com/apache/arrow-rs/issues/8156
>> >>>>
>> >>>> On Wed, Aug 13, 2025 at 5:39 AM Jan Finis <[email protected]> wrote:
>> >>>>
>> >>>>> I have now tagged
>> >>>>> <https://github.com/apache/parquet-format/pull/514#issuecomment-3182978173>
>> >>>>> the people that argued for total order in the initial PR. Let's see
>> >>>>> their response.
>> >>>>>
>> >>>>> If I understand the adoption process correctly, the next hurdle to
>> >>>> getting
>> >>>>> this adopted is two open
>> >>>>> source (!) implementations proving its feasibility. We already had
>> >> that
>> >>>> for
>> >>>>> IEEE total order. If we
>> >>>>> prefer the solution with nan counts, we'll need it there as well. I
>> >>>> myself
>> >>>>> work on a proprietary
>> >>>>> implementation, so I'm counting on others here :). Be prepared
>> >> though,
>> >>>> this
>> >>>>> will likely take months
>> >>>>> unless the interest in this topic has risen to a point where people
>> >> are
>> >>>>> eager to jump on the implementation
>> >>>>> right away.
>> >>>>>
>> >>>>> So, I guess it will take some months of soaking time before any
>> >> formal
>> >>>> vote
>> >>>>> can be done
>> >>>>> (given that we reach consensus that this is what we want and we find
>> >>>> people
>> >>>>> for the implementations).
>> >>>>>
>> >>>>> Cheers,
>> >>>>> Jan
>> >>>>>
>> >>>>> On Wed, Aug 13, 2025 at 1:18 AM Ryan Blue <[email protected]> wrote:
>> >>>>>
>> >>>>>> Thanks, Jan. I also went through the combined proposal and it looks
>> >>>>> mostly
>> >>>>>> good to me.
>> >>>>>>
>> >>>>>>> First of all, to make it quick: Yes, the solution of having
>> >>>> nan_counts
>> >>>>>> *and* total order, which was brought up multiple times, does work
>> >> and
>> >>>>>> solves more cases than either one alone.
>> >>>>>>
>> >>>>>> Great, then we have a solution for both filtering use cases and for
>> >>>>> moving
>> >>>>>> ahead with total order. And thanks to Andrew for suggesting this as well
>> >>>>>> on the second PR. It also looks like something Orson is okay with,
>> >>>>>> given his comments on the latest PR.
>> >>>>>>
>> >>>>>> Is there anyone against the combined approach? I don't see a big
>> >>>> downside
>> >>>>>> for anyone. It is compatible with previous stats rules, has a NaN
>> >>>> count,
>> >>>>>> and allows using either type-specific order or total order.
>> >>>>>>
>> >>>>>> Assuming that this satisfies the big objections, I think we should
>> >> wait
>> >>>>> for
>> >>>>>> a few days to make sure everyone has time to check out the new PR
>> >> and
>> >>>>> then
>> >>>>>> vote to adopt it.
>> >>>>>>
>> >>>>>> Ryan
>> >>>>>>
>> >>>>>> On Mon, Aug 11, 2025 at 6:03 AM Andrew Lamb <
>> >> [email protected]>
>> >>>>>> wrote:
>> >>>>>>
>> >>>>>>> Thank you Jan -- I read through the new combined proposal, and I
>> >>>>> thought
>> >>>>>> it
>> >>>>>>> looks good and addresses the feedback so far. I left some small
>> >> style
>> >>>>>>> suggestions, but nothing that is required from my perspective
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> On Sat, Aug 9, 2025 at 9:07 AM Jan Finis <[email protected]>
>> >> wrote:
>> >>>>>>>
>> >>>>>>>> Hey Ryan,
>> >>>>>>>>
>> >>>>>>>> Thanks for chiming in. First of all, to make it quick: Yes, the
>> >>>>>> solution
>> >>>>>>> of
>> >>>>>>>> having nan_counts *and* total order, which was brought up
>> >> multiple
>> >>>>>> times,
>> >>>>>>>> does work and solves more cases than either one alone.
>> >>>>>>>>
>> >>>>>>>> I strongly prefer continuing to discuss the merits of these
>> >>>>> approaches
>> >>>>>>>>> rather than trying to decide with a vote.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> In theory, I agree that it isn't good to silence a discussion
>> >> by
>> >>>> just
>> >>>>>>>> voting for one possible solution and technical issues should be
>> >>>>>>> discussed.
>> >>>>>>>> However, please note that we have been circling on this for
>> >> over
>> >>>> two
>> >>>>>>> years
>> >>>>>>>> now, including an extended discussion that brought up all
>> >> arguments
>> >>>>>>>> multiple times. This is in stark contrast to the
>> >>>>>>>> speed with which you guys work on the Iceberg spec, for
>> >> example.
>> >>>>> There,
>> >>>>>>> you
>> >>>>>>>> also do not discuss the merits of various solutions for
>> >> multiple
>> >>>>> years.
>> >>>>>>> You
>> >>>>>>>> just pick one and merge it after a *reasonable* time of
>> >> discussion.
>> >>>>>>>> If you had the speed we currently have here, nothing would get
>> >>>> done.
>> >>>>>>> Thus,
>> >>>>>>>> I see this as a clear case of *"the perfect is the enemy of the
>> >>>>> good"*.
>> >>>>>>>> Yes, we can continue looking for the perfect solution,
>> >>>>>>>> but that will likely lead to keeping us at the status quo,
>> >> which is
>> >>>>> the
>> >>>>>>>> worst of them all.
>> >>>>>>>>
>> >>>>>>>> That being said, I'm also happy to create a PR which does both
>> >>>> total
>> >>>>>>> order
>> >>>>>>>> and NaN counts; after all, I just want the issue solved and all
>> >>>> these
>> >>>>>>>> solutions are better than the status quo.
>> >>>>>>>>
>> >>>>>>>> *As this was now suggested by at least three people, I guess it's
>> >>>> worth
>> >>>>>>>> doing, so here you go:
>> >>>>>> https://github.com/apache/parquet-format/pull/514
>> >>>>>>>> <https://github.com/apache/parquet-format/pull/514>*
>> >>>>>>>>
>> >>>>>>>> With this, we should have PRs covering most of the solution
>> >> space.
>> >>>>>>>> (I'm refusing to create a PR with negative and positive
>> >> nan_counts;
>> >>>>>>>> nan_counts + total order has to suffice; the complexity
>> >> madness has
>> >>>>> to
>> >>>>>>> stop
>> >>>>>>>> somewhere)
>> >>>>>>>> I still believe that there were a number of people who already
>> >>>> found
>> >>>>>>>> nan_counts too complex and therefore wanted IEEE total order,
>> >> and
>> >>>>> these
>> >>>>>>>> people may not like putting on extra complexity,
>> >>>>>>>> but let's see, maybe some have also changed their opinion in
>> >> the
>> >>>>>>> meantime.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> *Given all this, we can also first do an informal vote where
>> >>>> everyone
>> >>>>>> can
>> >>>>>>>> vote for which of the three their favorite would be. Maybe a
>> >> clear
>> >>>>>>> favorite
>> >>>>>>>> will emerge and then we can vote on this one.*
>> >>>>>>>>
>> >>>>>>>> But of course, we can also take some weeks to discuss the three
>> >>>>>>> solutions,
>> >>>>>>>> now that we have PRs for all of them. I just hope this won't
>> >> make
>> >>>> us
>> >>>>>>>> continue for another 2 years, or an
>> >>>>>>>> infinite stalemate where each solution is vetoed by a PMC
>> >> member.
>> >>>>>>>> (Sorry for becoming a bit cynical here; I have just spent way
>> >> too
>> >>>>> much
>> >>>>>>> time
>> >>>>>>>> of my life with double statistics at this point ;) ...)
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> Cheers,
>> >>>>>>>> Jan
>> >>>>>>>>
>> >>>>>>>> On Fri, Aug 8, 2025 at 11:38 PM Ryan Blue <[email protected]> wrote:
>> >>>>>>>>
>> >>>>>>>>> Regarding the process for this, I strongly prefer continuing
>> >> to
>> >>>>>> discuss
>> >>>>>>>> the
>> >>>>>>>>> merits of these approaches rather than trying to decide with
>> >> a
>> >>>>> vote.
>> >>>>>> I
>> >>>>>>>>> don't think it is a good practice to use a vote to decide on
>> >> a
>> >>>>>>> technical
>> >>>>>>>>> direction. There are very few situations that warrant it and
>> >> I
>> >>>>> don't
>> >>>>>>>> think
>> >>>>>>>>> that this is one of them. While this issue has been open for
>> >> a
>> >>>> long
>> >>>>>>> time,
>> >>>>>>>>> that appears to be the result of it not being anyone's top
>> >>>> priority
>> >>>>>>>> rather
>> >>>>>>>>> than indecision.
>> >>>>>>>>>
>> >>>>>>>>> For the technical merits of these approaches, I think that
>> >> we can
>> >>>>>> find
>> >>>>>>> a
>> >>>>>>>>> middle ground. I agree with Jan that when working with sorted
>> >>>>> values,
>> >>>>>>> we
>> >>>>>>>>> need to know how NaN values were handled and that requires
>> >> using
>> >>>> a
>> >>>>>>>>> well-defined order that includes NaN and its variations
>> >> (because
>> >>>> we
>> >>>>>>>> should
>> >>>>>>>>> not normalize). Using NaN count is not sufficient for
>> >> ordering
>> >>>>> rows.
>> >>>>>>>>>
>> >>>>>>>>> Gijs also brings up good points about how NaN values show up
>> >> in
>> >>>>>> actual
>> >>>>>>>>> datasets: not just when used in place of null, but also as
>> >> the
>> >>>>> result
>> >>>>>>> of
>> >>>>>>>>> normal calculations on abnormal data, like `sqrt(-4.0)` or
>> >>>>>> `log(-1.0)`.
>> >>>>>>>>> Both of those present problems when mixed with valid data
>> >> because
>> >>>>> of
>> >>>>>>> the
>> >>>>>>>>> stats "poisoning" problem, where the range of valid data is
>> >>>> usable
>> >>>>>>> until
>> >>>>>>>> a
>> >>>>>>>>> single NaN is mixed in.
>> >>>>>>>>>
>> >>>>>>>>> Another issue is that NaN is error-prone because "regular" comparison
>> >>>>>>>>> is always false:
>> >>>>>>>>> ```
>> >>>>>>>>> Math.log(-1.0) >= 2 => FALSE
>> >>>>>>>>> Math.log(-1.0) < 2 => FALSE
>> >>>>>>>>> 2 > Math.log(-1.0) => FALSE
>> >>>>>>>>> ```
>> >>>>>>>>>
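The comparisons quoted above can be reproduced with a small Java sketch.
It is illustrative only: the class name NanComparisonDemo and the helper
unordered are hypothetical and do not come from Iceberg or any Parquet
implementation.

```java
final class NanComparisonDemo {
    /** True when none of <, >, == holds, i.e. the pair is unordered. */
    static boolean unordered(double a, double b) {
        return !(a < b) && !(a > b) && !(a == b);
    }

    public static void main(String[] args) {
        double nan = Math.log(-1.0);            // log of a negative number is NaN
        System.out.println(nan >= 2);           // false
        System.out.println(nan < 2);            // false
        System.out.println(2 > nan);            // false
        System.out.println(unordered(nan, 2));  // true
        // Double.compare does impose an order: NaN is greater than everything else.
        System.out.println(Double.compare(nan, Double.POSITIVE_INFINITY) > 0); // true
    }
}
```

The point of the sketch is that min/max computed with the primitive
operators depend on which side of the comparison the NaN appears on,
which is why a reader cannot tell how a writer ordered it.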
>> >>>>>>>>> As a result, Iceberg doesn't trust NaN values as either lower or upper
>> >>>>>>>>> bounds because we don't want to go back to the code that produced the
>> >>>>>>>>> value to see what the comparison order was to determine whether NaN
>> >>>>>>>>> values go before or after others.
>> >>>>>>>>>
>> >>>>>>>>> Total order solves the second issue in theory, but regular comparison
>> >>>>>>>>> is prevalent and not obvious to developers. And it also doesn't help
>> >>>>>>>>> when NaN is used instead of null. So using total order is not
>> >>>>>>>>> sufficient for data skipping.
>> >>>>>>>>>
>> >>>>>>>>> I think the right compromise is to use `min`, `max`, and `nan_count`
>> >>>>>>>>> for data skipping stats (where min and max cannot be NaN) and total
>> >>>>>>>>> ordering for sorting values. That satisfies the data skipping use
>> >>>>>>>>> cases and also gives us an ordering of unaltered values that we can
>> >>>>>>>>> reason about.
>> >>>>>>>>>
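A minimal sketch of the data-skipping half of this compromise, assuming
min and max are computed over non-NaN values only and NaNs are counted
separately. The class and method names (NanAwareStatsSketch,
canSkipGreaterThan, canSkipIsNaN) are hypothetical, and this is not the
wording of the spec PR or the behavior of any existing writer.

```java
final class NanAwareStatsSketch {
    double min = Double.POSITIVE_INFINITY;   // only meaningful when nonNanCount > 0
    double max = Double.NEGATIVE_INFINITY;   // only meaningful when nonNanCount > 0
    long nanCount = 0;
    long nonNanCount = 0;

    /** Folds one value into the statistics; NaNs are counted, never stored in min/max. */
    void add(double v) {
        if (Double.isNaN(v)) {
            nanCount++;
            return;
        }
        nonNanCount++;
        if (v < min) min = v;
        if (v > max) max = v;
    }

    /** Can the page be skipped for "x > bound" under regular comparison? */
    boolean canSkipGreaterThan(double bound) {
        // NaN never satisfies x > bound, so only the non-NaN range matters.
        return nonNanCount == 0 || !(max > bound);
    }

    /** Can the page be skipped for an "x is NaN" predicate? */
    boolean canSkipIsNaN() {
        return nanCount == 0;
    }
}
```

With this shape a single NaN no longer "poisons" the range: a page
holding 1.0, NaN and 3.0 still reports min 1.0 and max 3.0, and a page
holding only NaNs can be skipped for any inequality filter. The sketch
ignores the -0.0/+0.0 distinction.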
>> >>>>>>>>> Does anyone think that doesn't work?
>> >>>>>>>>>
>> >>>>>>>>> Ryan
>> >>>>>>>>>
>> >>>>>>>>> On Fri, Aug 1, 2025 at 8:57 AM Gang Wu <[email protected]>
>> >> wrote:
>> >>>>>>>>>
>> >>>>>>>>>> Thanks Jan for your endless effort on this!
>> >>>>>>>>>>
>> >>>>>>>>>> I'm in favor of simplicity and generalism. I think we have
>> >>>>> already
>> >>>>>>>>> debated
>> >>>>>>>>>> a lot
>> >>>>>>>>>> for `nan_count` in [1] and [2] is the reflection of those
>> >>>>>>> discussions.
>> >>>>>>>>>> Therefore
>> >>>>>>>>>> I am inclined to start a vote for [2] unless there is a
>> >>>>>> significantly
>> >>>>>>>>>> better
>> >>>>>>>>>> proposal.
>> >>>>>>>>>>
>> >>>>>>>>>> I would encourage everyone interested in this discussion to
>> >>>> attend
>> >>>>>> the
>> >>>>>>>>>> scheduled
>> >>>>>>>>>> sync on Aug 6th (detailed below) to spread the word to the
>> >>>>> broader
>> >>>>>>>>>> community.
>> >>>>>>>>>> If we can get a consensus on [2], I can help start the
>> >> vote and
>> >>>>>> move
>> >>>>>>>>>> forward.
>> >>>>>>>>>>
>> >>>>>>>>>> *Apache Parquet Community Sync Wednesday, August 6 · 10:00
>> >> –
>> >>>>>> 11:00am
>> >>>>>>> *
>> >>>>>>>>>> *Time zone: America/Los_Angeles*
>> >>>>>>>>>> *Google Meet joining info Video call link:
>> >>>>>>>>>> https://meet.google.com/bhe-rvan-qjk
>> >>>>>>>>>> <https://meet.google.com/bhe-rvan-qjk> *
>> >>>>>>>>>>
>> >>>>>>>>>> [1] https://github.com/apache/parquet-format/pull/196
>> >>>>>>>>>> [2] https://github.com/apache/parquet-format/pull/221
>> >>>>>>>>>>
>> >>>>>>>>>> Best,
>> >>>>>>>>>> Gang
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> On Fri, Aug 1, 2025 at 6:16 PM Jan Finis <
>> >> [email protected]>
>> >>>>>> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>>> Hi Gijs,
>> >>>>>>>>>>>
>> >>>>>>>>>>> Thank you for bringing up concrete points, I'm happy to
>> >>>> discuss
>> >>>>>>> them
>> >>>>>>>> in
>> >>>>>>>>>>> detail.
>> >>>>>>>>>>>
>> >>>>>>>>>>> NaNs are less common in the SQL world than in the
>> >> DataFrame
>> >>>>> world
>> >>>>>>>> where
>> >>>>>>>>>>>> NaNs were used for a long time to represent missing
>> >> values.
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> You could transcode between NULL and NaN before reading
>> >> and
>> >>>>>> writing
>> >>>>>>> to
>> >>>>>>>>>>> Parquet. You basically mention yourself that NaNs were
>> >> used
>> >>>> for
>> >>>>>>>> missing
>> >>>>>>>>>>> values, i.e., what is commonly a NULL, which wasn't
>> >>>> available.
>> >>>>>> So,
>> >>>>>>>>>>> semantically, transcoding to NULL would even be the sane
>> >>>> thing
>> >>>>> to
>> >>>>>>> do.
>> >>>>>>>>>> Yes,
>> >>>>>>>>>>> that will cost you some cycles, but should be a rather
>> >>>>>> lightweight
>> >>>>>>>>>>> operation in comparison to most other operations, so I
>> >> would
>> >>>>>> argue
>> >>>>>>>> that
>> >>>>>>>>>> it
>> >>>>>>>>>>> won't totally ruin your performance. Similarly, why
>> >> should
>> >>>>>> Parquet
>> >>>>>>>> play
>> >>>>>>>>>>> along with a "hack" that was done in other frameworks
>> >> due to
>> >>>>>>>>> shortcomings
>> >>>>>>>>>>> of those frameworks? So from a philosophical point of
>> >> view, I
>> >>>>>> think
>> >>>>>>>>>>> supporting NaNs better is the wrong thing to do. Rather,
>> >> we
>> >>>>>> should
>> >>>>>>>> be a
>> >>>>>>>>>>> forcing function to align others to better behavior, so
>> >>>>> applying a
>> >>>>>>> bit
>> >>>>>>>>> of
>> >>>>>>>>>>> force might in the long run make people use NULLs also in
>> >>>>>>> DataFrames.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Of course, your argument also goes into the direction of
>> >>>>>>> pragmatism:
>> >>>>>>>>> If a
>> >>>>>>>>>>> large part of the data science world uses NaNs to encode
>> >>>>> missing
>> >>>>>>>>> values,
>> >>>>>>>>>>> then maybe Parquet should accept this de-facto standard
>> >>>> rather
>> >>>>>> than
>> >>>>>>>>>>> fighting it. That is indeed a valid point. The weight of
>> >> it
>> >>>> is
>> >>>>>>>>> debatable
>> >>>>>>>>>>> and my personal conclusion is that it's still not worth
>> >> it,
>> >>>> as
>> >>>>>> you
>> >>>>>>>> can
>> >>>>>>>>>>> transcode between NULLs and NaNs, but I do agree with its
>> >>>>>> validity.
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> Since the proposal phrases it as a goal to work
>> >> "regardless
>> >>>> of
>> >>>>>> how
>> >>>>>>>> they
>> >>>>>>>>>>>> order NaN w.r.t. other values" this statement feels
>> >>>>>> out-of-place
>> >>>>>>> to
>> >>>>>>>>> me.
>> >>>>>>>>>>>> Most hardware and most people don't care about total
>> >>>> ordering
>> >>>>>> and
>> >>>>>>>>>> needing
>> >>>>>>>>>>>> to take it into account while filtering using
>> >> statistics
>> >>>>> seems
>> >>>>>>> like
>> >>>>>>>>>>>> preferring the special case instead of the common case.
>> >>>>> Almost
>> >>>>>>>> no one
>> >>>>>>>>>>>> filters for specific NaN value bit-patterns. SQL
>> >> engines
>> >>>> that
>> >>>>>>> don't
>> >>>>>>>>>> have
>> >>>>>>>>>>>> IEEE total ordering as their default ordering for
>> >> floats
>> >>>> will
>> >>>>>>> also
>> >>>>>>>>> need
>> >>>>>>>>>>> to
>> >>>>>>>>>>>> do more special handling for this.
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> I disagree with the conclusion this statement draws. The
>> >>>>> current
>> >>>>>>>>>> behavior,
>> >>>>>>>>>>> and nan_counts without total ordering, pose a real
>> >> problem
>> >>>>> here,
>> >>>>>>> even
>> >>>>>>>>> for
>> >>>>>>>>>>> engines that don't care about bit patterns. I do agree
>> >> that
>> >>>>> most
>> >>>>>>>>> database
>> >>>>>>>>>>> engines, including the one I'm working on, do not care
>> >> about
>> >>>>> bit
>> >>>>>>>>> patterns
>> >>>>>>>>>>> and/or sign bits. However, how can our database engine
>> >> know
>> >>>>>> whether
>> >>>>>>>> the
>> >>>>>>>>>>> writer of a Parquet file saw it the same way? It can't.
>> >>>>>> Therefore,
>> >>>>>>> it
>> >>>>>>>>>>> cannot know whether a writer, for example, ordered NaNs
>> >>>> before
>> >>>>> or
>> >>>>>>>> after
>> >>>>>>>>>> all
>> >>>>>>>>>>> other numbers, or maybe ordered them by sign bit. So, if
>> >> our
>> >>>>>>> database
>> >>>>>>>>>>> engine now sees a float column in sorting columns, it
>> >> cannot
>> >>>>>> apply
>> >>>>>>>> any
>> >>>>>>>>>>> optimization without a lot of special casing, as it
>> >> doesn't
>> >>>>> know
>> >>>>>>>>> whether
>> >>>>>>>>>>> NaNs will be before all other values, after all other
>> >> values,
>> >>>>> or
>> >>>>>>>> maybe
>> >>>>>>>>>>> both, depending on sign bit. It could apply contrived
>> >> logic
>> >>>>> that
>> >>>>>>>> tries
>> >>>>>>>>> to
>> >>>>>>>>>>> infer where NaNs were placed from the NaN counts of the
>> >> first
>> >>>>> and
>> >>>>>>>> last
>> >>>>>>>>>>> page, but doing so will be a lot of ugly code that also
>> >> feels
>> >>>>> to
>> >>>>>> be
>> >>>>>>>> in
>> >>>>>>>>>> the
>> >>>>>>>>>>> wrong place. I.e., I don't want to need to load pages or
>> >> the
>> >>>>> page
>> >>>>>>>>> index,
>> >>>>>>>>>>> just to reason about a sort order.
>> >>>>>>>>>>>
>> >>>>>>>>>>> SQL engines that don't have
>> >>>>>>>>>>>> IEEE total ordering as their default ordering for
>> >> floats
>> >>>> will
>> >>>>>>> also
>> >>>>>>>>> need
>> >>>>>>>>>>> to
>> >>>>>>>>>>>> do more special handling for this.
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> This code, which I would indeed need to write for our engine, is
>> >>>>>>>>>>> comparatively trivial. Simply choose the largest possible bit
>> >>>>>>>>>>> pattern as the comparison value for upper-bound filtering for NaN,
>> >>>>>>>>>>> and the smallest possible bit pattern for lower bounds. It's not
>> >>>>>>>>>>> more than a few lines of code that check whether a filter is NaN
>> >>>>>>>>>>> and then replace its value with the highest/lowest NaN bit pattern.
>> >>>>>>>>>>> It is about as trivial as the special casing I need to do with
>> >>>>>>>>>>> nan_counts, and it is far more trivial than the extra code I would
>> >>>>>>>>>>> need to write for sorting columns, as described above.
>> >>>>>>>>>>>
>> >>>>>>>>>>> From a Polars perspective, having a `nan_count` and
>> >> defining
>> >>>>> what
>> >>>>>>>>>>>> happens to the `min` and `max` statistics when a page
>> >>>>> contains
>> >>>>>>> only
>> >>>>>>>>>> NaNs
>> >>>>>>>>>>> is
>> >>>>>>>>>>>> enough to allow for all predicate filtering. I think,
>> >> but
>> >>>>>> correct
>> >>>>>>>> me
>> >>>>>>>>>> if I
>> >>>>>>>>>>>> am wrong, this is also enough for all SQL engines that
>> >>>> don't
>> >>>>>> use
>> >>>>>>>>> total
>> >>>>>>>>>>>> ordering.
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> It's not fully enough, as depicted above. Sorting columns
>> >>>> would
>> >>>>>>> still
>> >>>>>>>>> not
>> >>>>>>>>>>> work properly.
>> >>>>>>>>>>>
>> >>>>>>>>>>> As for ways forward, I propose merging the `nan_count`
>> >> and
>> >>>>> `sort
>> >>>>>>>>>> ordering`
>> >>>>>>>>>>>> proposals into one to make one proposal
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> Note that the initial reason for proposing IEEE total
>> >> order
>> >>>> was
>> >>>>>>> that
>> >>>>>>>>>> people
>> >>>>>>>>>>> in the discussion threads found nan_counts to be too
>> >> complex
>> >>>>> and
>> >>>>>>> too
>> >>>>>>>>> much
>> >>>>>>>>>>> of an undeserving special case (re-read the discussion
>> >> in the
>> >>>>>>> initial
>> >>>>>>>>> PR
>> >>>>>>>>>>> <https://github.com/apache/parquet-format/pull/196> to
>> >> see
>> >>>> the
>> >>>>>>>>>>> rationales).
>> >>>>>>>>>>> So merging both together would go totally against the
>> >> spirit
>> >>>> of
>> >>>>>> why
>> >>>>>>>>> IEEE
>> >>>>>>>>>>> total order was proposed. While it has further upsides,
>> >> the
>> >>>>> main
>> >>>>>>>> reason
>> >>>>>>>>>> was
>> >>>>>>>>>>> indeed to *not have* nan_counts. If now the proposal
>> >> would
>> >>>> even
>> >>>>>> go
>> >>>>>>> to
>> >>>>>>>>>>> positive and negative nan counts (i.e., even more
>> >>>> complexity),
>> >>>>>> this
>> >>>>>>>>> would
>> >>>>>>>>>>> go 180 degrees into the opposite direction of why people
>> >>>> wanted
>> >>>>>>> total
>> >>>>>>>>>> order
>> >>>>>>>>>>> in the first place.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Cheers,
>> >>>>>>>>>>> Jan
>> >>>>>>>>>>>
>> >>>>>>>>>>> On Thu, Jul 31, 2025 at 11:23 PM Gijs Burghoorn <[email protected]> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>>> Hello Jan and others,
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> First, let me preface by saying I am quite new here.
>> >> So I
>> >>>>>>> apologize
>> >>>>>>>>> if
>> >>>>>>>>>>>> there is some other better way to bring up these
>> >> concerns.
>> >>>> I
>> >>>>>>>>> understand
>> >>>>>>>>>>> it
>> >>>>>>>>>>>> is very annoying to come in at the 11th hour and start
>> >>>>> bringing
>> >>>>>>> up
>> >>>>>>>> a
>> >>>>>>>>>>> bunch
>> >>>>>>>>>>>> of concerns, but I would also like this to be done
>> >> right. A
>> >>>>>>>> colleague
>> >>>>>>>>>> of
>> >>>>>>>>>>>> mine brought up some concerns and alternative
>> >> approaches in
>> >>>>> the
>> >>>>>>>>> GitHub
>> >>>>>>>>>>>> thread; I will file some of the concerns here as a
>> >>>> response.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>> Treating NaNs so specially is giving them attention
>> >> they
>> >>>>>> don't
>> >>>>>>>>>> deserve.
>> >>>>>>>>>>>> Most data sets do not contain NaNs. If a use case
>> >> really
>> >>>>>> requires
>> >>>>>>>>> them
>> >>>>>>>>>>> and
>> >>>>>>>>>>>> needs filtering to ignore them, they can store NULL
>> >>>> instead,
>> >>>>> or
>> >>>>>>>>> encode
>> >>>>>>>>>>> them
>> >>>>>>>>>>>> differently. I would prefer the average case over the
>> >>>> special
>> >>>>>>> case
>> >>>>>>>>>> here.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> NaNs are less common in the SQL world than in the
>> >> DataFrame
>> >>>>>> world
>> >>>>>>>>> where
>> >>>>>>>>>>>> NaNs were used for a long time to represent missing
>> >> values.
>> >>>>>> They
>> >>>>>>>>> still
>> >>>>>>>>>>>> exist with different canonical representations and
>> >>>> different
>> >>>>>> sign
>> >>>>>>>>>> bits. I
>> >>>>>>>>>>>> agree it might not be correct semantically, but sadly
>> >> that
>> >>>> is
>> >>>>>> the
>> >>>>>>>>> world
>> >>>>>>>>>>> we
>> >>>>>>>>>>>> deal with. NumPy and Numba do not have missing data
>> >>>>>>> functionality,
>> >>>>>>>>>> people
>> >>>>>>>>>>>> use NaNs there, and people definitely use that in their
>> >>>>>>> analytical
>> >>>>>>>>>>>> dataflows. Another point that was brought up in the GH
>> >>>>>> discussion
>> >>>>>>>> was
>> >>>>>>>>>>> "what
>> >>>>>>>>>>>> about infinity? You could argue that having infinity in
>> >>>>>>> statistics
>> >>>>>>>> is
>> >>>>>>>>>>>> similarly unuseful as it's too wide of a bound". I
>> >> would
>> >>>>> argue
>> >>>>>>> that
>> >>>>>>>>>>>> infinity is very different as there is no discussion on
>> >>>> what
>> >>>>>> the
>> >>>>>>>>>> ordering
>> >>>>>>>>>>>> or pattern of infinity is. Everyone agrees that
>> >> `min(1.0,
>> >>>>> inf,
>> >>>>>>>> -inf)
>> >>>>>>>>> ==
>> >>>>>>>>>>>> -inf` and each infinity only has a single bit pattern.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>> It gives a defined order to every bit pattern and
>> >> thus
>> >>>>>> yields a
>> >>>>>>>>> total
>> >>>>>>>>>>>> order, mathematically speaking, which has value by
>> >> itself.
>> >>>>> With
>> >>>>>>> NaN
>> >>>>>>>>>>> counts,
>> >>>>>>>>>>>> it was still undefined how different bit patterns of
>> >> NaNs
>> >>>>> were
>> >>>>>>>>> supposed
>> >>>>>>>>>>> to
>> >>>>>>>>>>>> be ordered, whether NaN was allowed to have a sign bit,
>> >>>> etc.,
>> >>>>>>>> risking
>> >>>>>>>>>>> that
>> >>>>>>>>>>>> different engines could come to different results while
>> >>>>>> filtering
>> >>>>>>>> or
>> >>>>>>>>>>>> sorting values within a file.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Since the proposal phrases it as a goal to work
>> >> "regardless
>> >>>>> of
>> >>>>>>> how
>> >>>>>>>>> they
>> >>>>>>>>>>>> order NaN w.r.t. other values" this statement feels
>> >>>>>> out-of-place
>> >>>>>>> to
>> >>>>>>>>> me.
>> >>>>>>>>>>>> Most hardware and most people don't care about total
>> >>>> ordering
>> >>>>>> and
>> >>>>>>>>>> needing
>> >>>>>>>>>>>> to take it into account while filtering using
>> >> statistics
>> >>>>> seems
>> >>>>>>> like
>> >>>>>>>>>>>> preferring the special case instead of the common case.
>> >>>>> Almost
>> >>>>>>>> no one
>> >>>>>>>>>>>> filters for specific NaN value bit-patterns. SQL
>> >> engines
>> >>>> that
>> >>>>>>> don't
>> >>>>>>>>>> have
>> >>>>>>>>>>>> IEEE total ordering as their default ordering for
>> >> floats
>> >>>> will
>> >>>>>>> also
>> >>>>>>>>> need
>> >>>>>>>>>>> to
>> >>>>>>>>>>>> do more special handling for this.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> I also agree with my colleague that doing an approach
>> >> that
>> >>>> is
>> >>>>>> 50%
>> >>>>>>>> of
>> >>>>>>>>>> the
>> >>>>>>>>>>>> way there will make the barrier to improving it to
>> >> what it
>> >>>>>>> actually
>> >>>>>>>>>>> should
>> >>>>>>>>>>>> be later on much higher.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> As for ways forward, I propose merging the `nan_count`
>> >> and
>> >>>>>> `sort
>> >>>>>>>>>>> ordering`
>> >>>>>>>>>>>> proposals into one to make one proposal, as they are
>> >> linked
>> >>>>>>>> together,
>> >>>>>>>>>> and
>> >>>>>>>>>>>> moving forward with one without knowing what will
>> >> happen to
>> >>>>> the
>> >>>>>>>> other
>> >>>>>>>>>>> seems
>> >>>>>>>>>>>> unwise. From a Polars perspective, having a
>> >> `nan_count` and
>> >>>>>>>> defining
>> >>>>>>>>>> what
>> >>>>>>>>>>>> happens to the `min` and `max` statistics when a page
>> >>>>> contains
>> >>>>>>> only
>> >>>>>>>>>> NaNs
>> >>>>>>>>>>> is
>> >>>>>>>>>>>> enough to allow for all predicate filtering. I think,
>> >> but
>> >>>>>> correct
>> >>>>>>>> me
>> >>>>>>>>>> if I
>> >>>>>>>>>>>> am wrong, this is also enough for all SQL engines that
>> >>>> don't
>> >>>>>> use
>> >>>>>>>>> total
>> >>>>>>>>>>>> ordering. But if you want to be impartial to the
>> >> engine's
>> >>>>>>>>>> floating-point
>> >>>>>>>>>>>> ordering and allow engines with total ordering to do
>> >>>>> inequality
>> >>>>>>>>> filters
>> >>>>>>>>>>>> when `nan_count > 0` you would need a
>> >> `positive_nan_count`
>> >>>>> and
>> >>>>>> a
>> >>>>>>>>>>>> `negative_nan_count`. I understand the downside with
>> >> Thrift
>> >>>>>>>>> complexity,
>> >>>>>>>>>>> but
>> >>>>>>>>>>>> introducing another sort order is also adding
>> >> complexity
>> >>>> just
>> >>>>>> in
>> >>>>>>> a
>> >>>>>>>>>>>> different place.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> I would really like to see this move forward, so I hope
>> >>>> these
>> >>>>>>>>> concerns
>> >>>>>>>>>>> help
>> >>>>>>>>>>>> move it forward towards a solution that works for
>> >> everyone.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Kind regards,
>> >>>>>>>>>>>> Gijs
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> On Thu, Jul 31, 2025 at 7:46 PM Andrew Lamb <
>> >>>>>>>> [email protected]>
>> >>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>> I would also be in favor of starting a vote
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> On Thu, Jul 31, 2025 at 11:23 AM Jan Finis <
>> >>>>>> [email protected]>
>> >>>>>>>>>> wrote:
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>> As the author of both the IEEE754 total order
>> >>>>>>>>>>>>>> <https://github.com/apache/parquet-format/pull/221> PR and the
>> >>>>>>>>>>>>>> earlier PR that basically proposed `nan_count`
>> >>>>>>>>>>>>>> <https://github.com/apache/parquet-format/pull/196>, my current
>> >>>>>>>>>>>>>> vote would be for IEEE754 total order.
>> >>>>>>>>>>>>>> Consequently, I would like to request a formal
>> >> vote for
>> >>>>> the
>> >>>>>>> PR
>> >>>>>>>>>>>>> introducing
>> >>>>>>>>>>>>>> IEEE754 total order (
>> >>>>>>>>>>> https://github.com/apache/parquet-format/pull/221
>> >>>>>>>>>>>> ),
>> >>>>>>>>>>>>>> if
>> >>>>>>>>>>>>>> that is possible.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> My Rationales:
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>  - It's conceptually simpler. It's easier to explain. It's
>> >>>>>>>>>>>>>>    based on an IEEE-standardized order predicate.
>> >>>>>>>>>>>>>>  - There are already multiple implementations showing
>> >>>>>>>>>>>>>>    feasibility. This will likely make the adoption quicker.
>> >>>>>>>>>>>>>>  - It gives a defined order to every bit pattern and thus yields
>> >>>>>>>>>>>>>>    a total order, mathematically speaking, which has value by
>> >>>>>>>>>>>>>>    itself. With NaN counts, it was still undefined how different
>> >>>>>>>>>>>>>>    bit patterns of NaNs were supposed to be ordered, whether NaN
>> >>>>>>>>>>>>>>    was allowed to have a sign bit, etc., risking that different
>> >>>>>>>>>>>>>>    engines could come to different results while filtering or
>> >>>>>>>>>>>>>>    sorting values within a file.
>> >>>>>>>>>>>>>>  - It also solves sort order completely. With nan_counts only, it
>> >>>>>>>>>>>>>>    is still undefined whether NaNs should be sorted before or
>> >>>>>>>>>>>>>>    after all values (or both, depending on sign bit), so any file
>> >>>>>>>>>>>>>>    including NaNs could not really leverage sort order without
>> >>>>>>>>>>>>>>    being ambiguous.
>> >>>>>>>>>>>>>>  - It's less complex in Thrift. Having fields that only apply to
>> >>>>>>>>>>>>>>    a handful of data types is somewhat weird. If every type did
>> >>>>>>>>>>>>>>    this, we would have a plethora of non-generic fields in Thrift.
>> >>>>>>>>>>>>>>  - Treating NaNs so specially is giving them attention they don't
>> >>>>>>>>>>>>>>    deserve. Most data sets do not contain NaNs. If a use case
>> >>>>>>>>>>>>>>    really requires them and needs filtering to ignore them, they
>> >>>>>>>>>>>>>>    can store NULL instead, or encode them differently. I would
>> >>>>>>>>>>>>>>    prefer the average case over the special case here.
>> >>>>>>>>>>>>>>  - The majority of the people discussing this so far seem to
>> >>>>>>>>>>>>>>    favor total order.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Cheers,
>> >>>>>>>>>>>>>> Jan
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> On Sat, Jul 26, 2025 at 5:38 PM Gang Wu <[email protected]> wrote:
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Hi all,
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> As this discussion has been open for more than two years, I’d
>> >>>>>>>>>>>>>>> like to bump up this thread again to update the progress and
>> >>>>>>>>>>>>>>> collect feedback.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> *Background*
>> >>>>>>>>>>>>>>> • Today Parquet’s min/max stats and page index omit NaNs entirely.
>> >>>>>>>>>>>>>>> • Engines can’t safely prune floating-point values because they
>> >>>>>>>>>>>>>>>   know nothing about NaNs.
>> >>>>>>>>>>>>>>> • Column index is disabled if any page contains only NaNs.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> There are two active proposals as below:
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> *Proposal A - IEEE754TotalOrder* (from the PR [1])
>> >>>>>>>>>>>>>>> • Define a new ColumnOrder to include +0, –0 and all NaN
>> >>>>>>>>>>>>>>>   bit‐patterns.
>> >>>>>>>>>>>>>>> • Stats and column index store NaNs if they appear.
>> >>>>>>>>>>>>>>> • Three PoC impls are ready: arrow-rs [2], duckdb [3] and
>> >>>>>>>>>>>>>>>   parquet-java [4].
>> >>>>>>>>>>>>>>> • For more context of this approach, please refer to discussion
>> >>>>>>>>>>>>>>>   in [5].
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> *Proposal B - add nan_count* (from a comment [6] to [1])
>> >>>>>>>>>>>>>>> • Add `nan_count` to stats and a `nan_counts` list to column index.
>> >>>>>>>>>>>>>>> • For all‐NaNs cases, write NaN to min/max and use nan_count to
>> >>>>>>>>>>>>>>>   distinguish.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Both solutions have pros and cons but are way better than the
>> >>>>>>>>>>>>>>> status quo today.
>> >>>>>>>>>>>>>>> Please share your thoughts on the two proposals above, or maybe
>> >>>>>>>>>>>>>>> come up with better alternatives. We need to reach consensus on
>> >>>>>>>>>>>>>>> one proposal and move forward.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> [1] https://github.com/apache/parquet-format/pull/221
>> >>>>>>>>>>>>>>> [2] https://github.com/apache/arrow-rs/pull/7408
>> >>>>>>>>>>>>>>> [3] https://github.com/duckdb/duckdb/compare/main...Mytherin:duckdb:ieeeorder
>> >>>>>>>>>>>>>>> [4] https://github.com/apache/parquet-java/pull/3191
>> >>>>>>>>>>>>>>> [5] https://github.com/apache/parquet-format/pull/196
>> >>>>>>>>>>>>>>> [6] https://github.com/apache/parquet-format/pull/221#issuecomment-2931376077
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Best,
>> >>>>>>>>>>>>>>> Gang
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> On Tue, Mar 28, 2023 at 4:22 PM Jan Finis <[email protected]> wrote:
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Dear contributors,
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> My PR has now gathered comments for a week and the gist of all
>> >>>>>>>>>>>>>>>> open issues is the question of how to encode pages/column
>> >>>>>>>>>>>>>>>> chunks that contain only NaNs. There are different suggestions
>> >>>>>>>>>>>>>>>> and I don't see one common favorite yet.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> I have outlined three alternatives of how we can handle these
>> >>>>>>>>>>>>>>>> and I want us to reach a conclusion here, so I can update my PR
>> >>>>>>>>>>>>>>>> accordingly and move on with it. As this is my first
>> >>>>>>>>>>>>>>>> contribution to parquet, I don't know the decision processes
>> >>>>>>>>>>>>>>>> here. Do we vote? Is there a single or group of decision
>> >>>>>>>>>>>>>>>> makers? *Please let me know how to come to a conclusion here;
>> >>>>>>>>>>>>>>>> what are the next steps?*
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> For reference, here are the three alternatives I pointed out.
>> >>>>>>>>>>>>>>>> You can find detailed description of their PROs and CONs in my
>> >>>>>>>>>>>>>>>> comment:
>> >>>>>>>>>>>>>>>> https://github.com/apache/parquet-format/pull/196#issuecomment-1486416762
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> 1. My initial proposal, i.e., encoding only-NaN pages by
>> >>>>>>>>>>>>>>>>    min=max=NaN.
>> >>>>>>>>>>>>>>>> 2. Adding `num_values` to the ColumnIndex, to make it symmetric
>> >>>>>>>>>>>>>>>>    with Statistics in pages & `ColumnMetaData` and to enable the
>> >>>>>>>>>>>>>>>>    computation `num_values - null_count - nan_count == 0`
>> >>>>>>>>>>>>>>>> 3. Adding a `nan_pages` bool list to the column index, which
>> >>>>>>>>>>>>>>>>    indicates whether a page contains only NaNs
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Cheers
>> >>>>>>>>>>>>>>>> Jan Finis
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>>
>> >>>>
>> >>
>>
>>
>>
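
For readers skimming the quoted history above, here is a small,
non-authoritative sketch of the two ideas discussed in it: ordering
float64 values by IEEE 754 total order (so that -0.0 sorts before +0.0
and NaNs sort at the ends according to their sign bit), and the
`num_values - null_count - nan_count == 0` check for an only-NaN page.
The names `from_bits`, `total_order_key` and `page_is_all_nan` are
hypothetical helpers made up for illustration; they are not taken from
the spec PR or from any of the PoC implementations.

```python
import struct


def from_bits(bits: int) -> float:
    """Build a float64 from a raw 64-bit pattern (illustrative helper)."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def total_order_key(x: float) -> int:
    """Integer key whose ordering follows IEEE 754 totalOrder for float64."""
    # Reinterpret the 8 bytes of the double as a signed 64-bit integer.
    bits = struct.unpack("<q", struct.pack("<d", x))[0]
    if bits < 0:
        # Sign bit set: flip the other 63 bits so that a larger magnitude
        # sorts lower (-NaN < -inf < ... < -0.0).
        bits ^= 0x7FFF_FFFF_FFFF_FFFF
    return bits


def page_is_all_nan(num_values: int, null_count: int, nan_count: int) -> bool:
    """True when no non-null, non-NaN values remain in the page."""
    return num_values - null_count - nan_count == 0


NEG_NAN = from_bits(0xFFF8_0000_0000_0000)  # quiet NaN, sign bit set
POS_NAN = from_bits(0x7FF8_0000_0000_0000)  # quiet NaN, sign bit clear

values = [1.0, POS_NAN, 0.0, float("-inf"), NEG_NAN, -0.0, float("inf"), -1.0]
# Sorting by the key yields:
# -NaN, -inf, -1.0, -0.0, +0.0, 1.0, +inf, +NaN
ordered = sorted(values, key=total_order_key)
```

The key function only assumes the standard binary64 layout; a real
writer would apply the same comparison when computing min/max, while
`page_is_all_nan` needs nothing more than the three counters named in
the thread.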
