Hello Gang and others,

I am willing to implement the C++ POC.



> On Mar 14, 2026, at 23:56, Gang Wu <[email protected]> wrote:
> 
> Update:
> 
> Java POC is ready for IEEE 754 column order combined with nan_count:
> https://github.com/apache/parquet-java/pull/3393
> 
> The spec PR has been updated earlier to address all comments:
> https://github.com/apache/parquet-format/pull/514
> 
> Really appreciate any review and feedback!
> 
> Best,
> Gang
> 
> 
> 
> 
> On Wed, Feb 11, 2026 at 4:24 PM Gang Wu <[email protected]> wrote:
> 
>> Hello all,
>> 
>> I'm reaching out to help drive this long-running discussion—nearly
>> three years now—towards a final resolution. With Jan's authorization,
>> and my sincere thanks for his sustained effort, I want to help push
>> this issue to the finish line.
>> 
>> To recap, we have two primary proposals on how to handle NaNs in
>> statistics and column indexes:
>> 
>> * IEEE 754 Total Order [1]: Proposes adding a new column order
>> IEEE754TotalOrder for FLOAT/DOUBLE/FLOAT16. This provides a defined
>> ordering for every float bit pattern, including NaNs and -0/+0,
>> allowing writers to include NaNs in min/max and removing ambiguity for
>> only-NaN pages.
>> * Combined Approach [2]: Proposes adopting the IEEE 754 total order
>> alongside explicit nan_count(s) fields. This approach mandates the
>> nan_count(s) when the new order is used and clarifies how to handle
>> edge cases from legacy writers.
>> 
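
For readers following along, here is a minimal sketch of what "a defined ordering for every float bit pattern" can mean in practice. It is an illustration in Python, not code from either PR. The helper names (`double_bits`, `double_from_bits`, `total_order_key`) are made up for this example. The key is the usual bit trick for comparing doubles in IEEE 754 totalOrder, and `nan_count` here is just a plain count of the NaNs in the values.

```python
import struct

SIGN_BIT = 1 << 63
ALL_ONES = (1 << 64) - 1

def double_bits(x: float) -> int:
    """Raw 64-bit pattern of a double, as an unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def double_from_bits(bits: int) -> float:
    """Inverse of double_bits; used here to build NaNs with a chosen sign."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def total_order_key(x: float) -> int:
    """Unsigned key whose integer order follows IEEE 754 totalOrder."""
    bits = double_bits(x)
    if bits & SIGN_BIT:
        return bits ^ ALL_ONES   # negative: invert so larger magnitude sorts first
    return bits | SIGN_BIT       # non-negative: place above every negative value

# Hypothetical sample page with both NaN signs and both zeros.
POS_NAN = double_from_bits(0x7FF8000000000000)
NEG_NAN = double_from_bits(0xFFF8000000000000)
values = [1.5, POS_NAN, -0.0, float("inf"), NEG_NAN, 0.0, float("-inf"), -2.0]

ordered = sorted(values, key=total_order_key)
nan_count = sum(1 for v in values if v != v)  # NaN is the only value unequal to itself
```

With this key, negative NaNs sort below -inf, -0 sorts below +0, and positive NaNs sort above +inf, so every value on the page has a well-defined position.
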
>> Based on the recent comments, it appears the combined approach [2] is
>> gaining consensus, although the IEEE 754 total order [1] still has
>> strong advocates.
>> 
>> I agree with the sentiment that technical direction should be made by
>> consensus, not a vote. To that end, I'd like to solicit further
>> feedback specifically on the combined approach [2] to see if we can
>> achieve the necessary consensus to move forward now.
>> 
>> I recall that the total order proposal [1] already has three PoC
>> implementations. For the combined approach [2], I can draft a PoC in
>> parquet-java, but to meet the two-implementation requirement, we would
>> need one more contributor to step up.
>> 
>> [1] https://github.com/apache/parquet-format/pull/221
>> [2] https://github.com/apache/parquet-format/pull/514
>> 
>> Best,
>> Gang
>> 
>> 
>> On Sat, Aug 16, 2025 at 1:59 AM Gijs Burghoorn <[email protected]>
>> wrote:
>>> 
>>> Hello Jan,
>>> 
>>> Thank you for pushing this through. Apart from some smaller nits, we also
>>> really like the current proposal.
>>> 
>>> Thanks,
>>> Gijs
>>> 
>>> On Fri, Aug 15, 2025 at 3:33 PM Andrew Lamb <[email protected]>
>> wrote:
>>> 
>>>> I have started organizing a project[1] in arrow-rs's Parquet reader to try
>>>> and implement this proposal.
>>>> 
>>>> Hopefully that can be one of the two open source implementations needed.
>>>> 
>>>> Thanks again for helping drive this along,
>>>> Andrew
>>>> 
>>>> [1] https://github.com/apache/arrow-rs/issues/8156
>>>> 
>>>> On Wed, Aug 13, 2025 at 5:39 AM Jan Finis <[email protected]> wrote:
>>>> 
>>>>> I have now tagged
>>>>> <https://github.com/apache/parquet-format/pull/514#issuecomment-3182978173>
>>>>> the people that argued for total order in the initial PR. Let's see their
>>>>> response.
>>>>> 
>>>>> If I understand the adoption process correctly, the next hurdle to getting
>>>>> this adopted is two open source (!) implementations proving its
>>>>> feasibility. We already had that for IEEE total order. If we prefer the
>>>>> solution with nan counts, we'll need it there as well. I myself work on a
>>>>> proprietary implementation, so I'm counting on others here :). Be prepared
>>>>> though, this will likely take months unless the interest in this topic has
>>>>> risen to a point where people are eager to jump on the implementation
>>>>> right away.
>>>>> 
>>>>> So, I guess it will take some months of soaking time before any formal vote
>>>>> can be done (given that we reach consensus that this is what we want and we
>>>>> find people for the implementations).
>>>>> 
>>>>> Cheers,
>>>>> Jan
>>>>> 
>>>>> On Wed, Aug 13, 2025 at 1:18 AM Ryan Blue <[email protected]> wrote:
>>>>> 
>>>>>> Thanks, Jan. I also went through the combined proposal and it looks
>>>>> mostly
>>>>>> good to me.
>>>>>> 
>>>>>>> First of all, to make it quick: Yes, the solution of having
>>>> nan_counts
>>>>>> *and* total order, which was brought up multiple times, does work
>> and
>>>>>> solves more cases than just either of both.
>>>>>> 
>>>>>> Great, then we have a solution for both filtering use cases and for
>>>>> moving
>>>>>> ahead with total order. And thanks to Andrew for suggesting this as
>>>> well
>>>>> on
>>>>>> the second PR. It also looks like this is something that Orson is
>>>>>> okay with given his comments on the latest PR.
>>>>>> 
>>>>>> Is there anyone against the combined approach? I don't see a big
>>>> downside
>>>>>> for anyone. It is compatible with previous stats rules, has a NaN
>>>> count,
>>>>>> and allows using either type-specific order or total order.
>>>>>> 
>>>>>> Assuming that this satisfies the big objections, I think we should
>> wait
>>>>> for
>>>>>> a few days to make sure everyone has time to check out the new PR
>> and
>>>>> then
>>>>>> vote to adopt it.
>>>>>> 
>>>>>> Ryan
>>>>>> 
>>>>>> On Mon, Aug 11, 2025 at 6:03 AM Andrew Lamb <
>> [email protected]>
>>>>>> wrote:
>>>>>> 
>>>>>>> Thank you Jan -- I read through the new combined proposal, and I
>>>>> thought
>>>>>> it
>>>>>>> looks good and addresses the feedback so far. I left some small
>> style
>>>>>>> suggestions, but nothing that is required from my perspective
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Sat, Aug 9, 2025 at 9:07 AM Jan Finis <[email protected]>
>> wrote:
>>>>>>> 
>>>>>>>> Hey Ryan,
>>>>>>>> 
>>>>>>>> Thanks for chiming in. First of all, to make it quick: Yes, the
>>>>>> solution
>>>>>>> of
>>>>>>>> having nan_counts *and* total order, which was brought up
>> multiple
>>>>>> times,
>>>>>>>> does work and solves more cases than just either of both.
>>>>>>>> 
>>>>>>>> I strongly prefer continuing to discuss the merits of these
>>>>> approaches
>>>>>>>>> rather than trying to decide with a vote.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> In theory, I agree that it isn't good to silence a discussion
>> by
>>>> just
>>>>>>>> voting for one possible solution and technical issues should be
>>>>>>> discussed.
>>>>>>>> However, please note that we have been circling on this for
>> over
>>>> two
>>>>>>> years
>>>>>>>> now, including an extended discussion that brought up all
>> arguments
>>>>>>>> multiple times. This is in stark contrast to the
>>>>>>>> speed with which you guys work on the Iceberg spec, for
>> example.
>>>>> There,
>>>>>>> you
>>>>>>>> also do not discuss the merits of various solutions for
>> multiple
>>>>> years.
>>>>>>> You
>>>>>>>> just pick one and merge it after a *reasonable* time of
>> discussion.
>>>>>>>> If you had the speed we currently have here, nothing would get
>>>> done.
>>>>>>> Thus,
>>>>>>>> I see this as a clear case of *"the perfect is the enemy of the
>>>>> good"*.
>>>>>>>> Yes, we can continue looking for the perfect solution,
>>>>>>>> but that will likely lead to keeping us at the status quo,
>> which is
>>>>> the
>>>>>>>> worst of them all.
>>>>>>>> 
>>>>>>>> That being said, I'm also happy to create a PR which does both
>>>> total
>>>>>>> order
>>>>>>>> and NaN counts; after all, I just want the issue solved and all
>>>> these
>>>>>>>> solutions are better than the status quo.
>>>>>>>> 
>>>>>>>> *As this was now suggested by at least three people, I guess it's worth
>>>>>>>> doing, so here you go: https://github.com/apache/parquet-format/pull/514*
>>>>>>>> 
>>>>>>>> With this, we should have PRs covering most of the solution
>> space.
>>>>>>>> (I'm refusing to create a PR with negative and positive
>> nan_counts;
>>>>>>>> nan_counts + total order has to suffice; the complexity
>> madness has
>>>>> to
>>>>>>> stop
>>>>>>>> somewhere)
>>>>>>>> I still believe that there were a number of people who already found
>>>>>>>> nan_counts too complex and therefore wanted IEEE total order, and these
>>>>>>>> people may not like taking on extra complexity,
>>>>>>>> but let's see, maybe some have also changed their opinion in the meantime.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> *Given all this, we can also first do an informal vote where everyone can
>>>>>>>> vote for which of the three would be their favorite. Maybe a clear favorite
>>>>>>>> will emerge and then we can vote on this one.*
>>>>>>>> 
>>>>>>>> But of course, we can also take some weeks to discuss the three
>>>>>>> solutions,
>>>>>>>> now that we have PRs for all of them. I just hope this won't
>> make
>>>> us
>>>>>>>> continue for another 2 years, or an
>>>>>>>> infinite stalemate where each solution is vetoed by a PMC
>> member.
>>>>>>>> (Sorry for becoming a bit cynical here; I have just spent way
>> too
>>>>> much
>>>>>>> time
>>>>>>>> of my life with double statistics at this point ;) ...)
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> Jan
>>>>>>>> 
>>>>>>>> On Fri, Aug 8, 2025 at 11:38 PM Ryan Blue <[email protected]> wrote:
>>>>>>>> 
>>>>>>>>> Regarding the process for this, I strongly prefer continuing
>> to
>>>>>> discuss
>>>>>>>> the
>>>>>>>>> merits of these approaches rather than trying to decide with
>> a
>>>>> vote.
>>>>>> I
>>>>>>>>> don't think it is a good practice to use a vote to decide on
>> a
>>>>>>> technical
>>>>>>>>> direction. There are very few situations that warrant it and
>> I
>>>>> don't
>>>>>>>> think
>>>>>>>>> that this is one of them. While this issue has been open for
>> a
>>>> long
>>>>>>> time,
>>>>>>>>> that appears to be the result of it not being anyone's top
>>>> priority
>>>>>>>> rather
>>>>>>>>> than indecision.
>>>>>>>>> 
>>>>>>>>> For the technical merits of these approaches, I think that
>> we can
>>>>>> find
>>>>>>> a
>>>>>>>>> middle ground. I agree with Jan that when working with sorted
>>>>> values,
>>>>>>> we
>>>>>>>>> need to know how NaN values were handled and that requires
>> using
>>>> a
>>>>>>>>> well-defined order that includes NaN and its variations
>> (because
>>>> we
>>>>>>>> should
>>>>>>>>> not normalize). Using NaN count is not sufficient for
>> ordering
>>>>> rows.
>>>>>>>>> 
>>>>>>>>> Gijs also brings up good points about how NaN values show up
>> in
>>>>>> actual
>>>>>>>>> datasets: not just when used in place of null, but also as
>> the
>>>>> result
>>>>>>> of
>>>>>>>>> normal calculations on abnormal data, like `sqrt(-4.0)` or
>>>>>> `log(-1.0)`.
>>>>>>>>> Both of those present problems when mixed with valid data
>> because
>>>>> of
>>>>>>> the
>>>>>>>>> stats "poisoning" problem, where the range of valid data is
>>>> usable
>>>>>>> until
>>>>>>>> a
>>>>>>>>> single NaN is mixed in.
>>>>>>>>> 
>>>>>>>>> Another issue is that NaN is error-prone because "regular"
>>>>> comparison
>>>>>>> is
>>>>>>>>> always false:
>>>>>>>>> ```
>>>>>>>>> Math.log(-1.0) >= 2 => FALSE
>>>>>>>>> Math.log(-1.0) < 2 => FALSE
>>>>>>>>> 2 > Math.log(-1.0) => FALSE
>>>>>>>>> ```
>>>>>>>>> 
>>>>>>>>> As a result, Iceberg doesn't trust NaN values as either
>> lower or
>>>>>> upper
>>>>>>>>> bounds because we don't want to go back to the code that
>> produced
>>>>> the
>>>>>>>> value
>>>>>>>>> to see what the comparison order was to determine whether NaN
>>>>> values
>>>>>> go
>>>>>>>>> before or after others.
>>>>>>>>> 
>>>>>>>>> Total order solves the second issue in theory, but regular
>>>>> comparison
>>>>>>> is
>>>>>>>>> prevalent and not obvious to developers. And it also doesn't
>> help
>>>>>> when
>>>>>>>> NaN
>>>>>>>>> is used instead of null. So using total order is not
>> sufficient
>>>> for
>>>>>>> data
>>>>>>>>> skipping.
>>>>>>>>> 
>>>>>>>>> I think the right compromise is to use `min`, `max`, and
>>>>> `nan_count`
>>>>>>> for
>>>>>>>>> data skipping stats (where min and max cannot be NaN) and
>> total
>>>>>>> ordering
>>>>>>>>> for sorting values. That satisfies the data skipping use
>> cases
>>>> and
>>>>>> also
>>>>>>>>> gives us an ordering of unaltered values that we can reason
>>>> about.
>>>>>>>>> 
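
The compromise described above (min and max that are never NaN, plus a `nan_count`) can be pictured with a small sketch. This is only an illustration under stated assumptions: `PageStats`, `compute_stats`, `may_match_greater_than` and `may_match_is_nan` are hypothetical names, not Parquet or Iceberg APIs, and the predicates use regular comparison semantics in which a NaN never satisfies `>`.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class PageStats:
    """Hypothetical per-page statistics: bounds exclude NaN, NaNs are counted."""
    min_value: Optional[float]  # smallest non-NaN value, None if the page has none
    max_value: Optional[float]  # largest non-NaN value, None if the page has none
    nan_count: int

def compute_stats(values: Sequence[float]) -> PageStats:
    non_nan = [v for v in values if not math.isnan(v)]
    return PageStats(
        min_value=min(non_nan) if non_nan else None,
        max_value=max(non_nan) if non_nan else None,
        nan_count=len(values) - len(non_nan),
    )

def may_match_greater_than(stats: PageStats, threshold: float) -> bool:
    """Can any row satisfy `x > threshold`? NaN rows never do."""
    return stats.max_value is not None and stats.max_value > threshold

def may_match_is_nan(stats: PageStats) -> bool:
    """Can any row satisfy `is_nan(x)`?"""
    return stats.nan_count > 0

# A single NaN no longer "poisons" the usable range of the valid data.
mixed = compute_stats([1.0, float("nan"), 5.0])
only_nan = compute_stats([float("nan"), float("nan")])
```

In this sketch a page holding `[1.0, NaN, 5.0]` keeps the bounds 1.0 and 5.0, so a filter such as `x > 10.0` can still skip it, while an all-NaN page has no bounds and can be skipped for any range predicate.
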
>>>>>>>>> Does anyone think that doesn't work?
>>>>>>>>> 
>>>>>>>>> Ryan
>>>>>>>>> 
>>>>>>>>> On Fri, Aug 1, 2025 at 8:57 AM Gang Wu <[email protected]>
>> wrote:
>>>>>>>>> 
>>>>>>>>>> Thanks Jan for your endless effort on this!
>>>>>>>>>> 
>>>>>>>>>> I'm in favor of simplicity and generality. I think we have already debated
>>>>>>>>>> `nan_count` a lot in [1], and [2] reflects those discussions. Therefore
>>>>>>>>>> I am inclined to start a vote for [2] unless there is a significantly
>>>>>>>>>> better proposal.
>>>>>>>>>> 
>>>>>>>>>> I would suggest everyone interested in this discussion to
>>>> attend
>>>>>> the
>>>>>>>>>> scheduled
>>>>>>>>>> sync on Aug 6th (detailed below) to spread the word to the
>>>>> broader
>>>>>>>>>> community.
>>>>>>>>>> If we can get a consensus on [2], I can help start the
>> vote and
>>>>>> move
>>>>>>>>>> forward.
>>>>>>>>>> 
>>>>>>>>>> *Apache Parquet Community Sync, Wednesday, August 6 · 10:00 – 11:00am*
>>>>>>>>>> *Time zone: America/Los_Angeles*
>>>>>>>>>> *Google Meet joining info, video call link:
>>>>>>>>>> https://meet.google.com/bhe-rvan-qjk*
>>>>>>>>>> 
>>>>>>>>>> [1] https://github.com/apache/parquet-format/pull/196
>>>>>>>>>> [2] https://github.com/apache/parquet-format/pull/221
>>>>>>>>>> 
>>>>>>>>>> Best,
>>>>>>>>>> Gang
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Fri, Aug 1, 2025 at 6:16 PM Jan Finis <
>> [email protected]>
>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi Gijs,
>>>>>>>>>>> 
>>>>>>>>>>> Thank you for bringing up concrete points, I'm happy to
>>>> discuss
>>>>>>> them
>>>>>>>> in
>>>>>>>>>>> detail.
>>>>>>>>>>> 
>>>>>>>>>>> NaNs are less common in the SQL world than in the
>> DataFrame
>>>>> world
>>>>>>>> where
>>>>>>>>>>>> NaNs were used for a long time to represent missing
>> values.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> You could transcode between NULL and NaN before reading and writing to
>>>>>>>>>>> Parquet. You basically mention yourself that NaNs were
>> used
>>>> for
>>>>>>>> missing
>>>>>>>>>>> values, i.e., what is commonly a NULL, which wasn't
>>>> available.
>>>>>> So,
>>>>>>>>>>> semantically, transcoding to NULL would even be the sane
>>>> thing
>>>>> to
>>>>>>> do.
>>>>>>>>>> Yes,
>>>>>>>>>>> that will cost you some cycles, but should be a rather
>>>>>> lightweight
>>>>>>>>>>> operation in comparison to most other operations, so I
>> would
>>>>>> argue
>>>>>>>> that
>>>>>>>>>> it
>>>>>>>>>>> won't totally ruin your performance. Similarly, why
>> should
>>>>>> Parquet
>>>>>>>> play
>>>>>>>>>>> along with a "hack" that was done in other frameworks
>> due to
>>>>>>>>> shortcomings
>>>>>>>>>>> of those frameworks? So from a philosophical point of
>> view, I
>>>>>> think
>>>>>>>>>>> supporting NaNs better is the wrong thing to do. Rather,
>> we
>>>>>> should
>>>>>>>> be a
>>>>>>>>>>> forcing function to align others to better behavior, so applying a bit of
>>>>>>>>>>> force might in the long run make people use NULLs also in DataFrames.
>>>>>>>>>>> 
>>>>>>>>>>> Of course, your argument also goes into the direction of
>>>>>>> pragmatism:
>>>>>>>>> If a
>>>>>>>>>>> large part of the data science world uses NaNs to encode
>>>>> missing
>>>>>>>>> values,
>>>>>>>>>>> then maybe Parquet should accept this de-facto standard
>>>> rather
>>>>>> than
>>>>>>>>>>> fighting it. That is indeed a valid point. The weight of
>> it
>>>> is
>>>>>>>>> debatable
>>>>>>>>>>> and my personal conclusion is that it's still not worth
>> it,
>>>> as
>>>>>> you
>>>>>>>> can
>>>>>>>>>>> transcode between NULLs and NaNs, but I do agree with its
>>>>>> validity.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Since the proposal phrases it as a goal to work
>> "regardless
>>>> of
>>>>>> how
>>>>>>>> they
>>>>>>>>>>>> order NaN w.r.t. other values" this statement feels
>>>>>> out-of-place
>>>>>>> to
>>>>>>>>> me.
>>>>>>>>>>>> Most hardware and most people don't care about total
>>>> ordering
>>>>>> and
>>>>>>>>>> needing
>>>>>>>>>>>> to take it into account while filtering using
>> statistics
>>>>> seems
>>>>>>> like
>>>>>>>>>>>> preferring the special case instead of the common case. Almost no one
>>>>>>>>>>>> filters for specific NaN value bit-patterns. SQL
>> engines
>>>> that
>>>>>>> don't
>>>>>>>>>> have
>>>>>>>>>>>> IEEE total ordering as their default ordering for
>> floats
>>>> will
>>>>>>> also
>>>>>>>>> need
>>>>>>>>>>> to
>>>>>>>>>>>> do more special handling for this.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> I disagree with the conclusion this statement draws. The
>>>>> current
>>>>>>>>>> behavior,
>>>>>>>>>>> and nan_counts without total ordering, pose a real
>> problem
>>>>> here,
>>>>>>> even
>>>>>>>>> for
>>>>>>>>>>> engines that don't care about bit patterns. I do agree
>> that
>>>>> most
>>>>>>>>> database
>>>>>>>>>>> engines, including the one I'm working on, do not care
>> about
>>>>> bit
>>>>>>>>> patterns
>>>>>>>>>>> and/or sign bits. However, how can our database engine
>> know
>>>>>> whether
>>>>>>>> the
>>>>>>>>>>> writer of a Parquet file saw it the same way? It can't.
>>>>>> Therefore,
>>>>>>> it
>>>>>>>>>>> cannot know whether a writer, for example, ordered NaNs
>>>> before
>>>>> or
>>>>>>>> after
>>>>>>>>>> all
>>>>>>>>>>> other numbers, or maybe ordered them by sign bit. So, if
>> our
>>>>>>> database
>>>>>>>>>>> engine now sees a float column in sorting columns, it
>> cannot
>>>>>> apply
>>>>>>>> any
>>>>>>>>>>> optimization without a lot of special casing, as it
>> doesn't
>>>>> know
>>>>>>>>> whether
>>>>>>>>>>> NaNs will be before all other values, after all other
>> values,
>>>>> or
>>>>>>>> maybe
>>>>>>>>>>> both, depending on sign bit. It could apply contrived
>> logic
>>>>> that
>>>>>>>> tries
>>>>>>>>> to
>>>>>>>>>>> infer where NaNs were placed from the NaN counts of the
>> first
>>>>> and
>>>>>>>> last
>>>>>>>>>>> page, but doing so will be a lot of ugly code that also
>> feels
>>>>> to
>>>>>> be
>>>>>>>> in
>>>>>>>>>> the
>>>>>>>>>>> wrong place. I.e., I don't want to need to load pages or
>> the
>>>>> page
>>>>>>>>> index,
>>>>>>>>>>> just to reason about a sort order.
>>>>>>>>>>> 
>>>>>>>>>>> SQL engines that don't have
>>>>>>>>>>>> IEEE total ordering as their default ordering for
>> floats
>>>> will
>>>>>>> also
>>>>>>>>> need
>>>>>>>>>>> to
>>>>>>>>>>>> do more special handling for this.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> This code, which I would indeed need to write for our
>> engine,
>>>>> is
>>>>>>>>>> comparably
>>>>>>>>>>> trivial. Simply choose the largest possible bit pattern
>> as
>>>>>>> comparison
>>>>>>>>> for
>>>>>>>>>>> upper bounds filtering for NaN, and the smallest
>> possible bit
>>>>>>> pattern
>>>>>>>>> for
>>>>>>>>>>> lower bounds. It's not more than a few lines of code that
>>>> check
>>>>>>>>> whether a
>>>>>>>>>>> filter is NaN and then replace its value with the
>>>>> highest/lowest
>>>>>>> NaN
>>>>>>>>> bit
>>>>>>>>>>> pattern. It is similarly trivial to the special casing I
>> need
>>>>> to
>>>>>> do
>>>>>>>>> with
>>>>>>>>>>> nan_counts, and it is way more trivial than the extra
>> code I
>>>>>> would
>>>>>>>> need
>>>>>>>>>> to
>>>>>>>>>>> write for sorting columns, as depicted above.
>>>>>>>>>>> 
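
To picture how little NaN special casing an engine needs when min/max are stored in total order, here is a minimal Python sketch. It is one possible formulation, not necessarily the exact literal replacement described above: it relies only on the fact that in IEEE 754 totalOrder every NaN sorts either above +inf or below -inf. The function names and the idea of passing the page's stored min/max as plain floats are assumptions made for this example.

```python
import math
import struct

SIGN_BIT = 1 << 63
ALL_ONES = (1 << 64) - 1

def total_order_key(x: float) -> int:
    """Unsigned key whose integer order follows IEEE 754 totalOrder."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return bits ^ ALL_ONES if bits & SIGN_BIT else bits | SIGN_BIT

POS_INF_KEY = total_order_key(math.inf)
NEG_INF_KEY = total_order_key(-math.inf)

def page_may_contain_nan(stat_min: float, stat_max: float) -> bool:
    """True if the total-order bounds leave room for a NaN of either sign."""
    return (total_order_key(stat_max) > POS_INF_KEY
            or total_order_key(stat_min) < NEG_INF_KEY)

def page_may_contain_non_nan(stat_min: float, stat_max: float) -> bool:
    """True if the bounds overlap the range [-inf, +inf] of ordinary values."""
    return (total_order_key(stat_min) <= POS_INF_KEY
            and total_order_key(stat_max) >= NEG_INF_KEY)

# NaNs with an explicit sign bit, built from their bit patterns.
POS_NAN = struct.unpack("<d", struct.pack("<Q", 0x7FF8000000000000))[0]
NEG_NAN = struct.unpack("<d", struct.pack("<Q", 0xFFF8000000000000))[0]
```

An engine that treats all NaNs alike can answer "is there a NaN on this page?" from the bounds alone, without knowing which bit patterns the writer produced.
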
>>>>>>>>>>> From a Polars perspective, having a `nan_count` and
>> defining
>>>>> what
>>>>>>>>>>>> happens to the `min` and `max` statistics when a page
>>>>> contains
>>>>>>> only
>>>>>>>>>> NaNs
>>>>>>>>>>> is
>>>>>>>>>>>> enough to allow for all predicate filtering. I think,
>> but
>>>>>> correct
>>>>>>>> me
>>>>>>>>>> if I
>>>>>>>>>>>> am wrong, this is also enough for all SQL engines that
>>>> don't
>>>>>> use
>>>>>>>>> total
>>>>>>>>>>>> ordering.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> It's not fully enough, as depicted above. Sorting columns
>>>> would
>>>>>>> still
>>>>>>>>> not
>>>>>>>>>>> work properly.
>>>>>>>>>>> 
>>>>>>>>>>> As for ways forward, I propose merging the `nan_count`
>> and
>>>>> `sort
>>>>>>>>>> ordering`
>>>>>>>>>>>> proposals into one to make one proposal
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Note that the initial reason for proposing IEEE total order was that people
>>>>>>>>>>> in the discussion threads found nan_counts to be too complex and too much
>>>>>>>>>>> of an undeserving special case (re-read the discussion in the initial PR
>>>>>>>>>>> <https://github.com/apache/parquet-format/pull/196> to see the rationales).
>>>>>>>>>>> So merging both together would go totally against the
>> spirit
>>>> of
>>>>>> why
>>>>>>>>> IEEE
>>>>>>>>>>> total order was proposed. While it has further upsides,
>> the
>>>>> main
>>>>>>>> reason
>>>>>>>>>> was
>>>>>>>>>>> indeed to *not have* nan_counts. If now the proposal
>> would
>>>> even
>>>>>> go
>>>>>>> to
>>>>>>>>>>> positive and negative nan counts (i.e., even more
>>>> complexity),
>>>>>> this
>>>>>>>>> would
>>>>>>>>>>> go 180 degrees into the opposite direction of why people
>>>> wanted
>>>>>>> total
>>>>>>>>>> order
>>>>>>>>>>> in the first place.
>>>>>>>>>>> 
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Jan
>>>>>>>>>>> 
>>>>>>>>>>> On Thu, Jul 31, 2025 at 11:23 PM Gijs Burghoorn <[email protected]> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hello Jan and others,
>>>>>>>>>>>> 
>>>>>>>>>>>> First, let me preface by saying I am quite new here.
>> So I
>>>>>>> apologize
>>>>>>>>> if
>>>>>>>>>>>> there is some other better way to bring up these
>> concerns.
>>>> I
>>>>>>>>> understand
>>>>>>>>>>> it
>>>>>>>>>>>> is very annoying to come in at the 11th hour and start
>>>>> bringing
>>>>>>> up
>>>>>>>> a
>>>>>>>>>>> bunch
>>>>>>>>>>>> of concerns, but I would also like this to be done
>> right. A
>>>>>>>> colleague
>>>>>>>>>> of
>>>>>>>>>>>> mine brought up some concerns and alternative
>> approaches in
>>>>> the
>>>>>>>>> GitHub
>>>>>>>>>>>> thread; I will file some of the concerns here as a
>>>> response.
>>>>>>>>>>>> 
>>>>>>>>>>>>> Treating NaNs so specially is giving them attention
>> they
>>>>>> don't
>>>>>>>>>> deserve.
>>>>>>>>>>>> Most data sets do not contain NaNs. If a use case
>> really
>>>>>> requires
>>>>>>>>> them
>>>>>>>>>>> and
>>>>>>>>>>>> needs filtering to ignore them, they can store NULL
>>>> instead,
>>>>> or
>>>>>>>>> encode
>>>>>>>>>>> them
>>>>>>>>>>>> differently. I would prefer the average case over the
>>>> special
>>>>>>> case
>>>>>>>>>> here.
>>>>>>>>>>>> 
>>>>>>>>>>>> NaNs are less common in the SQL world than in the
>> DataFrame
>>>>>> world
>>>>>>>>> where
>>>>>>>>>>>> NaNs were used for a long time to represent missing
>> values.
>>>>>> They
>>>>>>>>> still
>>>>>>>>>>>> exist with different canonical representations and
>>>> different
>>>>>> sign
>>>>>>>>>> bits. I
>>>>>>>>>>>> agree it might not be correct semantically, but sadly
>> that
>>>> is
>>>>>> the
>>>>>>>>> world
>>>>>>>>>>> we
>>>>>>>>>>>> deal with. NumPy and Numba do not have missing data
>>>>>>> functionality,
>>>>>>>>>> people
>>>>>>>>>>>> use NaNs there, and people definitely use that in their
>>>>>>> analytical
>>>>>>>>>>>> dataflows. Another point that was brought up in the GH
>>>>>> discussion
>>>>>>>> was
>>>>>>>>>>> "what
>>>>>>>>>>>> about infinity? You could argue that having infinity in
>>>>>>> statistics
>>>>>>>> is
>>>>>>>>>>>> similarly unuseful as it's too wide of a bound". I
>> would
>>>>> argue
>>>>>>> that
>>>>>>>>>>>> infinity is very different as there is no discussion on
>>>> what
>>>>>> the
>>>>>>>>>> ordering
>>>>>>>>>>>> or pattern of infinity is. Everyone agrees that
>> `min(1.0,
>>>>> inf,
>>>>>>>> -inf)
>>>>>>>>> ==
>>>>>>>>>>>> -inf` and each infinity only has a single bit pattern.
>>>>>>>>>>>> 
>>>>>>>>>>>>> It gives a defined order to every bit pattern and
>> thus
>>>>>> yields a
>>>>>>>>> total
>>>>>>>>>>>> order, mathematically speaking, which has value by
>> itself.
>>>>> With
>>>>>>> NaN
>>>>>>>>>>> counts,
>>>>>>>>>>>> it was still undefined how different bit patterns of
>> NaNs
>>>>> were
>>>>>>>>> supposed
>>>>>>>>>>> to
>>>>>>>>>>>> be ordered, whether NaN was allowed to have a sign bit,
>>>> etc.,
>>>>>>>> risking
>>>>>>>>>>> that
>>>>>>>>>>>> different engines could come to different results while
>>>>>> filtering
>>>>>>>> or
>>>>>>>>>>>> sorting values within a file.
>>>>>>>>>>>> 
>>>>>>>>>>>> Since the proposal phrases it as a goal to work
>> "regardless
>>>>> of
>>>>>>> how
>>>>>>>>> they
>>>>>>>>>>>> order NaN w.r.t. other values" this statement feels
>>>>>> out-of-place
>>>>>>> to
>>>>>>>>> me.
>>>>>>>>>>>> Most hardware and most people don't care about total
>>>> ordering
>>>>>> and
>>>>>>>>>> needing
>>>>>>>>>>>> to take it into account while filtering using
>> statistics
>>>>> seems
>>>>>>> like
>>>>>>>>>>>> preferring the special case instead of the common case. Almost no one
>>>>>>>>>>>> filters for specific NaN value bit-patterns. SQL
>> engines
>>>> that
>>>>>>> don't
>>>>>>>>>> have
>>>>>>>>>>>> IEEE total ordering as their default ordering for
>> floats
>>>> will
>>>>>>> also
>>>>>>>>> need
>>>>>>>>>>> to
>>>>>>>>>>>> do more special handling for this.
>>>>>>>>>>>> 
>>>>>>>>>>>> I also agree with my colleague that doing an approach
>> that
>>>> is
>>>>>> 50%
>>>>>>>> of
>>>>>>>>>> the
>>>>>>>>>>>> way there will make the barrier to improving it to
>> what it
>>>>>>> actually
>>>>>>>>>>> should
>>>>>>>>>>>> be later on much higher.
>>>>>>>>>>>> 
>>>>>>>>>>>> As for ways forward, I propose merging the `nan_count`
>> and
>>>>>> `sort
>>>>>>>>>>> ordering`
>>>>>>>>>>>> proposals into one to make one proposal, as they are
>> linked
>>>>>>>> together,
>>>>>>>>>> and
>>>>>>>>>>>> moving forward with one without knowing what will
>> happen to
>>>>> the
>>>>>>>> other
>>>>>>>>>>> seems
>>>>>>>>>>>> unwise. From a Polars perspective, having a
>> `nan_count` and
>>>>>>>> defining
>>>>>>>>>> what
>>>>>>>>>>>> happens to the `min` and `max` statistics when a page
>>>>> contains
>>>>>>> only
>>>>>>>>>> NaNs
>>>>>>>>>>> is
>>>>>>>>>>>> enough to allow for all predicate filtering. I think,
>> but
>>>>>> correct
>>>>>>>> me
>>>>>>>>>> if I
>>>>>>>>>>>> am wrong, this is also enough for all SQL engines that
>>>> don't
>>>>>> use
>>>>>>>>> total
>>>>>>>>>>>> ordering. But if you want to be impartial to the
>> engine's
>>>>>>>>>> floating-point
>>>>>>>>>>>> ordering and allow engines with total ordering to do
>>>>> inequality
>>>>>>>>> filters
>>>>>>>>>>>> when `nan_count > 0` you would need a
>> `positive_nan_count`
>>>>> and
>>>>>> a
>>>>>>>>>>>> `negative_nan_count`. I understand the downside with
>> Thrift
>>>>>>>>> complexity,
>>>>>>>>>>> but
>>>>>>>>>>>> introducing another sort order is also adding
>> complexity
>>>> just
>>>>>> in
>>>>>>> a
>>>>>>>>>>>> different place.
>>>>>>>>>>>> 
>>>>>>>>>>>> I would really like to see this move forward, so I hope
>>>> these
>>>>>>>>> concerns
>>>>>>>>>>> help
>>>>>>>>>>>> move it forward towards a solution that works for
>> everyone.
>>>>>>>>>>>> 
>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>> Gijs
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Thu, Jul 31, 2025 at 7:46 PM Andrew Lamb <
>>>>>>>> [email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> I would also be in favor of starting a vote
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Thu, Jul 31, 2025 at 11:23 AM Jan Finis <
>>>>>> [email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> As the author of both the IEEE754 total order PR
>>>>>>>>>>>>>> <https://github.com/apache/parquet-format/pull/221> and the earlier PR
>>>>>>>>>>>>>> that basically proposed `nan_count`
>>>>>>>>>>>>>> <https://github.com/apache/parquet-format/pull/196>, my current vote
>>>>>>>>>>>>>> would be for IEEE754 total order.
>>>>>>>>>>>>>> Consequently, I would like to request a formal vote for the PR
>>>>>>>>>>>>>> introducing IEEE754 total order
>>>>>>>>>>>>>> (https://github.com/apache/parquet-format/pull/221), if that is
>>>>>>>>>>>>>> possible.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> My Rationales:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>  - It's conceptually simpler. It's easier to explain. It's based on
>>>>>>>>>>>>>>  an IEEE-standardized order predicate.
>>>>>>>>>>>>>>  - There are already multiple implementations showing feasibility.
>>>>>>>>>>>>>>  This will likely make the adoption quicker.
>>>>>>>>>>>>>>  - It gives a defined order to every bit pattern and thus yields a
>>>>>>>>>>>>>>  total order, mathematically speaking, which has value by itself.
>>>>>>>>>>>>>>  With NaN counts, it was still undefined how different bit patterns
>>>>>>>>>>>>>>  of NaNs were supposed to be ordered, whether NaN was allowed to have
>>>>>>>>>>>>>>  a sign bit, etc., risking that different engines could come to
>>>>>>>>>>>>>>  different results while filtering or sorting values within a file.
>>>>>>>>>>>>>>  - It also solves sort order completely. With nan_counts only, it is
>>>>>>>>>>>>>>  still undefined whether nans should be sorted before or after all
>>>>>>>>>>>>>>  values (or both, depending on sign bit), so any file including NaNs
>>>>>>>>>>>>>>  could not really leverage sort order without being ambiguous.
>>>>>>>>>>>>>>  - It's less complex in thrift. Having fields that only apply to a
>>>>>>>>>>>>>>  handful of data types is somehow weird. If every type did this, we
>>>>>>>>>>>>>>  would have a plethora of non-generic fields in thrift.
>>>>>>>>>>>>>>  - Treating NaNs so specially is giving them attention they don't
>>>>>>>>>>>>>>  deserve. Most data sets do not contain NaNs. If a use case really
>>>>>>>>>>>>>>  requires them and needs filtering to ignore them, they can store
>>>>>>>>>>>>>>  NULL instead, or encode them differently. I would prefer the average
>>>>>>>>>>>>>>  case over the special case here.
>>>>>>>>>>>>>>  - The majority of the people discussing this so far seem to favor
>>>>>>>>>>>>>>  total order.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> Jan
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Sat, Jul 26, 2025 at 5:38 PM Gang Wu <[email protected]> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> As this discussion has been open for more than two years, I’d like to
>>>>>>>>>>>>>>> bump up this thread again to update the progress and collect feedback.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> *Background*
>>>>>>>>>>>>>>> • Today Parquet’s min/max stats and page index omit NaNs entirely.
>>>>>>>>>>>>>>> • Engines can’t safely prune floating-point values because they know
>>>>>>>>>>>>>>> nothing about NaNs.
>>>>>>>>>>>>>>> • Column index is disabled if any page contains only NaNs.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> There are two active proposals as below:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> *Proposal A - IEEE754TotalOrder* (from the PR [1])
>>>>>>>>>>>>>>> • Define a new ColumnOrder to include +0, –0 and all NaN
>>>>>>>>>>>>>>> bit‐patterns.
>>>>>>>>>>>>>>> • Stats and column index store NaNs if they appear.
>>>>>>>>>>>>>>> • Three PoC impls are ready: arrow-rs [2], duckdb [3] and
>>>>>>>>>>>>>>> parquet-java [4].
>>>>>>>>>>>>>>> • For more context of this approach, please refer to discussion in
>>>>>>>>>>>>>>> [5].
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> *Proposal B - add nan_count* (from a comment [6] to [1])
>>>>>>>>>>>>>>> • Add `nan_count` to stats and a `nan_counts` list to column index.
>>>>>>>>>>>>>>> • For all‐NaNs cases, write NaN to min/max and use nan_count to
>>>>>>>>>>>>>>> distinguish.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Both solutions have pros and cons but are way better than the
>>>>>>>>>>>>>>> status quo today.
>>>>>>>>>>>>>>> Please share your thoughts on the two proposals above, or maybe
>>>>>>>>>>>>>>> come up with better alternatives. We need consensus on one proposal
>>>>>>>>>>>>>>> and move forward.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> [1] https://github.com/apache/parquet-format/pull/221
>>>>>>>>>>>>>>> [2] https://github.com/apache/arrow-rs/pull/7408
>>>>>>>>>>>>>>> [3] https://github.com/duckdb/duckdb/compare/main...Mytherin:duckdb:ieeeorder
>>>>>>>>>>>>>>> [4] https://github.com/apache/parquet-java/pull/3191
>>>>>>>>>>>>>>> [5] https://github.com/apache/parquet-format/pull/196
>>>>>>>>>>>>>>> [6] https://github.com/apache/parquet-format/pull/221#issuecomment-2931376077
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Gang
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Tue, Mar 28, 2023 at 4:22 PM Jan Finis <[email protected]> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Dear contributors,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> My PR has now gathered comments for a week and the gist of all
>>>>>>>>>>>>>>>> open issues is the question of how to encode pages/column chunks
>>>>>>>>>>>>>>>> that contain only NaNs. There are different suggestions and I
>>>>>>>>>>>>>>>> don't see one common favorite yet.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I have outlined three alternatives of how we can handle these and
>>>>>>>>>>>>>>>> I want us to reach a conclusion here, so I can update my PR
>>>>>>>>>>>>>>>> accordingly and move on with it. As this is my first contribution
>>>>>>>>>>>>>>>> to parquet, I don't know the decision processes here. Do we vote?
>>>>>>>>>>>>>>>> Is there a single or group of decision makers? *Please let me know
>>>>>>>>>>>>>>>> how to come to a conclusion here; what are the next steps?*
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> For reference, here are the three alternatives I pointed out. You
>>>>>>>>>>>>>>>> can find a detailed description of their PROs and CONs in my
>>>>>>>>>>>>>>>> comment:
>>>>>>>>>>>>>>>> https://github.com/apache/parquet-format/pull/196#issuecomment-1486416762
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 1. My initial proposal, i.e., encoding only-NaN pages by
>>>>>>>>>>>>>>>> min=max=NaN.
>>>>>>>>>>>>>>>> 2. Adding `num_values` to the ColumnIndex, to make it symmetric
>>>>>>>>>>>>>>>> with Statistics in pages & `ColumnMetaData` and to enable the
>>>>>>>>>>>>>>>> computation `num_values - null_count - nan_count == 0`
>>>>>>>>>>>>>>>> 3. Adding a `nan_pages` bool list to the column index, which
>>>>>>>>>>>>>>>> indicates whether a page contains only NaNs
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Cheers
>>>>>>>>>>>>>>>> Jan Finis
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>> 
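
For readers following the quoted thread, here is a rough C++ sketch of the
two building blocks discussed above: a comparison that follows the IEEE 754
total order (Proposal A), and the count-based check
`num_values - null_count - nan_count == 0` from alternative 2. It is not
taken from any of the linked PRs. All names (TotalOrderKey, TotalOrderLess,
PageCounts, HasNoNonNaNValues, HasOnlyNaNs) are hypothetical, and the
sketch assumes 64-bit IEEE 754 doubles.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: map a double to a signed integer key whose ordinary
// integer ordering matches the IEEE 754 totalOrder predicate
// (-NaN < -inf < ... < -0 < +0 < ... < +inf < +NaN).
inline int64_t TotalOrderKey(double value) {
  static_assert(sizeof(double) == sizeof(int64_t), "expects 64-bit doubles");
  int64_t bits;
  std::memcpy(&bits, &value, sizeof(bits));
  // For inputs with the sign bit set, flip the 63 non-sign bits so that
  // larger magnitudes sort lower; other inputs are left unchanged.
  const uint64_t mask = bits < 0 ? 0x7FFFFFFFFFFFFFFFULL : 0ULL;
  return bits ^ static_cast<int64_t>(mask);
}

// Strict "less than" under the total order sketched above.
inline bool TotalOrderLess(double a, double b) {
  return TotalOrderKey(a) < TotalOrderKey(b);
}

// Hypothetical per-page counters; num_values is assumed to include nulls.
struct PageCounts {
  int64_t num_values;
  int64_t null_count;
  int64_t nan_count;
};

// The computation quoted in alternative 2. Note that an all-null page
// satisfies it too, because nan_count is then 0.
inline bool HasNoNonNaNValues(const PageCounts& c) {
  assert(c.null_count >= 0 && c.nan_count >= 0);
  assert(c.num_values >= c.null_count + c.nan_count);
  return c.num_values - c.null_count - c.nan_count == 0;
}

// True when every non-null value is NaN and at least one NaN is present.
inline bool HasOnlyNaNs(const PageCounts& c) {
  return c.nan_count > 0 && HasNoNonNaNValues(c);
}
```

The integer-key trick is one common way to get a total order without
branching on NaN or zero; an actual POC may well choose a different
formulation.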

