Hello Gang and others, I am willing to implement the C++ POC.
> On Mar 14, 2026, at 23:56, Gang Wu <[email protected]> wrote:
>
> Update:
>
> Java POC is ready for IEEE 754 column order combined with nan_count:
> https://github.com/apache/parquet-java/pull/3393
>
> The spec PR has been updated earlier to address all comments:
> https://github.com/apache/parquet-format/pull/514
>
> Really appreciate any review and feedback!
>
> Best,
> Gang
>
> On Wed, Feb 11, 2026 at 4:24 PM Gang Wu <[email protected]> wrote:
>
>> Hello all,
>>
>> I'm reaching out to help drive this long-running discussion (nearly
>> three years now) towards a final resolution. With Jan's authorization,
>> and my sincere thanks for his sustained effort, I want to help push
>> this issue to the finish line.
>>
>> To recap, we have two primary proposals on how to handle NaNs in
>> statistics and column indexes:
>>
>> * IEEE 754 Total Order [1]: Proposes adding a new column order
>> IEEE754TotalOrder for FLOAT/DOUBLE/FLOAT16. This provides a defined
>> ordering for every float bit pattern, including NaNs and -0/+0,
>> allowing writers to include NaNs in min/max and removing ambiguity for
>> only-NaN pages.
>> * Combined Approach [2]: Proposes adopting the IEEE 754 total order
>> alongside explicit nan_count(s) fields. This approach mandates the
>> nan_count(s) when the new order is used and clarifies how to handle
>> edge cases from legacy writers.
>>
>> Based on the recent comments, it appears the combined approach [2] is
>> gaining consensus, although the IEEE 754 total order [1] still has
>> strong advocates.
>>
>> I agree with the sentiment that technical direction should be decided by
>> consensus, not a vote. To that end, I'd like to solicit further
>> feedback specifically on the combined approach [2] to see if we can
>> achieve the necessary consensus to move forward now.
>>
>> I recall that the total order proposal [1] already has three PoC
>> implementations.
For the combined approach [2], I can draft a PoC in >> parquet-java, but to meet the two-implementation requirement, we would >> need one more contributor to step up. >> >> [1] https://github.com/apache/parquet-format/pull/221 >> [2] https://github.com/apache/parquet-format/pull/514 >> >> Best, >> Gang >> >> >> On Sat, Aug 16, 2025 at 1:59 AM Gijs Burghoorn <[email protected]> >> wrote: >>> >>> Hello Jan, >>> >>> Thank you for pushing this through. Apart from some smaller nits, we also >>> really like the current proposal. >>> >>> Thanks, >>> Gijs >>> >>> On Fri, Aug 15, 2025 at 3:33 PM Andrew Lamb <[email protected]> >> wrote: >>> >>>> I have started organizing a project[1] in arrow-rs 's Parquet reader >> to try >>>> and implement this proposal. >>>> >>>> Hopefully that can be 1 / 2 open source implementations needed. >>>> >>>> Thanks again for helping drive this along, >>>> Andrew >>>> >>>> [1] https://github.com/apache/arrow-rs/issues/8156 >>>> >>>> On Wed, Aug 13, 2025 at 5:39 AM Jan Finis <[email protected]> wrote: >>>> >>>>> I have now tagged >>>>> < >>>> >> https://github.com/apache/parquet-format/pull/514#issuecomment-3182978173 >>>>>> >>>>> the people that argued for total order in the initial PR. Let's see >> their >>>>> response. >>>>> >>>>> If I understand the adoption process correctly, the next hurdle to >>>> getting >>>>> this adopted is two open >>>>> source (!) implementations proving its feasibility. We already had >> that >>>> for >>>>> IEEE total order. If we >>>>> prefer the solution with nan counts, we'll need it there as well. I >>>> myself >>>>> work on a proprietary >>>>> implementation, so I'm counting on others here :). Be prepared >> though, >>>> this >>>>> will likely take months >>>>> unless the interest in this topic has risen to a point where people >> are >>>>> eager to jump on the implementation >>>>> right away. 
>>>>>
>>>>> So, I guess it will take some months of soaking time before any formal
>>>>> vote can be done (given that we reach consensus that this is what we
>>>>> want and we find people for the implementations).
>>>>>
>>>>> Cheers,
>>>>> Jan
>>>>>
>>>>> On Wed, Aug 13, 2025 at 01:18, Ryan Blue <[email protected]> wrote:
>>>>>
>>>>>> Thanks, Jan. I also went through the combined proposal and it looks
>>>>>> mostly good to me.
>>>>>>
>>>>>>> First of all, to make it quick: Yes, the solution of having nan_counts
>>>>>>> *and* total order, which was brought up multiple times, does work and
>>>>>>> solves more cases than just either of both.
>>>>>>
>>>>>> Great, then we have a solution for both filtering use cases and for
>>>>>> moving ahead with total order. And thanks to Andrew for suggesting this
>>>>>> as well on the second PR. I think this also looks like something that
>>>>>> Orson is okay with, given his comments on the latest PR.
>>>>>>
>>>>>> Is there anyone against the combined approach? I don't see a big
>>>>>> downside for anyone. It is compatible with previous stats rules, has a
>>>>>> NaN count, and allows using either type-specific order or total order.
>>>>>>
>>>>>> Assuming that this satisfies the big objections, I think we should wait
>>>>>> for a few days to make sure everyone has time to check out the new PR
>>>>>> and then vote to adopt it.
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>> On Mon, Aug 11, 2025 at 6:03 AM Andrew Lamb <[email protected]> wrote:
>>>>>>
>>>>>>> Thank you Jan -- I read through the new combined proposal, and I
>>>>>>> thought it looks good and addresses the feedback so far.
I left some small >> style >>>>>>> suggestions, but nothing that is required from my perspective >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Sat, Aug 9, 2025 at 9:07 AM Jan Finis <[email protected]> >> wrote: >>>>>>> >>>>>>>> Hey Ryan, >>>>>>>> >>>>>>>> Thanks for chiming in. First of all, to make it quick: Yes, the >>>>>> solution >>>>>>> of >>>>>>>> having nan_counts *and* total order, which was brought up >> multiple >>>>>> times, >>>>>>>> does work and solves more cases than just either of both. >>>>>>>> >>>>>>>> I strongly prefer continuing to discuss the merits of these >>>>> approaches >>>>>>>>> rather than trying to decide with a vote. >>>>>>>> >>>>>>>> >>>>>>>> In theory, I agree that it isn't good to silence a discussion >> by >>>> just >>>>>>>> voting for one possible solution and technical issues should be >>>>>>> discussed. >>>>>>>> However, please note that we have been circling on this for >> over >>>> two >>>>>>> years >>>>>>>> now, including an extended discussion that brought up all >> arguments >>>>>>>> multiple times. This is in stark contrast to the >>>>>>>> speed with which you guys work on the Iceberg spec, for >> example. >>>>> There, >>>>>>> you >>>>>>>> also do not discuss the merits of various solutions for >> multiple >>>>> years. >>>>>>> You >>>>>>>> just pick one and merge it after a *reasonable* time of >> discussion. >>>>>>>> If you had the speed we currently have here, nothing would get >>>> done. >>>>>>> Thus, >>>>>>>> I see this as a clear case of *"the perfect is the enemy of the >>>>> good"*. >>>>>>>> Yes, we can continue looking for the perfect solution, >>>>>>>> but that will likely lead to keeping us at the status quo, >> which is >>>>> the >>>>>>>> worst of them all. >>>>>>>> >>>>>>>> That being said, I'm also happy to create a PR which does both >>>> total >>>>>>> order >>>>>>>> and NaN counts; after all, I just want the issue solved and all >>>> these >>>>>>>> solutions are better than the status quo. 
>>>>>>>>
>>>>>>>> *As this has now been suggested by at least three people, I guess it's
>>>>>>>> worth doing, so here you go:
>>>>>>>> https://github.com/apache/parquet-format/pull/514*
>>>>>>>>
>>>>>>>> With this, we should have PRs covering most of the solution space.
>>>>>>>> (I'm refusing to create a PR with negative and positive nan_counts;
>>>>>>>> nan_counts + total order has to suffice; the complexity madness has to
>>>>>>>> stop somewhere.)
>>>>>>>> I still believe that there were a number of people who already found
>>>>>>>> nan_counts too complex and therefore wanted IEEE total order, and
>>>>>>>> these people may not like taking on extra complexity, but let's see,
>>>>>>>> maybe some have also changed their opinion in the meantime.
>>>>>>>>
>>>>>>>> *Given all this, we can also first do an informal vote where everyone
>>>>>>>> can vote for which of the three would be their favorite. Maybe a clear
>>>>>>>> favorite will emerge and then we can vote on this one.*
>>>>>>>>
>>>>>>>> But of course, we can also take some weeks to discuss the three
>>>>>>>> solutions, now that we have PRs for all of them. I just hope this
>>>>>>>> won't make us continue for another 2 years, or end in an infinite
>>>>>>>> stalemate where each solution is vetoed by a PMC member.
>>>>>>>> (Sorry for becoming a bit cynical here; I have just spent way too much
>>>>>>>> time of my life with double statistics at this point ;) ...)
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Jan
>>>>>>>>
>>>>>>>> On Fri, Aug 8, 2025 at 23:38, Ryan Blue <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Regarding the process for this, I strongly prefer continuing to
>>>>>>>>> discuss the merits of these approaches rather than trying to decide
>>>>>>>>> with a vote.
>>>>>> I >>>>>>>>> don't think it is a good practice to use a vote to decide on >> a >>>>>>> technical >>>>>>>>> direction. There are very few situations that warrant it and >> I >>>>> don't >>>>>>>> think >>>>>>>>> that this is one of them. While this issue has been open for >> a >>>> long >>>>>>> time, >>>>>>>>> that appears to be the result of it not being anyone's top >>>> priority >>>>>>>> rather >>>>>>>>> than indecision. >>>>>>>>> >>>>>>>>> For the technical merits of these approaches, I think that >> we can >>>>>> find >>>>>>> a >>>>>>>>> middle ground. I agree with Jan that when working with sorted >>>>> values, >>>>>>> we >>>>>>>>> need to know how NaN values were handled and that requires >> using >>>> a >>>>>>>>> well-defined order that includes NaN and its variations >> (because >>>> we >>>>>>>> should >>>>>>>>> not normalize). Using NaN count is not sufficient for >> ordering >>>>> rows. >>>>>>>>> >>>>>>>>> Gijs also brings up good points about how NaN values show up >> in >>>>>> actual >>>>>>>>> datasets: not just when used in place of null, but also as >> the >>>>> result >>>>>>> of >>>>>>>>> normal calculations on abnormal data, like `sqrt(-4.0)` or >>>>>> `log(-1.0)`. >>>>>>>>> Both of those present problems when mixed with valid data >> because >>>>> of >>>>>>> the >>>>>>>>> stats "poisoning" problem, where the range of valid data is >>>> usable >>>>>>> until >>>>>>>> a >>>>>>>>> single NaN is mixed in. 
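
To make the stats "poisoning" problem described above concrete, here is a minimal Python sketch. It is not taken from any Parquet implementation: `naive_min` and `page_stats` are hypothetical helpers, and the NaN is created with `float("nan")` as a stand-in for the result of a calculation like `sqrt(-4.0)`. The sketch shows that a min computed with regular `<` depends on where the NaN sits, and one possible way around it, in the spirit of the nan_count proposals in this thread: keep NaNs out of min/max and count them separately.

```python
import math

nan = float("nan")  # stand-in for a NaN produced by a calculation on abnormal data


def naive_min(values):
    """Min computed with regular `<`; every comparison involving NaN is False."""
    result = values[0]
    for v in values[1:]:
        if v < result:
            result = v
    return result


def page_stats(values):
    """Hypothetical page statistics: min/max over non-NaN values plus a NaN count."""
    non_nan = [v for v in values if not math.isnan(v)]
    nan_count = len(values) - len(non_nan)
    if not non_nan:
        # Only-NaN page: there is no usable min/max, only the count.
        return None, None, nan_count
    return min(non_nan), max(non_nan), nan_count


# The naive result depends on the position of the NaN in the page.
print(naive_min([nan, 1.0, 2.0]))   # NaN "sticks" as the minimum
print(naive_min([1.0, nan, 2.0]))   # NaN is silently ignored

# Separating the NaN count keeps the range of valid data usable.
print(page_stats([1.0, nan, 2.0]))
print(page_stats([nan, nan]))
```

How an only-NaN page should be represented in min/max is exactly one of the open questions in this thread; returning `None` here is just a placeholder for "no usable bounds".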
>>>>>>>>> >>>>>>>>> Another issue is that NaN is error-prone because "regular" >>>>> comparison >>>>>>> is >>>>>>>>> always false: >>>>>>>>> ``` >>>>>>>>> Math.log(-1.0) >= 2 => FALSE >>>>>>>>> Math.log(-1.0) < 2 => FALSE >>>>>>>>> 2 > Math.log(-1.0) => FALSE >>>>>>>>> ``` >>>>>>>>> >>>>>>>>> As a result, Iceberg doesn't trust NaN values as either >> lower or >>>>>> upper >>>>>>>>> bounds because we don't want to go back to the code that >> produced >>>>> the >>>>>>>> value >>>>>>>>> to see what the comparison order was to determine whether NaN >>>>> values >>>>>> go >>>>>>>>> before or after others. >>>>>>>>> >>>>>>>>> Total order solves the second issue in theory, but regular >>>>> comparison >>>>>>> is >>>>>>>>> prevalent and not obvious to developers. And it also doesn't >> help >>>>>> when >>>>>>>> NaN >>>>>>>>> is used instead of null. So using total order is not >> sufficient >>>> for >>>>>>> data >>>>>>>>> skipping. >>>>>>>>> >>>>>>>>> I think the right compromise is to use `min`, `max`, and >>>>> `nan_count` >>>>>>> for >>>>>>>>> data skipping stats (where min and max cannot be NaN) and >> total >>>>>>> ordering >>>>>>>>> for sorting values. That satisfies the data skipping use >> cases >>>> and >>>>>> also >>>>>>>>> gives us an ordering of unaltered values that we can reason >>>> about. >>>>>>>>> >>>>>>>>> Does anyone think that doesn't work? >>>>>>>>> >>>>>>>>> Ryan >>>>>>>>> >>>>>>>>> On Fri, Aug 1, 2025 at 8:57 AM Gang Wu <[email protected]> >> wrote: >>>>>>>>> >>>>>>>>>> Thanks Jan for your endless effort on this! >>>>>>>>>> >>>>>>>>>> I'm in favor of simplicity and generalism. I think we have >>>>> already >>>>>>>>> debated >>>>>>>>>> a lot >>>>>>>>>> for `nan_count` in [1] and [2] is the reflection of those >>>>>>> discussions. >>>>>>>>>> Therefore >>>>>>>>>> I am inclined to start a vote for [2] unless there is a >>>>>> significantly >>>>>>>>>> better >>>>>>>>>> proposal. 
>>>>>>>>>> >>>>>>>>>> I would suggest everyone interested in this discussion to >>>> attend >>>>>> the >>>>>>>>>> scheduled >>>>>>>>>> sync on Aug 6th (detailed below) to spread the word to the >>>>> broader >>>>>>>>>> community. >>>>>>>>>> If we can get a consensus on [2], I can help start the >> vote and >>>>>> move >>>>>>>>>> forward. >>>>>>>>>> >>>>>>>>>> *Apache Parquet Community Sync Wednesday, August 6 · 10:00 >> – >>>>>> 11:00am >>>>>>> * >>>>>>>>>> *Time zone: America/Los_Angeles* >>>>>>>>>> *Google Meet joining info Video call link: >>>>>>>>>> https://meet.google.com/bhe-rvan-qjk >>>>>>>>>> <https://meet.google.com/bhe-rvan-qjk> * >>>>>>>>>> >>>>>>>>>> [1] https://github.com/apache/parquet-format/pull/196 >>>>>>>>>> [2] https://github.com/apache/parquet-format/pull/221 >>>>>>>>>> >>>>>>>>>> Best, >>>>>>>>>> Gang >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Fri, Aug 1, 2025 at 6:16 PM Jan Finis < >> [email protected]> >>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Hi Gijs, >>>>>>>>>>> >>>>>>>>>>> Thank you for bringing up concrete points, I'm happy to >>>> discuss >>>>>>> them >>>>>>>> in >>>>>>>>>>> detail. >>>>>>>>>>> >>>>>>>>>>> NaNs are less common in the SQL world than in the >> DataFrame >>>>> world >>>>>>>> where >>>>>>>>>>>> NaNs were used for a long time to represent missing >> values. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> You could transcode between NULL to NaN before reading >> and >>>>>> writing >>>>>>> to >>>>>>>>>>> Parquet. You basically mention yourself that NaNs were >> used >>>> for >>>>>>>> missing >>>>>>>>>>> values, i.e., what is commonly a NULL, which wasn't >>>> available. >>>>>> So, >>>>>>>>>>> semantically, transcoding to NULL would even be the sane >>>> thing >>>>> to >>>>>>> do. >>>>>>>>>> Yes, >>>>>>>>>>> that will cost you some cycles, but should be a rather >>>>>> lightweight >>>>>>>>>>> operation in comparison to most other operations, so I >> would >>>>>> argue >>>>>>>> that >>>>>>>>>> it >>>>>>>>>>> won't totally ruin your performance. 
>>>>>>>>>>> Similarly, why should Parquet play along with a "hack" that was done
>>>>>>>>>>> in other frameworks due to shortcomings of those frameworks? So from
>>>>>>>>>>> a philosophical point of view, I think supporting NaNs better is the
>>>>>>>>>>> wrong thing to do. Rather, we should be a forcing function to align
>>>>>>>>>>> others to better behavior, so applying a bit of force might in the
>>>>>>>>>>> long run make people use NULLs in DataFrames as well.
>>>>>>>>>>>
>>>>>>>>>>> Of course, your argument also goes in the direction of pragmatism:
>>>>>>>>>>> If a large part of the data science world uses NaNs to encode
>>>>>>>>>>> missing values, then maybe Parquet should accept this de facto
>>>>>>>>>>> standard rather than fighting it. That is indeed a valid point. The
>>>>>>>>>>> weight of it is debatable and my personal conclusion is that it's
>>>>>>>>>>> still not worth it, as you can transcode between NULLs and NaNs, but
>>>>>>>>>>> I do agree with its validity.
>>>>>>>>>>>
>>>>>>>>>>>> Since the proposal phrases it as a goal to work "regardless of how
>>>>>>>>>>>> they order NaN w.r.t. other values" this statement feels
>>>>>>>>>>>> out-of-place to me. Most hardware and most people don't care about
>>>>>>>>>>>> total ordering and needing to take it into account while filtering
>>>>>>>>>>>> using statistics seems like preferring the special case instead of
>>>>>>>>>>>> the common case. Almost no one filters for specific NaN value
>>>>>>>>>>>> bit-patterns. SQL engines that don't have IEEE total ordering as
>>>>>>>>>>>> their default ordering for floats will also need to do more special
>>>>>>>>>>>> handling for this.
>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> I disagree with the conclusion this statement draws. The >>>>> current >>>>>>>>>> behavior, >>>>>>>>>>> and nan_counts without total ordering, pose a real >> problem >>>>> here, >>>>>>> even >>>>>>>>> for >>>>>>>>>>> engines that don't care about bit patterns. I do agree >> that >>>>> most >>>>>>>>> database >>>>>>>>>>> engines, including the one I'm working on, do not care >> about >>>>> bit >>>>>>>>> patterns >>>>>>>>>>> and/or sign bits. However, how can our database engine >> know >>>>>> whether >>>>>>>> the >>>>>>>>>>> writer of a Parquet file saw it the same way? It can't. >>>>>> Therefore, >>>>>>> it >>>>>>>>>>> cannot know whether a writer, for example, ordered NaNs >>>> before >>>>> or >>>>>>>> after >>>>>>>>>> all >>>>>>>>>>> other numbers, or maybe ordered them by sign bit. So, if >> our >>>>>>> database >>>>>>>>>>> engine now sees a float column in sorting columns, it >> cannot >>>>>> apply >>>>>>>> any >>>>>>>>>>> optimization without a lot of special casing, as it >> doesn't >>>>> know >>>>>>>>> whether >>>>>>>>>>> NaNs will be before all other values, after all other >> values, >>>>> or >>>>>>>> maybe >>>>>>>>>>> both, depending on sign bit. It could apply contrived >> logic >>>>> that >>>>>>>> tries >>>>>>>>> to >>>>>>>>>>> infer where NaNs were placed from the NaN counts of the >> first >>>>> and >>>>>>>> last >>>>>>>>>>> page, but doing so will be a lot of ugly code that also >> feels >>>>> to >>>>>> be >>>>>>>> in >>>>>>>>>> the >>>>>>>>>>> wrong place. I.e., I don't want to need to load pages or >> the >>>>> page >>>>>>>>> index, >>>>>>>>>>> just to reason about a sort order. >>>>>>>>>>> >>>>>>>>>>> SQL engines that don't have >>>>>>>>>>>> IEEE total ordering as their default ordering for >> floats >>>> will >>>>>>> also >>>>>>>>> need >>>>>>>>>>> to >>>>>>>>>>>> do more special handling for this. 
>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> This code, which I would indeed need to write for our >> engine, >>>>> is >>>>>>>>>> comparably >>>>>>>>>>> trivial. Simply choose the largest possible bit pattern >> as >>>>>>> comparison >>>>>>>>> for >>>>>>>>>>> upper bounds filtering for NaN, and the smallest >> possible bit >>>>>>> pattern >>>>>>>>> for >>>>>>>>>>> lower bounds. It's not more than a few lines of code that >>>> check >>>>>>>>> whether a >>>>>>>>>>> filter is NaN and then replace its value with the >>>>> highest/lowest >>>>>>> NaN >>>>>>>>> bit >>>>>>>>>>> pattern. It is similarly trivial to the special casing I >> need >>>>> to >>>>>> do >>>>>>>>> with >>>>>>>>>>> nan_counts, and it is way more trivial than the extra >> code I >>>>>> would >>>>>>>> need >>>>>>>>>> to >>>>>>>>>>> write for sorting columns, as depicted above. >>>>>>>>>>> >>>>>>>>>>> From a Polars perspective, having a `nan_count` and >> defining >>>>> what >>>>>>>>>>>> happens to the `min` and `max` statistics when a page >>>>> contains >>>>>>> only >>>>>>>>>> NaNs >>>>>>>>>>> is >>>>>>>>>>>> enough to allow for all predicate filtering. I think, >> but >>>>>> correct >>>>>>>> me >>>>>>>>>> if I >>>>>>>>>>>> am wrong, this is also enough for all SQL engines that >>>> don't >>>>>> use >>>>>>>>> total >>>>>>>>>>>> ordering. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> It's not fully enough, as depicted above. Sorting columns >>>> would >>>>>>> still >>>>>>>>> not >>>>>>>>>>> work properly. 
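
The "few lines of code" described above could look roughly like the following Python sketch. It is an illustration under assumptions, not code from any of the PoCs: `total_order_key`, `LARGEST_NAN`, `SMALLEST_NAN`, and `widen_nan_literal` are hypothetical names. The key function uses the common sign-flip trick on the raw 64-bit pattern so that plain integer comparison of keys matches the IEEE 754 total order for doubles (negative NaNs first, then -inf ... -0, +0 ... +inf, then positive NaNs).

```python
import math
import struct


def bits(x):
    """Raw 64-bit pattern of a double as an unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def from_bits(b):
    """Double with the given raw 64-bit pattern."""
    return struct.unpack("<d", struct.pack("<Q", b))[0]


def total_order_key(x):
    """Integer key whose natural order matches IEEE 754 total order."""
    b = bits(x)
    if b >> 63:
        # Sign bit set: flip all bits so larger magnitudes sort first.
        return b ^ 0xFFFFFFFFFFFFFFFF
    # Sign bit clear: set the top bit so positives sort after negatives.
    return b | 0x8000000000000000


# Extreme NaN bit patterns (assumed names): last and first in total order.
LARGEST_NAN = from_bits(0x7FFFFFFFFFFFFFFF)
SMALLEST_NAN = from_bits(0xFFFFFFFFFFFFFFFF)


def widen_nan_literal(x, for_upper_bound):
    """If a filter literal is NaN, replace it with an extreme NaN pattern."""
    if math.isnan(x):
        return LARGEST_NAN if for_upper_bound else SMALLEST_NAN
    return x


values = [1.0, LARGEST_NAN, -0.0, float("-inf"), 0.0, SMALLEST_NAN, float("inf")]
print(sorted(values, key=total_order_key))
```

How an engine plugs the widened literal into its pruning checks depends on its own NaN semantics; the sketch only shows the substitution itself.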
>>>>>>>>>>>
>>>>>>>>>>>> As for ways forward, I propose merging the `nan_count` and `sort
>>>>>>>>>>>> ordering` proposals into one to make one proposal
>>>>>>>>>>>
>>>>>>>>>>> Note that the initial reason for proposing IEEE total order was that
>>>>>>>>>>> people in the discussion threads found nan_counts to be too complex
>>>>>>>>>>> and too much of an undeserving special case (re-read the discussion
>>>>>>>>>>> in the initial PR
>>>>>>>>>>> <https://github.com/apache/parquet-format/pull/196> to see the
>>>>>>>>>>> rationales).
>>>>>>>>>>> So merging both together would go totally against the spirit of why
>>>>>>>>>>> IEEE total order was proposed. While it has further upsides, the
>>>>>>>>>>> main reason was indeed to *not have* nan_counts. If the proposal now
>>>>>>>>>>> even went to positive and negative NaN counts (i.e., even more
>>>>>>>>>>> complexity), this would go 180 degrees in the opposite direction of
>>>>>>>>>>> why people wanted total order in the first place.
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Jan
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jul 31, 2025 at 23:23, Gijs Burghoorn
>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello Jan and others,
>>>>>>>>>>>>
>>>>>>>>>>>> First, let me preface by saying I am quite new here. So I apologize
>>>>>>>>>>>> if there is some other better way to bring up these concerns. I
>>>>>>>>>>>> understand it is very annoying to come in at the 11th hour and
>>>>>>>>>>>> start bringing up a bunch of concerns, but I would also like this
>>>>>>>>>>>> to be done right.
A >>>>>>>> colleague >>>>>>>>>> of >>>>>>>>>>>> mine brought up some concerns and alternative >> approaches in >>>>> the >>>>>>>>> GitHub >>>>>>>>>>>> thread; I will file some of the concerns here as a >>>> response. >>>>>>>>>>>> >>>>>>>>>>>>> Treating NaNs so specially is giving them attention >> they >>>>>> don't >>>>>>>>>> deserve. >>>>>>>>>>>> Most data sets do not contain NaNs. If a use case >> really >>>>>> requires >>>>>>>>> them >>>>>>>>>>> and >>>>>>>>>>>> needs filtering to ignore them, they can store NULL >>>> instead, >>>>> or >>>>>>>>> encode >>>>>>>>>>> them >>>>>>>>>>>> differently. I would prefer the average case over the >>>> special >>>>>>> case >>>>>>>>>> here. >>>>>>>>>>>> >>>>>>>>>>>> NaNs are less common in the SQL world than in the >> DataFrame >>>>>> world >>>>>>>>> where >>>>>>>>>>>> NaNs were used for a long time to represent missing >> values. >>>>>> They >>>>>>>>> still >>>>>>>>>>>> exist with different canonical representations and >>>> different >>>>>> sign >>>>>>>>>> bits. I >>>>>>>>>>>> agree it might not be correct semantically, but sadly >> that >>>> is >>>>>> the >>>>>>>>> world >>>>>>>>>>> we >>>>>>>>>>>> deal with. NumPy and Numba do not have missing data >>>>>>> functionality, >>>>>>>>>> people >>>>>>>>>>>> use NaNs there, and people definitely use that in their >>>>>>> analytical >>>>>>>>>>>> dataflows. Another point that was brought up in the GH >>>>>> discussion >>>>>>>> was >>>>>>>>>>> "what >>>>>>>>>>>> about infinity? You could argue that having infinity in >>>>>>> statistics >>>>>>>> is >>>>>>>>>>>> similarly unuseful as it's too wide of a bound". I >> would >>>>> argue >>>>>>> that >>>>>>>>>>>> infinity is very different as there is no discussion on >>>> what >>>>>> the >>>>>>>>>> ordering >>>>>>>>>>>> or pattern of infinity is. Everyone agrees that >> `min(1.0, >>>>> inf, >>>>>>>> -inf) >>>>>>>>> == >>>>>>>>>>>> -inf` and each infinity only has a single bit pattern. 
>>>>>>>>>>>> >>>>>>>>>>>>> It gives a defined order to every bit pattern and >> thus >>>>>> yields a >>>>>>>>> total >>>>>>>>>>>> order, mathematically speaking, which has value by >> itself. >>>>> With >>>>>>> NaN >>>>>>>>>>> counts, >>>>>>>>>>>> it was still undefined how different bit patterns of >> NaNs >>>>> were >>>>>>>>> supposed >>>>>>>>>>> to >>>>>>>>>>>> be ordered, whether NaN was allowed to have a sign bit, >>>> etc., >>>>>>>> risking >>>>>>>>>>> that >>>>>>>>>>>> different engines could come to different results while >>>>>> filtering >>>>>>>> or >>>>>>>>>>>> sorting values within a file. >>>>>>>>>>>> >>>>>>>>>>>> Since the proposal phrases it as a goal to work >> "regardless >>>>> of >>>>>>> how >>>>>>>>> they >>>>>>>>>>>> order NaN w.r.t. other values" this statement feels >>>>>> out-of-place >>>>>>> to >>>>>>>>> me. >>>>>>>>>>>> Most hardware and most people don't care about total >>>> ordering >>>>>> and >>>>>>>>>> needing >>>>>>>>>>>> to take it into account while filtering using >> statistics >>>>> seems >>>>>>> like >>>>>>>>>>>> preferring the special case instead of the common case. >>>>> Almost >>>>>>>> noone >>>>>>>>>>>> filters for specific NaN value bit-patterns. SQL >> engines >>>> that >>>>>>> don't >>>>>>>>>> have >>>>>>>>>>>> IEEE total ordering as their default ordering for >> floats >>>> will >>>>>>> also >>>>>>>>> need >>>>>>>>>>> to >>>>>>>>>>>> do more special handling for this. >>>>>>>>>>>> >>>>>>>>>>>> I also agree with my colleague that doing an approach >> that >>>> is >>>>>> 50% >>>>>>>> of >>>>>>>>>> the >>>>>>>>>>>> way there will make the barrier to improving it to >> what it >>>>>>> actually >>>>>>>>>>> should >>>>>>>>>>>> be later on much higher. 
>>>>>>>>>>>> >>>>>>>>>>>> As for ways forward, I propose merging the `nan_count` >> and >>>>>> `sort >>>>>>>>>>> ordering` >>>>>>>>>>>> proposals into one to make one proposal, as they are >> linked >>>>>>>> together, >>>>>>>>>> and >>>>>>>>>>>> moving forward with one without knowing what will >> happen to >>>>> the >>>>>>>> other >>>>>>>>>>> seems >>>>>>>>>>>> unwise. From a Polars perspective, having a >> `nan_count` and >>>>>>>> defining >>>>>>>>>> what >>>>>>>>>>>> happens to the `min` and `max` statistics when a page >>>>> contains >>>>>>> only >>>>>>>>>> NaNs >>>>>>>>>>> is >>>>>>>>>>>> enough to allow for all predicate filtering. I think, >> but >>>>>> correct >>>>>>>> me >>>>>>>>>> if I >>>>>>>>>>>> am wrong, this is also enough for all SQL engines that >>>> don't >>>>>> use >>>>>>>>> total >>>>>>>>>>>> ordering. But if you want to be impartial to the >> engine's >>>>>>>>>> floating-point >>>>>>>>>>>> ordering and allow engines with total ordering to do >>>>> inequality >>>>>>>>> filters >>>>>>>>>>>> when `nan_count > 0` you would need a >> `positive_nan_count` >>>>> and >>>>>> a >>>>>>>>>>>> `negative_nan_count`. I understand the downside with >> Thrift >>>>>>>>> complexity, >>>>>>>>>>> but >>>>>>>>>>>> introducing another sort order is also adding >> complexity >>>> just >>>>>> in >>>>>>> a >>>>>>>>>>>> different place. >>>>>>>>>>>> >>>>>>>>>>>> I would really like to see this move forward, so I hope >>>> these >>>>>>>>> concerns >>>>>>>>>>> help >>>>>>>>>>>> move it forward towards a solution that works for >> everyone. 
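
As a rough illustration of the `positive_nan_count` / `negative_nan_count` idea mentioned above, the following Python sketch splits the NaNs of a page by sign bit. The function name and the returned tuple are hypothetical; no such fields exist in the proposals linked in this thread. `math.copysign` is used because it reads the sign bit even when the value is NaN.

```python
import math
import struct


def signed_nan_counts(values):
    """Return (positive_nan_count, negative_nan_count) for a list of doubles."""
    positive = negative = 0
    for v in values:
        if math.isnan(v):
            # copysign transfers the sign bit of v, which is defined for NaN too.
            if math.copysign(1.0, v) < 0:
                negative += 1
            else:
                positive += 1
    return positive, negative


# Build NaNs with an explicit sign bit from their raw bit patterns.
POS_NAN = struct.unpack("<d", struct.pack("<Q", 0x7FF8000000000000))[0]
NEG_NAN = struct.unpack("<d", struct.pack("<Q", 0xFFF8000000000000))[0]

print(signed_nan_counts([1.0, POS_NAN, NEG_NAN, NEG_NAN, -2.0]))
```

Under total order, negative NaNs sort before all other values and positive NaNs after them, so the two counts would tell a reader at which end of the range the NaNs of a page sit.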
>>>>>>>>>>>> >>>>>>>>>>>> Kind regards, >>>>>>>>>>>> Gijs >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Thu, Jul 31, 2025 at 7:46 PM Andrew Lamb < >>>>>>>> [email protected]> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> I would also be in favor of starting a vote >>>>>>>>>>>>> >>>>>>>>>>>>> On Thu, Jul 31, 2025 at 11:23 AM Jan Finis < >>>>>> [email protected]> >>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> As the author of both the IEEE754 total order >>>>>>>>>>>>>> <https://github.com/apache/parquet-format/pull/221> >> PR >>>>> and >>>>>>> the >>>>>>>>>>> earlier >>>>>>>>>>>>> PR >>>>>>>>>>>>>> that basically proposed `nan_count` >>>>>>>>>>>>>> <https://github.com/apache/parquet-format/pull/196 >>> , >>>> my >>>>>>>> current >>>>>>>>>> vote >>>>>>>>>>>>> would >>>>>>>>>>>>>> be for IEEE754 total order. >>>>>>>>>>>>>> Consequently, I would like to request a formal >> vote for >>>>> the >>>>>>> PR >>>>>>>>>>>>> introducing >>>>>>>>>>>>>> IEEE754 total order ( >>>>>>>>>>> https://github.com/apache/parquet-format/pull/221 >>>>>>>>>>>> ), >>>>>>>>>>>>>> if >>>>>>>>>>>>>> that is possible. >>>>>>>>>>>>>> >>>>>>>>>>>>>> My Rationales: >>>>>>>>>>>>>> >>>>>>>>>>>>>> - It's conceptually simpler. It's easier to >> explain. >>>>>> It's >>>>>>>>> based >>>>>>>>>> on >>>>>>>>>>>> an >>>>>>>>>>>>>> IEEE-standardized order predicate. >>>>>>>>>>>>>> - There are already multiple implementations >> showing >>>>>>>>>> feasibility. >>>>>>>>>>>> This >>>>>>>>>>>>>> will likely make the adoption quicker. >>>>>>>>>>>>>> - It gives a defined order to every bit pattern >> and >>>>> thus >>>>>>>>> yields >>>>>>>>>> a >>>>>>>>>>>>> total >>>>>>>>>>>>>> order, mathematically speaking, which has value >> by >>>>>> itself. 
>>>>>>>>> With >>>>>>>>>>> NaN >>>>>>>>>>>>>> counts, >>>>>>>>>>>>>> it was still undefined how different bit >> patterns of >>>>>> NaNs >>>>>>>> were >>>>>>>>>>>>> supposed >>>>>>>>>>>>>> to >>>>>>>>>>>>>> be ordered, whether NaN was allowed to have a >> sign >>>>> bit, >>>>>>>> etc., >>>>>>>>>>>> risking >>>>>>>>>>>>>> that >>>>>>>>>>>>>> different engines could come to different >> results >>>>> while >>>>>>>>>> filtering >>>>>>>>>>> or >>>>>>>>>>>>>> sorting values within a file. >>>>>>>>>>>>>> - It also solves sort order completely. With >>>>> nan_counts >>>>>>>> only, >>>>>>>>> it >>>>>>>>>>> is >>>>>>>>>>>>>> still undefined whether nans should be sorted >> before >>>>> or >>>>>>>> after >>>>>>>>>> all >>>>>>>>>>>>> values >>>>>>>>>>>>>> (or both, depending on sign bit), so any file >>>>> including >>>>>>> NaNs >>>>>>>>>> could >>>>>>>>>>>> not >>>>>>>>>>>>>> really leverage sort order without being >> ambiguous. >>>>>>>>>>>>>> - It's less complex in thrift. Having fields >> that >>>> only >>>>>>> apply >>>>>>>>> to >>>>>>>>>> a >>>>>>>>>>>>>> handful of data types is somehow weird. If every >>>> type >>>>>> did >>>>>>>>> this, >>>>>>>>>> we >>>>>>>>>>>>> would >>>>>>>>>>>>>> have a plethora of non-generic fields in thrift. >>>>>>>>>>>>>> - Treating NaNs so specially is giving them >>>> attention >>>>>> they >>>>>>>>> don't >>>>>>>>>>>>>> deserve. Most data sets do not contain NaNs. If >> a >>>> use >>>>>> case >>>>>>>>>> really >>>>>>>>>>>>>> requires >>>>>>>>>>>>>> them and needs filtering to ignore them, they >> can >>>>> store >>>>>>> NULL >>>>>>>>>>>> instead, >>>>>>>>>>>>>> or encode them differently. I would prefer the >>>> average >>>>>>> case >>>>>>>>> over >>>>>>>>>>> the >>>>>>>>>>>>>> special case here. >>>>>>>>>>>>>> - The majority of the people discussing this so >> far >>>>> seem >>>>>>> to >>>>>>>>>> favor >>>>>>>>>>>>> total >>>>>>>>>>>>>> order. 
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> Jan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, Jul 26, 2025 at 17:38, Gang Wu <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As this discussion has been open for more than two years, I’d
>>>>>>>>>>>>>>> like to bump up this thread again to update the progress and
>>>>>>>>>>>>>>> collect feedback.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *Background*
>>>>>>>>>>>>>>> • Today Parquet’s min/max stats and page index omit NaNs entirely.
>>>>>>>>>>>>>>> • Engines can’t safely prune floating-point values because they
>>>>>>>>>>>>>>> know nothing about NaNs.
>>>>>>>>>>>>>>> • Column index is disabled if any page contains only NaNs.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> There are two active proposals, as below:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *Proposal A - IEEE754TotalOrder* (from the PR [1])
>>>>>>>>>>>>>>> • Define a new ColumnOrder to include +0, -0 and all NaN
>>>>>>>>>>>>>>> bit patterns.
>>>>>>>>>>>>>>> • Stats and column index store NaNs if they appear.
>>>>>>>>>>>>>>> • Three PoC impls are ready: arrow-rs [2], duckdb [3] and
>>>>>>>>>>>>>>> parquet-java [4].
>>>>>>>>>>>>>>> • For more context on this approach, please refer to the
>>>>>>>>>>>>>>> discussion in [5].
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *Proposal B - add nan_count* (from a comment [6] to [1])
>>>>>>>>>>>>>>> • Add `nan_count` to stats and a `nan_counts` list to column index.
>>>>>>>>>>>>>>> • For all-NaN cases, write NaN to min/max and use nan_count to
>>>>>>>>>>>>>>> distinguish.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Both solutions have pros and cons but are way better than the
>>>>>>>>>>>>>>> status quo today.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Please share your thoughts on the two proposals above, or maybe come up with better alternatives. We need to reach consensus on one proposal and move forward.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [1] https://github.com/apache/parquet-format/pull/221
>>>>>>>>>>>>>>> [2] https://github.com/apache/arrow-rs/pull/7408
>>>>>>>>>>>>>>> [3] https://github.com/duckdb/duckdb/compare/main...Mytherin:duckdb:ieeeorder
>>>>>>>>>>>>>>> [4] https://github.com/apache/parquet-java/pull/3191
>>>>>>>>>>>>>>> [5] https://github.com/apache/parquet-format/pull/196
>>>>>>>>>>>>>>> [6] https://github.com/apache/parquet-format/pull/221#issuecomment-2931376077
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Gang
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Mar 28, 2023 at 4:22 PM Jan Finis <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Dear contributors,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> My PR has now gathered comments for a week and the gist of all open issues is the question of how to encode pages/column chunks that contain only NaNs. There are different suggestions and I don't see one common favorite yet.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have outlined three alternatives for how we can handle these and I want us to reach a conclusion here, so I can update my PR accordingly and move on with it. As this is my first contribution to Parquet, I don't know the decision processes here. Do we vote? Is there a single decision maker or a group of decision makers? *Please let me know how to come to a conclusion here; what are the next steps?*
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For reference, here are the three alternatives I pointed out. You can find a detailed description of their pros and cons in my comment:
>>>>>>>>>>>>>>>> https://github.com/apache/parquet-format/pull/196#issuecomment-1486416762
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1. My initial proposal, i.e., encoding only-NaN pages by min=max=NaN.
>>>>>>>>>>>>>>>> 2. Adding `num_values` to the ColumnIndex, to make it symmetric with Statistics in pages & `ColumnMetaData` and to enable the computation `num_values - null_count - nan_count == 0`.
>>>>>>>>>>>>>>>> 3. Adding a `nan_pages` bool list to the column index, which indicates whether a page contains only NaNs.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Cheers
>>>>>>>>>>>>>>>> Jan Finis
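For readers comparing alternatives 2 and 3 above, here is a small sketch of the check from alternative 2 and how a `nan_pages`-style list could be derived from the same counts. The function name and the sample tuples are made up for illustration. Note one assumption beyond the quoted formula: the extra `nan_count > 0` guard is mine, since `num_values - null_count - nan_count == 0` on its own also holds for a page that contains only nulls.

```python
def is_only_nan_page(num_values: int, null_count: int, nan_count: int) -> bool:
    """True if the page has at least one NaN and every non-null value is NaN."""
    non_null = num_values - null_count
    # Guard (an assumption, see the lead-in): an all-null page is not an only-NaN page.
    return nan_count > 0 and non_null - nan_count == 0


# Hypothetical per-page counts: (num_values, null_count, nan_count)
pages = [
    (100, 0, 0),    # no NaNs at all
    (100, 10, 90),  # every non-null value is NaN
    (50, 50, 0),    # only nulls
    (80, 5, 20),    # NaNs mixed with regular values
]

# Equivalent of the `nan_pages` bool list from alternative 3, derived from the counts.
nan_pages = [is_only_nan_page(*page) for page in pages]
```

The derivation shows why alternative 2 subsumes alternative 3 in terms of information: given `num_values`, `null_count`, and `nan_count` per page, the bool list can be computed, but not the other way around.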
