Update: Java POC is ready for IEEE 754 column order combined with nan_count: https://github.com/apache/parquet-java/pull/3393
The spec PR has been updated earlier to address all comments:
https://github.com/apache/parquet-format/pull/514

Really appreciate any review and feedback!

Best,
Gang

On Wed, Feb 11, 2026 at 4:24 PM Gang Wu <[email protected]> wrote:

> Hello all,
>
> I'm reaching out to help drive this long-running discussion (nearly
> three years now) towards a final resolution. With Jan's authorization,
> and my sincere thanks for his sustained effort, I want to help push
> this issue to the finish line.
>
> To recap, we have two primary proposals on how to handle NaNs in
> statistics and column indexes:
>
> * IEEE 754 Total Order [1]: Proposes adding a new column order
>   IEEE754TotalOrder for FLOAT/DOUBLE/FLOAT16. This provides a defined
>   ordering for every float bit pattern, including NaNs and -0/+0,
>   allowing writers to include NaNs in min/max and removing ambiguity for
>   only-NaN pages.
> * Combined Approach [2]: Proposes adopting the IEEE 754 total order
>   alongside explicit nan_count(s) fields. This approach mandates the
>   nan_count(s) when the new order is used and clarifies how to handle
>   edge cases from legacy writers.
>
> Based on the recent comments, it appears the combined approach [2] is
> gaining consensus, although the IEEE 754 total order [1] still has
> strong advocates.
>
> I agree with the sentiment that technical direction should be made by
> consensus, not a vote. To that end, I'd like to solicit further
> feedback specifically on the combined approach [2] to see if we can
> achieve the necessary consensus to move forward now.
>
> I recall that the total order proposal [1] already has three PoC
> implementations. For the combined approach [2], I can draft a PoC in
> parquet-java, but to meet the two-implementation requirement, we would
> need one more contributor to step up.
> > [1] https://github.com/apache/parquet-format/pull/221 > [2] https://github.com/apache/parquet-format/pull/514 > > Best, > Gang > > > On Sat, Aug 16, 2025 at 1:59 AM Gijs Burghoorn <[email protected]> > wrote: > > > > Hello Jan, > > > > Thank you for pushing this through. Apart from some smaller nits, we also > > really like the current proposal. > > > > Thanks, > > Gijs > > > > On Fri, Aug 15, 2025 at 3:33 PM Andrew Lamb <[email protected]> > wrote: > > > > > I have started organizing a project[1] in arrow-rs 's Parquet reader > to try > > > and implement this proposal. > > > > > > Hopefully that can be 1 / 2 open source implementations needed. > > > > > > Thanks again for helping drive this along, > > > Andrew > > > > > > [1] https://github.com/apache/arrow-rs/issues/8156 > > > > > > On Wed, Aug 13, 2025 at 5:39 AM Jan Finis <[email protected]> wrote: > > > > > > > I have now tagged > > > > < > > > > https://github.com/apache/parquet-format/pull/514#issuecomment-3182978173 > > > > > > > > > the people that argued for total order in the initial PR. Let's see > their > > > > response. > > > > > > > > If I understand the adoption process correctly, the next hurdle to > > > getting > > > > this adopted is two open > > > > source (!) implementations proving its feasibility. We already had > that > > > for > > > > IEEE total order. If we > > > > prefer the solution with nan counts, we'll need it there as well. I > > > myself > > > > work on a proprietary > > > > implementation, so I'm counting on others here :). Be prepared > though, > > > this > > > > will likely take months > > > > unless the interest in this topic has risen to a point where people > are > > > > eager to jump on the implementation > > > > right away. > > > > > > > > So, I guess it will take some months of soaking time before any > formal > > > vote > > > > can be done > > > > (given that we reach consensus that this is what we want and we find > > > people > > > > for the implementations). 
> > > > > > > > Cheers, > > > > Jan > > > > > > > > Am Mi., 13. Aug. 2025 um 01:18 Uhr schrieb Ryan Blue < > [email protected]>: > > > > > > > > > Thanks, Jan. I also went through the combined proposal and it looks > > > > mostly > > > > > good to me. > > > > > > > > > > > First of all, to make it quick: Yes, the solution of having > > > nan_counts > > > > > *and* total order, which was brought up multiple times, does work > and > > > > > solves more cases than just either of both. > > > > > > > > > > Great, then we have a solution for both filtering use cases and for > > > > moving > > > > > ahead with total order. And thanks to Andrew for suggesting this as > > > well > > > > on > > > > > the second PR. I think this also looks like this is something that > > > Orson > > > > is > > > > > okay with given his comments on the latest PR. > > > > > > > > > > Is there anyone against the combined approach? I don't see a big > > > downside > > > > > for anyone. It is compatible with previous stats rules, has a NaN > > > count, > > > > > and allows using either type-specific order or total order. > > > > > > > > > > Assuming that this satisfies the big objections, I think we should > wait > > > > for > > > > > a few days to make sure everyone has time to check out the new PR > and > > > > then > > > > > vote to adopt it. > > > > > > > > > > Ryan > > > > > > > > > > On Mon, Aug 11, 2025 at 6:03 AM Andrew Lamb < > [email protected]> > > > > > wrote: > > > > > > > > > > > Thank you Jan -- I read through the new combined proposal, and I > > > > thought > > > > > it > > > > > > looks good and addresses the feedback so far. I left some small > style > > > > > > suggestions, but nothing that is required from my perspective > > > > > > > > > > > > > > > > > > > > > > > > On Sat, Aug 9, 2025 at 9:07 AM Jan Finis <[email protected]> > wrote: > > > > > > > > > > > > > Hey Ryan, > > > > > > > > > > > > > > Thanks for chiming in. 
First of all, to make it quick: Yes, the > > > > > solution > > > > > > of > > > > > > > having nan_counts *and* total order, which was brought up > multiple > > > > > times, > > > > > > > does work and solves more cases than just either of both. > > > > > > > > > > > > > > I strongly prefer continuing to discuss the merits of these > > > > approaches > > > > > > > > rather than trying to decide with a vote. > > > > > > > > > > > > > > > > > > > > > In theory, I agree that it isn't good to silence a discussion > by > > > just > > > > > > > voting for one possible solution and technical issues should be > > > > > > discussed. > > > > > > > However, please note that we have been circling on this for > over > > > two > > > > > > years > > > > > > > now, including an extended discussion that brought up all > arguments > > > > > > > multiple times. This is in stark contrast to the > > > > > > > speed with which you guys work on the Iceberg spec, for > example. > > > > There, > > > > > > you > > > > > > > also do not discuss the merits of various solutions for > multiple > > > > years. > > > > > > You > > > > > > > just pick one and merge it after a *reasonable* time of > discussion. > > > > > > > If you had the speed we currently have here, nothing would get > > > done. > > > > > > Thus, > > > > > > > I see this as a clear case of *"the perfect is the enemy of the > > > > good"*. > > > > > > > Yes, we can continue looking for the perfect solution, > > > > > > > but that will likely lead to keeping us at the status quo, > which is > > > > the > > > > > > > worst of them all. > > > > > > > > > > > > > > That being said, I'm also happy to create a PR which does both > > > total > > > > > > order > > > > > > > and NaN counts; after all, I just want the issue solved and all > > > these > > > > > > > solutions are better than the status quo. 
> > > > > > >
> > > > > > > *As this has now been suggested by at least three people, I guess it's
> > > > > > > worth doing, so here you go:
> > > > > > > https://github.com/apache/parquet-format/pull/514*
> > > > > > >
> > > > > > > With this, we should have PRs covering most of the solution space.
> > > > > > > (I'm refusing to create a PR with negative and positive nan_counts;
> > > > > > > nan_counts + total order has to suffice; the complexity madness has to
> > > > > > > stop somewhere.)
> > > > > > > I still believe that a number of people already found nan_counts too
> > > > > > > complex and therefore wanted IEEE total order, and these people may
> > > > > > > not like putting on extra complexity, but let's see, maybe some have
> > > > > > > also changed their opinion in the meantime.
> > > > > > >
> > > > > > > *Given all this, we can also first do an informal vote where everyone
> > > > > > > can vote for which of the three their favorite would be. Maybe a clear
> > > > > > > favorite will emerge and then we can vote on this one.*
> > > > > > >
> > > > > > > But of course, we can also take some weeks to discuss the three
> > > > > > > solutions, now that we have PRs for all of them. I just hope this won't
> > > > > > > make us continue for another 2 years, or end in an infinite stalemate
> > > > > > > where each solution is vetoed by a PMC member.
> > > > > > > (Sorry for becoming a bit cynical here; I have just spent way too much
> > > > > > > time of my life with double statistics at this point ;) ...)
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Jan
> > > > > > >
> > > > > > > Am Fr., 8. Aug. 
2025 um 23:38 Uhr schrieb Ryan Blue < > > > > [email protected] > > > > > >: > > > > > > > > > > > > > > > Regarding the process for this, I strongly prefer continuing > to > > > > > discuss > > > > > > > the > > > > > > > > merits of these approaches rather than trying to decide with > a > > > > vote. > > > > > I > > > > > > > > don't think it is a good practice to use a vote to decide on > a > > > > > > technical > > > > > > > > direction. There are very few situations that warrant it and > I > > > > don't > > > > > > > think > > > > > > > > that this is one of them. While this issue has been open for > a > > > long > > > > > > time, > > > > > > > > that appears to be the result of it not being anyone's top > > > priority > > > > > > > rather > > > > > > > > than indecision. > > > > > > > > > > > > > > > > For the technical merits of these approaches, I think that > we can > > > > > find > > > > > > a > > > > > > > > middle ground. I agree with Jan that when working with sorted > > > > values, > > > > > > we > > > > > > > > need to know how NaN values were handled and that requires > using > > > a > > > > > > > > well-defined order that includes NaN and its variations > (because > > > we > > > > > > > should > > > > > > > > not normalize). Using NaN count is not sufficient for > ordering > > > > rows. > > > > > > > > > > > > > > > > Gijs also brings up good points about how NaN values show up > in > > > > > actual > > > > > > > > datasets: not just when used in place of null, but also as > the > > > > result > > > > > > of > > > > > > > > normal calculations on abnormal data, like `sqrt(-4.0)` or > > > > > `log(-1.0)`. > > > > > > > > Both of those present problems when mixed with valid data > because > > > > of > > > > > > the > > > > > > > > stats "poisoning" problem, where the range of valid data is > > > usable > > > > > > until > > > > > > > a > > > > > > > > single NaN is mixed in. 
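
For illustration, the stats "poisoning" effect described above can be sketched in a few lines of Java. This is only a minimal sketch under stated assumptions: the class `NanAwareStats`, its `Stats` holder, and the method names are hypothetical and are not part of parquet-java or of either spec PR. It contrasts a naive fold with `Math.min`, which returns NaN as soon as one input is NaN, with a fold that keeps NaNs out of min/max and counts them separately.

```java
final class NanAwareStats {
    /** Hypothetical holder: min/max over non-NaN values plus a NaN counter. */
    static final class Stats {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        long nanCount = 0;
        boolean hasNonNan = false; // false means min/max carry no information
    }

    /** Naive fold: Math.min returns NaN if either argument is NaN. */
    static double naiveMin(double[] values) {
        double min = Double.POSITIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
        }
        return min;
    }

    /** NaN-aware fold: NaNs are counted but never enter min/max. */
    static Stats collect(double[] values) {
        Stats s = new Stats();
        for (double v : values) {
            if (Double.isNaN(v)) {
                s.nanCount++;
                continue;
            }
            s.hasNonNan = true;
            s.min = Math.min(s.min, v);
            s.max = Math.max(s.max, v);
        }
        return s;
    }
}
```

With a page such as `{1.0, Math.sqrt(-4.0), 3.0}`, `naiveMin` yields NaN, while `collect` keeps the usable range of 1.0 to 3.0 and reports one NaN.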
> > > > > > > > > > > > > > > > Another issue is that NaN is error-prone because "regular" > > > > comparison > > > > > > is > > > > > > > > always false: > > > > > > > > ``` > > > > > > > > Math.log(-1.0) >= 2 => FALSE > > > > > > > > Math.log(-1.0) < 2 => FALSE > > > > > > > > 2 > Math.log(-1.0) => FALSE > > > > > > > > ``` > > > > > > > > > > > > > > > > As a result, Iceberg doesn't trust NaN values as either > lower or > > > > > upper > > > > > > > > bounds because we don't want to go back to the code that > produced > > > > the > > > > > > > value > > > > > > > > to see what the comparison order was to determine whether NaN > > > > values > > > > > go > > > > > > > > before or after others. > > > > > > > > > > > > > > > > Total order solves the second issue in theory, but regular > > > > comparison > > > > > > is > > > > > > > > prevalent and not obvious to developers. And it also doesn't > help > > > > > when > > > > > > > NaN > > > > > > > > is used instead of null. So using total order is not > sufficient > > > for > > > > > > data > > > > > > > > skipping. > > > > > > > > > > > > > > > > I think the right compromise is to use `min`, `max`, and > > > > `nan_count` > > > > > > for > > > > > > > > data skipping stats (where min and max cannot be NaN) and > total > > > > > > ordering > > > > > > > > for sorting values. That satisfies the data skipping use > cases > > > and > > > > > also > > > > > > > > gives us an ordering of unaltered values that we can reason > > > about. > > > > > > > > > > > > > > > > Does anyone think that doesn't work? > > > > > > > > > > > > > > > > Ryan > > > > > > > > > > > > > > > > On Fri, Aug 1, 2025 at 8:57 AM Gang Wu <[email protected]> > wrote: > > > > > > > > > > > > > > > > > Thanks Jan for your endless effort on this! > > > > > > > > > > > > > > > > > > I'm in favor of simplicity and generalism. 
I think we have > > > > already > > > > > > > > debated > > > > > > > > > a lot > > > > > > > > > for `nan_count` in [1] and [2] is the reflection of those > > > > > > discussions. > > > > > > > > > Therefore > > > > > > > > > I am inclined to start a vote for [2] unless there is a > > > > > significantly > > > > > > > > > better > > > > > > > > > proposal. > > > > > > > > > > > > > > > > > > I would suggest everyone interested in this discussion to > > > attend > > > > > the > > > > > > > > > scheduled > > > > > > > > > sync on Aug 6th (detailed below) to spread the word to the > > > > broader > > > > > > > > > community. > > > > > > > > > If we can get a consensus on [2], I can help start the > vote and > > > > > move > > > > > > > > > forward. > > > > > > > > > > > > > > > > > > *Apache Parquet Community Sync Wednesday, August 6 · 10:00 > – > > > > > 11:00am > > > > > > * > > > > > > > > > *Time zone: America/Los_Angeles* > > > > > > > > > *Google Meet joining info Video call link: > > > > > > > > > https://meet.google.com/bhe-rvan-qjk > > > > > > > > > <https://meet.google.com/bhe-rvan-qjk> * > > > > > > > > > > > > > > > > > > [1] https://github.com/apache/parquet-format/pull/196 > > > > > > > > > [2] https://github.com/apache/parquet-format/pull/221 > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > Gang > > > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, Aug 1, 2025 at 6:16 PM Jan Finis < > [email protected]> > > > > > wrote: > > > > > > > > > > > > > > > > > > > Hi Gijs, > > > > > > > > > > > > > > > > > > > > Thank you for bringing up concrete points, I'm happy to > > > discuss > > > > > > them > > > > > > > in > > > > > > > > > > detail. > > > > > > > > > > > > > > > > > > > > NaNs are less common in the SQL world than in the > DataFrame > > > > world > > > > > > > where > > > > > > > > > > > NaNs were used for a long time to represent missing > values. 
> > > > > > > > > >
> > > > > > > > > > You could transcode between NULL and NaN before reading and writing to
> > > > > > > > > > Parquet. You basically mention yourself that NaNs were used for missing
> > > > > > > > > > values, i.e., what is commonly a NULL, which wasn't available. So,
> > > > > > > > > > semantically, transcoding to NULL would even be the sane thing to do.
> > > > > > > > > > Yes, that will cost you some cycles, but should be a rather lightweight
> > > > > > > > > > operation in comparison to most other operations, so I would argue that
> > > > > > > > > > it won't totally ruin your performance. Similarly, why should Parquet
> > > > > > > > > > play along with a "hack" that was done in other frameworks due to
> > > > > > > > > > shortcomings of those frameworks? So from a philosophical point of
> > > > > > > > > > view, I think supporting NaNs better is the wrong thing to do. Rather,
> > > > > > > > > > we should be a forcing function to align others to better behavior, so
> > > > > > > > > > applying a bit of force might in the long run make people use NULLs
> > > > > > > > > > also in DataFrames.
> > > > > > > > > >
> > > > > > > > > > Of course, your argument also goes into the direction of pragmatism:
> > > > > > > > > > If a large part of the data science world uses NaNs to encode missing
> > > > > > > > > > values, then maybe Parquet should accept this de-facto standard rather
> > > > > > > > > > than fighting it. That is indeed a valid point. 
The weight of > it > > > is > > > > > > > > debatable > > > > > > > > > > and my personal conclusion is that it's still not worth > it, > > > as > > > > > you > > > > > > > can > > > > > > > > > > transcode between NULLs and NaNs, but I do agree with its > > > > > validity. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Since the proposal phrases it as a goal to work > "regardless > > > of > > > > > how > > > > > > > they > > > > > > > > > > > order NaN w.r.t. other values" this statement feels > > > > > out-of-place > > > > > > to > > > > > > > > me. > > > > > > > > > > > Most hardware and most people don't care about total > > > ordering > > > > > and > > > > > > > > > needing > > > > > > > > > > > to take it into account while filtering using > statistics > > > > seems > > > > > > like > > > > > > > > > > > preferring the special case instead of the common case. > > > > Almost > > > > > > > noone > > > > > > > > > > > filters for specific NaN value bit-patterns. SQL > engines > > > that > > > > > > don't > > > > > > > > > have > > > > > > > > > > > IEEE total ordering as their default ordering for > floats > > > will > > > > > > also > > > > > > > > need > > > > > > > > > > to > > > > > > > > > > > do more special handling for this. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I disagree with the conclusion this statement draws. The > > > > current > > > > > > > > > behavior, > > > > > > > > > > and nan_counts without total ordering, pose a real > problem > > > > here, > > > > > > even > > > > > > > > for > > > > > > > > > > engines that don't care about bit patterns. I do agree > that > > > > most > > > > > > > > database > > > > > > > > > > engines, including the one I'm working on, do not care > about > > > > bit > > > > > > > > patterns > > > > > > > > > > and/or sign bits. However, how can our database engine > know > > > > > whether > > > > > > > the > > > > > > > > > > writer of a Parquet file saw it the same way? It can't. 
> > > > > Therefore, > > > > > > it > > > > > > > > > > cannot know whether a writer, for example, ordered NaNs > > > before > > > > or > > > > > > > after > > > > > > > > > all > > > > > > > > > > other numbers, or maybe ordered them by sign bit. So, if > our > > > > > > database > > > > > > > > > > engine now sees a float column in sorting columns, it > cannot > > > > > apply > > > > > > > any > > > > > > > > > > optimization without a lot of special casing, as it > doesn't > > > > know > > > > > > > > whether > > > > > > > > > > NaNs will be before all other values, after all other > values, > > > > or > > > > > > > maybe > > > > > > > > > > both, depending on sign bit. It could apply contrived > logic > > > > that > > > > > > > tries > > > > > > > > to > > > > > > > > > > infer where NaNs were placed from the NaN counts of the > first > > > > and > > > > > > > last > > > > > > > > > > page, but doing so will be a lot of ugly code that also > feels > > > > to > > > > > be > > > > > > > in > > > > > > > > > the > > > > > > > > > > wrong place. I.e., I don't want to need to load pages or > the > > > > page > > > > > > > > index, > > > > > > > > > > just to reason about a sort order. > > > > > > > > > > > > > > > > > > > > SQL engines that don't have > > > > > > > > > > > IEEE total ordering as their default ordering for > floats > > > will > > > > > > also > > > > > > > > need > > > > > > > > > > to > > > > > > > > > > > do more special handling for this. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > This code, which I would indeed need to write for our > engine, > > > > is > > > > > > > > > comparably > > > > > > > > > > trivial. Simply choose the largest possible bit pattern > as > > > > > > comparison > > > > > > > > for > > > > > > > > > > upper bounds filtering for NaN, and the smallest > possible bit > > > > > > pattern > > > > > > > > for > > > > > > > > > > lower bounds. 
It's not more than a few lines of code that > > > check > > > > > > > > whether a > > > > > > > > > > filter is NaN and then replace its value with the > > > > highest/lowest > > > > > > NaN > > > > > > > > bit > > > > > > > > > > pattern. It is similarly trivial to the special casing I > need > > > > to > > > > > do > > > > > > > > with > > > > > > > > > > nan_counts, and it is way more trivial than the extra > code I > > > > > would > > > > > > > need > > > > > > > > > to > > > > > > > > > > write for sorting columns, as depicted above. > > > > > > > > > > > > > > > > > > > > From a Polars perspective, having a `nan_count` and > defining > > > > what > > > > > > > > > > > happens to the `min` and `max` statistics when a page > > > > contains > > > > > > only > > > > > > > > > NaNs > > > > > > > > > > is > > > > > > > > > > > enough to allow for all predicate filtering. I think, > but > > > > > correct > > > > > > > me > > > > > > > > > if I > > > > > > > > > > > am wrong, this is also enough for all SQL engines that > > > don't > > > > > use > > > > > > > > total > > > > > > > > > > > ordering. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > It's not fully enough, as depicted above. Sorting columns > > > would > > > > > > still > > > > > > > > not > > > > > > > > > > work properly. 
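
For illustration, the NaN substitution described above could look roughly like the following Java sketch. It assumes that statistics are compared in IEEE 754 total order; the class `TotalOrderNanProbe` and all of its members are hypothetical names and are not taken from parquet-java or from the spec PRs. The key transform is the usual bit trick for total order: flip the lower 63 bits of negative values so that signed integer comparison of the keys matches the total order of the doubles.

```java
final class TotalOrderNanProbe {
    // Positive NaN with every payload bit set: the greatest value in total order.
    static final long LARGEST_NAN_BITS = 0x7fffffffffffffffL;
    // Negative NaN with every payload bit set: the least value in total order.
    static final long SMALLEST_NAN_BITS = 0xffffffffffffffffL;

    /** Maps raw IEEE 754 bits to a key whose signed order is the total order. */
    static long totalOrderKey(long rawBits) {
        // (rawBits >> 63) is all ones for negative values, zero otherwise;
        // the unsigned shift clears the sign bit of that mask.
        return rawBits ^ ((rawBits >> 63) >>> 1);
    }

    /** Compares two doubles in total order, distinguishing -0.0/+0.0 and NaN payloads. */
    static int compareTotalOrder(double a, double b) {
        return Long.compare(
            totalOrderKey(Double.doubleToRawLongBits(a)),
            totalOrderKey(Double.doubleToRawLongBits(b)));
    }

    /**
     * Returns the raw bits to use when a filter literal is checked against a
     * bound: a NaN literal is replaced by the largest or smallest NaN pattern,
     * any other literal is passed through unchanged.
     */
    static long probeBits(double filter, boolean forUpperBound) {
        if (!Double.isNaN(filter)) {
            return Double.doubleToRawLongBits(filter);
        }
        return forUpperBound ? LARGEST_NAN_BITS : SMALLEST_NAN_BITS;
    }
}
```

How the resulting probe is then compared with a page's min or max depends on the engine's own NaN semantics; the sketch only shows that the substitution itself is a handful of lines.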
> > > > > > > > > > > > > > > > > > > > As for ways forward, I propose merging the `nan_count` > and > > > > `sort > > > > > > > > > ordering` > > > > > > > > > > > proposals into one to make one proposal > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Note that the initial reason for proposing IEEE total > order > > > was > > > > > > that > > > > > > > > > people > > > > > > > > > > in the discussion threads found nan_counts to be too > complex > > > > and > > > > > > too > > > > > > > > much > > > > > > > > > > of an undeserving special case (re-read the discussion > in the > > > > > > initial > > > > > > > > PR > > > > > > > > > > <https://github.com/apache/parquet-format/pull/196> to > see > > > the > > > > > > > > > > rationales). > > > > > > > > > > So merging both together would go totally against the > spirit > > > of > > > > > why > > > > > > > > IEEE > > > > > > > > > > total order was proposed. While it has further upsides, > the > > > > main > > > > > > > reason > > > > > > > > > was > > > > > > > > > > indeed to *not have* nan_counts. If now the proposal > would > > > even > > > > > go > > > > > > to > > > > > > > > > > positive and negative nan counts (i.e., even more > > > complexity), > > > > > this > > > > > > > > would > > > > > > > > > > go 180 degrees into the opposite direction of why people > > > wanted > > > > > > total > > > > > > > > > order > > > > > > > > > > in the first place. > > > > > > > > > > > > > > > > > > > > Cheers, > > > > > > > > > > Jan > > > > > > > > > > > > > > > > > > > > Am Do., 31. Juli 2025 um 23:23 Uhr schrieb Gijs Burghoorn > > > > > > > > > > <[email protected]>: > > > > > > > > > > > > > > > > > > > > > Hello Jan and others, > > > > > > > > > > > > > > > > > > > > > > First, let me preface by saying I am quite new here. > So I > > > > > > apologize > > > > > > > > if > > > > > > > > > > > there is some other better way to bring up these > concerns. 
> > > I > > > > > > > > understand > > > > > > > > > > it > > > > > > > > > > > is very annoying to come in at the 11th hour and start > > > > bringing > > > > > > up > > > > > > > a > > > > > > > > > > bunch > > > > > > > > > > > of concerns, but I would also like this to be done > right. A > > > > > > > colleague > > > > > > > > > of > > > > > > > > > > > mine brought up some concerns and alternative > approaches in > > > > the > > > > > > > > GitHub > > > > > > > > > > > thread; I will file some of the concerns here as a > > > response. > > > > > > > > > > > > > > > > > > > > > > > Treating NaNs so specially is giving them attention > they > > > > > don't > > > > > > > > > deserve. > > > > > > > > > > > Most data sets do not contain NaNs. If a use case > really > > > > > requires > > > > > > > > them > > > > > > > > > > and > > > > > > > > > > > needs filtering to ignore them, they can store NULL > > > instead, > > > > or > > > > > > > > encode > > > > > > > > > > them > > > > > > > > > > > differently. I would prefer the average case over the > > > special > > > > > > case > > > > > > > > > here. > > > > > > > > > > > > > > > > > > > > > > NaNs are less common in the SQL world than in the > DataFrame > > > > > world > > > > > > > > where > > > > > > > > > > > NaNs were used for a long time to represent missing > values. > > > > > They > > > > > > > > still > > > > > > > > > > > exist with different canonical representations and > > > different > > > > > sign > > > > > > > > > bits. I > > > > > > > > > > > agree it might not be correct semantically, but sadly > that > > > is > > > > > the > > > > > > > > world > > > > > > > > > > we > > > > > > > > > > > deal with. NumPy and Numba do not have missing data > > > > > > functionality, > > > > > > > > > people > > > > > > > > > > > use NaNs there, and people definitely use that in their > > > > > > analytical > > > > > > > > > > > dataflows. 
Another point that was brought up in the GH > > > > > discussion > > > > > > > was > > > > > > > > > > "what > > > > > > > > > > > about infinity? You could argue that having infinity in > > > > > > statistics > > > > > > > is > > > > > > > > > > > similarly unuseful as it's too wide of a bound". I > would > > > > argue > > > > > > that > > > > > > > > > > > infinity is very different as there is no discussion on > > > what > > > > > the > > > > > > > > > ordering > > > > > > > > > > > or pattern of infinity is. Everyone agrees that > `min(1.0, > > > > inf, > > > > > > > -inf) > > > > > > > > == > > > > > > > > > > > -inf` and each infinity only has a single bit pattern. > > > > > > > > > > > > > > > > > > > > > > > It gives a defined order to every bit pattern and > thus > > > > > yields a > > > > > > > > total > > > > > > > > > > > order, mathematically speaking, which has value by > itself. > > > > With > > > > > > NaN > > > > > > > > > > counts, > > > > > > > > > > > it was still undefined how different bit patterns of > NaNs > > > > were > > > > > > > > supposed > > > > > > > > > > to > > > > > > > > > > > be ordered, whether NaN was allowed to have a sign bit, > > > etc., > > > > > > > risking > > > > > > > > > > that > > > > > > > > > > > different engines could come to different results while > > > > > filtering > > > > > > > or > > > > > > > > > > > sorting values within a file. > > > > > > > > > > > > > > > > > > > > > > Since the proposal phrases it as a goal to work > "regardless > > > > of > > > > > > how > > > > > > > > they > > > > > > > > > > > order NaN w.r.t. other values" this statement feels > > > > > out-of-place > > > > > > to > > > > > > > > me. 
> > > > > > > > > > > Most hardware and most people don't care about total > > > ordering > > > > > and > > > > > > > > > needing > > > > > > > > > > > to take it into account while filtering using > statistics > > > > seems > > > > > > like > > > > > > > > > > > preferring the special case instead of the common case. > > > > Almost > > > > > > > noone > > > > > > > > > > > filters for specific NaN value bit-patterns. SQL > engines > > > that > > > > > > don't > > > > > > > > > have > > > > > > > > > > > IEEE total ordering as their default ordering for > floats > > > will > > > > > > also > > > > > > > > need > > > > > > > > > > to > > > > > > > > > > > do more special handling for this. > > > > > > > > > > > > > > > > > > > > > > I also agree with my colleague that doing an approach > that > > > is > > > > > 50% > > > > > > > of > > > > > > > > > the > > > > > > > > > > > way there will make the barrier to improving it to > what it > > > > > > actually > > > > > > > > > > should > > > > > > > > > > > be later on much higher. > > > > > > > > > > > > > > > > > > > > > > As for ways forward, I propose merging the `nan_count` > and > > > > > `sort > > > > > > > > > > ordering` > > > > > > > > > > > proposals into one to make one proposal, as they are > linked > > > > > > > together, > > > > > > > > > and > > > > > > > > > > > moving forward with one without knowing what will > happen to > > > > the > > > > > > > other > > > > > > > > > > seems > > > > > > > > > > > unwise. From a Polars perspective, having a > `nan_count` and > > > > > > > defining > > > > > > > > > what > > > > > > > > > > > happens to the `min` and `max` statistics when a page > > > > contains > > > > > > only > > > > > > > > > NaNs > > > > > > > > > > is > > > > > > > > > > > enough to allow for all predicate filtering. 
I think, > but > > > > > correct > > > > > > > me > > > > > > > > > if I > > > > > > > > > > > am wrong, this is also enough for all SQL engines that > > > don't > > > > > use > > > > > > > > total > > > > > > > > > > > ordering. But if you want to be impartial to the > engine's > > > > > > > > > floating-point > > > > > > > > > > > ordering and allow engines with total ordering to do > > > > inequality > > > > > > > > filters > > > > > > > > > > > when `nan_count > 0` you would need a > `positive_nan_count` > > > > and > > > > > a > > > > > > > > > > > `negative_nan_count`. I understand the downside with > Thrift > > > > > > > > complexity, > > > > > > > > > > but > > > > > > > > > > > introducing another sort order is also adding > complexity > > > just > > > > > in > > > > > > a > > > > > > > > > > > different place. > > > > > > > > > > > > > > > > > > > > > > I would really like to see this move forward, so I hope > > > these > > > > > > > > concerns > > > > > > > > > > help > > > > > > > > > > > move it forward towards a solution that works for > everyone. 
> > > > > > > > > > > > > > > > > > > > > > Kind regards, > > > > > > > > > > > Gijs > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Jul 31, 2025 at 7:46 PM Andrew Lamb < > > > > > > > [email protected]> > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > I would also be in favor of starting a vote > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Jul 31, 2025 at 11:23 AM Jan Finis < > > > > > [email protected]> > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > As the author of both the IEEE754 total order > > > > > > > > > > > > > <https://github.com/apache/parquet-format/pull/221> > PR > > > > and > > > > > > the > > > > > > > > > > earlier > > > > > > > > > > > > PR > > > > > > > > > > > > > that basically proposed `nan_count` > > > > > > > > > > > > > <https://github.com/apache/parquet-format/pull/196 > >, > > > my > > > > > > > current > > > > > > > > > vote > > > > > > > > > > > > would > > > > > > > > > > > > > be for IEEE754 total order. > > > > > > > > > > > > > Consequently, I would like to request a formal > vote for > > > > the > > > > > > PR > > > > > > > > > > > > introducing > > > > > > > > > > > > > IEEE754 total order ( > > > > > > > > > > https://github.com/apache/parquet-format/pull/221 > > > > > > > > > > > ), > > > > > > > > > > > > > if > > > > > > > > > > > > > that is possible. > > > > > > > > > > > > > > > > > > > > > > > > > > My Rationales: > > > > > > > > > > > > > > > > > > > > > > > > > > - It's conceptually simpler. It's easier to > explain. > > > > > It's > > > > > > > > based > > > > > > > > > on > > > > > > > > > > > an > > > > > > > > > > > > > IEEE-standardized order predicate. > > > > > > > > > > > > > - There are already multiple implementations > showing > > > > > > > > > feasibility. > > > > > > > > > > > This > > > > > > > > > > > > > will likely make the adoption quicker. 
> > > > > > > > > > > > > - It gives a defined order to every bit pattern and thus yields a total order, mathematically speaking, which has value by itself. With NaN counts, it was still undefined how different bit patterns of NaNs were supposed to be ordered, whether NaN was allowed to have a sign bit, etc., risking that different engines could come to different results while filtering or sorting values within a file.
> > > > > > > > > > > > > - It also solves sort order completely. With nan_counts only, it is still undefined whether NaNs should be sorted before or after all values (or both, depending on sign bit), so any file including NaNs could not really leverage sort order without being ambiguous.
> > > > > > > > > > > > > - It's less complex in Thrift. Having fields that only apply to a handful of data types is somehow weird. If every type did this, we would have a plethora of non-generic fields in Thrift.
> > > > > > > > > > > > > - Treating NaNs so specially is giving them attention they don't deserve. Most data sets do not contain NaNs. If a use case really requires them and needs filtering to ignore them, they can store NULL instead, or encode them differently. I would prefer the average case over the special case here.
> > > > > > > > > > > > > - The majority of the people discussing this so far seem to favor total order.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > Jan
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Sat, Jul 26, 2025 at 5:38 PM Gang Wu <[email protected]> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > As this discussion has been open for more than two years, I’d like to bump up this thread again to update the progress and collect feedback.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > *Background*
> > > > > > > > > > > > > > • Today Parquet’s min/max stats and page index omit NaNs entirely.
> > > > > > > > > > > > > > • Engines can’t safely prune floating-point values because they know nothing about NaNs.
> > > > > > > > > > > > > > • Column index is disabled if any page contains only NaNs.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > There are two active proposals as below:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > *Proposal A - IEEE754TotalOrder* (from the PR [1])
> > > > > > > > > > > > > > • Define a new ColumnOrder to include +0, –0 and all NaN bit‐patterns.
> > > > > > > > > > > > > > • Stats and column index store NaNs if they appear.
> > > > > > > > > > > > > > • Three PoC impls are ready: arrow-rs [2], duckdb [3] and parquet-java [4].
> > > > > > > > > > > > > > • For more context of this approach, please refer to discussion in [5].
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > *Proposal B - add nan_count* (from a comment [6] to [1])
> > > > > > > > > > > > > > • Add `nan_count` to stats and a `nan_counts` list to column index.
> > > > > > > > > > > > > > • For all‐NaNs cases, write NaN to min/max and use nan_count to distinguish.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Both solutions have pros and cons but are way better than the status quo today.
> > > > > > > > > > > > > > Please share your thoughts on the two proposals above, or maybe come up with better alternatives. We need consensus on one proposal and move forward.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > [1] https://github.com/apache/parquet-format/pull/221
> > > > > > > > > > > > > > [2] https://github.com/apache/arrow-rs/pull/7408
> > > > > > > > > > > > > > [3] https://github.com/duckdb/duckdb/compare/main...Mytherin:duckdb:ieeeorder
> > > > > > > > > > > > > > [4] https://github.com/apache/parquet-java/pull/3191
> > > > > > > > > > > > > > [5] https://github.com/apache/parquet-format/pull/196
> > > > > > > > > > > > > > [6] https://github.com/apache/parquet-format/pull/221#issuecomment-2931376077
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > Gang
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Tue, Mar 28, 2023 at 4:22 PM Jan Finis <[email protected]> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Dear contributors,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > My PR has now gathered comments for a week and the gist of all open issues is the question of how to encode pages/column chunks that contain only NaNs. There are different suggestions and I don't see one common favorite yet.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I have outlined three alternatives of how we can handle these and I want us to reach a conclusion here, so I can update my PR accordingly and move on with it. As this is my first contribution to parquet, I don't know the decision processes here. Do we vote? Is there a single or group of decision makers? *Please let me know how to come to a conclusion here; what are the next steps?*
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > For reference, here are the three alternatives I pointed out. You can find detailed description of their PROs and CONs in my comment:
> > > > > > > > > > > > > > > https://github.com/apache/parquet-format/pull/196#issuecomment-1486416762
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 1. My initial proposal, i.e., encoding only-NaN pages by min=max=NaN.
> > > > > > > > > > > > > > > 2. Adding `num_values` to the ColumnIndex, to make it symmetric with Statistics in pages & `ColumnMetaData` and to enable the computation `num_values - null_count - nan_count == 0`
> > > > > > > > > > > > > > > 3. Adding a `nan_pages` bool list to the column index, which indicates whether a page contains only NaNs
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Cheers
> > > > > > > > > > > > > > > Jan Finis
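For readers following the thread, here is a minimal Python sketch of the two mechanisms discussed above: a sort key that orders doubles the way the IEEE 754 totalOrder predicate does (negative NaNs first, then -inf, ..., -0, +0, ..., +inf, positive NaNs), and the only-NaN page check `num_values - null_count - nan_count == 0`. The function names `double_from_bits`, `total_order_key`, and `page_is_only_nan` are hypothetical and are not taken from parquet-java, arrow-rs, or the spec PRs. The extra guard that a page must hold at least one non-null value is an assumption added here, not something stated in the thread.

```python
import struct


def double_from_bits(bits: int) -> float:
    """Build a double from a raw 64-bit pattern (helper for illustration only)."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def total_order_key(x: float) -> int:
    """Map a double to an unsigned integer whose natural order matches
    IEEE 754 totalOrder, so every bit pattern gets a defined position."""
    # Reinterpret the double's 64 bits as an unsigned integer.
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    if bits >> 63:
        # Sign bit set: invert all bits so larger magnitudes sort lower.
        return bits ^ 0xFFFFFFFFFFFFFFFF
    # Sign bit clear: set the top bit so these sort above every negative.
    return bits | 0x8000000000000000


def page_is_only_nan(num_values: int, null_count: int, nan_count: int) -> bool:
    """Check from the thread: num_values - null_count - nan_count == 0.

    Assumption (not from the thread): an all-null page is not treated
    as an only-NaN page, hence the non_null > 0 guard.
    """
    non_null = num_values - null_count
    return non_null > 0 and non_null - nan_count == 0


if __name__ == "__main__":
    neg_nan = double_from_bits(0xFFF8000000000000)
    pos_nan = double_from_bits(0x7FF8000000000000)
    values = [1.0, pos_nan, 0.0, neg_nan, -0.0, float("inf"), float("-inf")]
    # Sorting by the key places both NaNs at well-defined ends of the order.
    for v in sorted(values, key=total_order_key):
        print(repr(v), hex(total_order_key(v)))
```

With a key like this, a writer could compute min/max under the total order without special-casing NaN, while a reader given a `nan_count` could still tell an only-NaN page apart from a page that merely contains some NaNs.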
