Update on the progress of PARQUET-2249.

We now have two complete PoC implementations for the combined IEEE 754 total order and nan_count approach:

- Java: https://github.com/apache/parquet-java/pull/3393
- Rust: https://github.com/apache/arrow-rs/pull/9619 (Thanks Ed!)

The spec PR is available here:
https://github.com/apache/parquet-format/pull/514

We have also added a test file to parquet-testing for interoperability tests, which has been verified by both parquet-java and arrow-rs:
https://github.com/apache/parquet-testing/pull/104

I'd like to encourage everyone to take another look at the current proposal and implementation. Any feedback or suggestions are welcome. If there are no further objections, I will move forward with a formal vote.

Best regards,
Gang

On Mon, Mar 16, 2026 at 11:30 AM Gang Wu <[email protected]> wrote:

> Thanks Zehua! Really appreciate it!
>
> On Mon, Mar 16, 2026 at 10:40 AM Zehua Zou <[email protected]> wrote:
>
>> Hello Gang and others,
>>
>> I am willing to implement the C++ POC.
>>
>>
>>
>> > On Mar 14, 2026, at 23:56, Gang Wu <[email protected]> wrote:
>> >
>> > Update:
>> >
>> > Java POC is ready for IEEE 754 column order combined with nan_count:
>> > https://github.com/apache/parquet-java/pull/3393
>> >
>> > The spec PR has been updated earlier to address all comments:
>> > https://github.com/apache/parquet-format/pull/514
>> >
>> > Really appreciate any review and feedback!
>> >
>> > Best,
>> > Gang
>> >
>> >
>> >
>> >
>> > On Wed, Feb 11, 2026 at 4:24 PM Gang Wu <[email protected]> wrote:
>> >
>> >> Hello all,
>> >>
>> >> I'm reaching out to help drive this long-running discussion, nearly
>> >> three years now, towards a final resolution. With Jan's authorization,
>> >> and my sincere thanks for his sustained effort, I want to help push
>> >> this issue to the finish line.
>> >>
>> >> To recap, we have two primary proposals on how to handle NaNs in
>> >> statistics and column indexes:
>> >>
>> >> * IEEE 754 Total Order [1]: Proposes adding a new column order
>> >> IEEE754TotalOrder for FLOAT/DOUBLE/FLOAT16. This provides a defined
>> >> ordering for every float bit pattern, including NaNs and -0/+0,
>> >> allowing writers to include NaNs in min/max and removing ambiguity for
>> >> only-NaN pages.
>> >> * Combined Approach [2]: Proposes adopting the IEEE 754 total order
>> >> alongside explicit nan_count(s) fields. This approach mandates the
>> >> nan_count(s) when the new order is used and clarifies how to handle
>> >> edge cases from legacy writers.
>> >>
>> >> Based on the recent comments, it appears the combined approach [2] is
>> >> gaining consensus, although the IEEE 754 total order [1] still has
>> >> strong advocates.
>> >>
>> >> I agree with the sentiment that technical direction should be made by
>> >> consensus, not a vote. To that end, I'd like to solicit further
>> >> feedback specifically on the combined approach [2] to see if we can
>> >> achieve the necessary consensus to move forward now.
>> >>
>> >> I recall that the total order proposal [1] already has three PoC
>> >> implementations. For the combined approach [2], I can draft a PoC in
>> >> parquet-java, but to meet the two-implementation requirement, we would
>> >> need one more contributor to step up.
>> >>
>> >> [1] https://github.com/apache/parquet-format/pull/221
>> >> [2] https://github.com/apache/parquet-format/pull/514
>> >>
>> >> Best,
>> >> Gang
>> >>
>> >>
>> >> On Sat, Aug 16, 2025 at 1:59 AM Gijs Burghoorn <[email protected]> wrote:
>> >>>
>> >>> Hello Jan,
>> >>>
>> >>> Thank you for pushing this through. Apart from some smaller nits, we also really like the current proposal.
>> >>>
>> >>> Thanks,
>> >>> Gijs
>> >>>
>> >>> On Fri, Aug 15, 2025 at 3:33 PM Andrew Lamb <[email protected]> wrote:
>> >>>
>> >>>> I have started organizing a project [1] in arrow-rs's Parquet reader to try and implement this proposal.
>> >>>>
>> >>>> Hopefully that can be 1 of the 2 open source implementations needed.
>> >>>>
>> >>>> Thanks again for helping drive this along,
>> >>>> Andrew
>> >>>>
>> >>>> [1] https://github.com/apache/arrow-rs/issues/8156
>> >>>>
>> >>>> On Wed, Aug 13, 2025 at 5:39 AM Jan Finis <[email protected]> wrote:
>> >>>>
>> >>>>> I have now tagged <https://github.com/apache/parquet-format/pull/514#issuecomment-3182978173> the people that argued for total order in the initial PR. Let's see their response.
>> >>>>>
>> >>>>> If I understand the adoption process correctly, the next hurdle to getting this adopted is two open source (!) implementations proving its feasibility. We already had that for IEEE total order. If we prefer the solution with nan counts, we'll need it there as well. I myself work on a proprietary implementation, so I'm counting on others here :). Be prepared though, this will likely take months unless the interest in this topic has risen to a point where people are eager to jump on the implementation right away.
>> >>>>>
>> >>>>> So, I guess it will take some months of soaking time before any formal vote can be done (given that we reach consensus that this is what we want and we find people for the implementations).
>> >>>>>
>> >>>>> Cheers,
>> >>>>> Jan
>> >>>>>
>> >>>>> On Wed, Aug 13, 2025 at 1:18 AM Ryan Blue <[email protected]> wrote:
>> >>>>>
>> >>>>>> Thanks, Jan. I also went through the combined proposal and it looks mostly good to me.
>> >>>>>>
>> >>>>>>> First of all, to make it quick: Yes, the solution of having nan_counts *and* total order, which was brought up multiple times, does work and solves more cases than just either of both.
>> >>>>>>
>> >>>>>> Great, then we have a solution for both filtering use cases and for moving ahead with total order. And thanks to Andrew for suggesting this as well on the second PR. I think this also looks like something that Orson is okay with given his comments on the latest PR.
>> >>>>>>
>> >>>>>> Is there anyone against the combined approach? I don't see a big downside for anyone. It is compatible with previous stats rules, has a NaN count, and allows using either type-specific order or total order.
>> >>>>>>
>> >>>>>> Assuming that this satisfies the big objections, I think we should wait for a few days to make sure everyone has time to check out the new PR and then vote to adopt it.
>> >>>>>>
>> >>>>>> Ryan
>> >>>>>>
>> >>>>>> On Mon, Aug 11, 2025 at 6:03 AM Andrew Lamb <[email protected]> wrote:
>> >>>>>>
>> >>>>>>> Thank you Jan -- I read through the new combined proposal, and I thought it looks good and addresses the feedback so far. I left some small style suggestions, but nothing that is required from my perspective.
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> On Sat, Aug 9, 2025 at 9:07 AM Jan Finis <[email protected]> wrote:
>> >>>>>>>
>> >>>>>>>> Hey Ryan,
>> >>>>>>>>
>> >>>>>>>> Thanks for chiming in. First of all, to make it quick: Yes, the solution of having nan_counts *and* total order, which was brought up multiple times, does work and solves more cases than just either of both.
>> >>>>>>>>
>> >>>>>>>>> I strongly prefer continuing to discuss the merits of these approaches rather than trying to decide with a vote.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> In theory, I agree that it isn't good to silence a discussion by just voting for one possible solution and technical issues should be discussed. However, please note that we have been circling on this for over two years now, including an extended discussion that brought up all arguments multiple times. This is in stark contrast to the speed with which you guys work on the Iceberg spec, for example. There, you also do not discuss the merits of various solutions for multiple years. You just pick one and merge it after a *reasonable* time of discussion. If you had the speed we currently have here, nothing would get done. Thus, I see this as a clear case of *"the perfect is the enemy of the good"*. Yes, we can continue looking for the perfect solution, but that will likely lead to keeping us at the status quo, which is the worst of them all.
>> >>>>>>>>
>> >>>>>>>> That being said, I'm also happy to create a PR which does both total order and NaN counts; after all, I just want the issue solved and all these solutions are better than the status quo.
>> >>>>>>>>
>> >>>>>>>> *As this was now suggested by at least three people, I guess it's worth doing, so here you go: https://github.com/apache/parquet-format/pull/514*
>> >>>>>>>>
>> >>>>>>>> With this, we should have PRs covering most of the solution space.
>> >>>>>>>> (I'm refusing to create a PR with negative and positive nan_counts; nan_counts + total order has to suffice; the complexity madness has to stop somewhere)
>> >>>>>>>> I still believe that there were a number of people who already found nan_counts too complex and therefore wanted IEEE total order, and these people may not like putting on extra complexity, but let's see, maybe some have also changed their opinion in the meantime.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> *Given all this, we can also first do an informal vote where everyone can vote for which of the three their favorite would be. Maybe a clear favorite will emerge and then we can vote on this one.*
>> >>>>>>>>
>> >>>>>>>> But of course, we can also take some weeks to discuss the three solutions, now that we have PRs for all of them. I just hope this won't make us continue for another 2 years, or an infinite stalemate where each solution is vetoed by a PMC member.
>> >>>>>>>> (Sorry for becoming a bit cynical here; I have just spent way too much time of my life with double statistics at this point ;) ...)
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> Cheers,
>> >>>>>>>> Jan
>> >>>>>>>>
>> >>>>>>>> On Fri, Aug 8, 2025 at 11:38 PM Ryan Blue <[email protected]> wrote:
>> >>>>>>>>
>> >>>>>>>>> Regarding the process for this, I strongly prefer continuing to discuss the merits of these approaches rather than trying to decide with a vote. I don't think it is a good practice to use a vote to decide on a technical direction.
>> >>>>>>>>> There are very few situations that warrant it and I don't think that this is one of them. While this issue has been open for a long time, that appears to be the result of it not being anyone's top priority rather than indecision.
>> >>>>>>>>>
>> >>>>>>>>> For the technical merits of these approaches, I think that we can find a middle ground. I agree with Jan that when working with sorted values, we need to know how NaN values were handled and that requires using a well-defined order that includes NaN and its variations (because we should not normalize). Using NaN count is not sufficient for ordering rows.
>> >>>>>>>>>
>> >>>>>>>>> Gijs also brings up good points about how NaN values show up in actual datasets: not just when used in place of null, but also as the result of normal calculations on abnormal data, like `sqrt(-4.0)` or `log(-1.0)`. Both of those present problems when mixed with valid data because of the stats "poisoning" problem, where the range of valid data is usable until a single NaN is mixed in.
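To make the "poisoning" problem above concrete, here is a minimal Python sketch. Both function names are hypothetical and are not taken from any Parquet implementation. The first folds min/max with plain `<` and `>`, the way a writer unaware of NaN might. The second keeps NaN out of the bounds and counts it separately, which is roughly the `nan_count` idea discussed in this thread. Returning `None` for an all-NaN input is an assumption made for illustration, not what the spec PR prescribes.

```python
import math


def naive_min_max(values):
    """Fold min/max using plain comparisons (hypothetical NaN-unaware writer)."""
    lo = hi = values[0]
    for v in values[1:]:
        # Every ordered comparison involving NaN is False, so a NaN that
        # arrives first is never replaced, and a later NaN is never picked up.
        if v < lo:
            lo = v
        if v > hi:
            hi = v
    return lo, hi


def stats_with_nan_count(values):
    """Min/max over the non-NaN values plus a separate NaN count."""
    non_nan = [v for v in values if not math.isnan(v)]
    nan_count = len(values) - len(non_nan)
    if not non_nan:
        # Assumption for illustration: no bounds when every value is NaN.
        return None, None, nan_count
    return min(non_nan), max(non_nan), nan_count


nan = float("nan")
print(naive_min_max([1.0, nan, 3.0]))         # bounds ignore the NaN
print(naive_min_max([nan, 1.0, 3.0]))         # bounds are stuck at NaN
print(stats_with_nan_count([nan, 1.0, 3.0]))  # usable bounds, NaN counted
```

With the naive fold, the same three values yield either usable bounds or NaN bounds depending only on where the NaN sits, so a reader cannot trust either result without extra information.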
>> >>>>>>>>>
>> >>>>>>>>> Another issue is that NaN is error-prone because "regular" comparison is always false:
>> >>>>>>>>> ```
>> >>>>>>>>> Math.log(-1.0) >= 2 => FALSE
>> >>>>>>>>> Math.log(-1.0) < 2 => FALSE
>> >>>>>>>>> 2 > Math.log(-1.0) => FALSE
>> >>>>>>>>> ```
>> >>>>>>>>>
>> >>>>>>>>> As a result, Iceberg doesn't trust NaN values as either lower or upper bounds because we don't want to go back to the code that produced the value to see what the comparison order was to determine whether NaN values go before or after others.
>> >>>>>>>>>
>> >>>>>>>>> Total order solves the second issue in theory, but regular comparison is prevalent and not obvious to developers. And it also doesn't help when NaN is used instead of null. So using total order is not sufficient for data skipping.
>> >>>>>>>>>
>> >>>>>>>>> I think the right compromise is to use `min`, `max`, and `nan_count` for data skipping stats (where min and max cannot be NaN) and total ordering for sorting values. That satisfies the data skipping use cases and also gives us an ordering of unaltered values that we can reason about.
>> >>>>>>>>>
>> >>>>>>>>> Does anyone think that doesn't work?
>> >>>>>>>>>
>> >>>>>>>>> Ryan
>> >>>>>>>>>
>> >>>>>>>>> On Fri, Aug 1, 2025 at 8:57 AM Gang Wu <[email protected]> wrote:
>> >>>>>>>>>
>> >>>>>>>>>> Thanks Jan for your endless effort on this!
>> >>>>>>>>>>
>> >>>>>>>>>> I'm in favor of simplicity and generality. I think we have already debated `nan_count` a lot in [1], and [2] is the reflection of those discussions.
>> >>>>>>>>>> Therefore I am inclined to start a vote for [2] unless there is a significantly better proposal.
>> >>>>>>>>>>
>> >>>>>>>>>> I would suggest that everyone interested in this discussion attend the scheduled sync on Aug 6th (detailed below) to spread the word to the broader community. If we can get a consensus on [2], I can help start the vote and move forward.
>> >>>>>>>>>>
>> >>>>>>>>>> *Apache Parquet Community Sync Wednesday, August 6 · 10:00 – 11:00am*
>> >>>>>>>>>> *Time zone: America/Los_Angeles*
>> >>>>>>>>>> *Google Meet joining info Video call link: https://meet.google.com/bhe-rvan-qjk*
>> >>>>>>>>>>
>> >>>>>>>>>> [1] https://github.com/apache/parquet-format/pull/196
>> >>>>>>>>>> [2] https://github.com/apache/parquet-format/pull/221
>> >>>>>>>>>>
>> >>>>>>>>>> Best,
>> >>>>>>>>>> Gang
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> On Fri, Aug 1, 2025 at 6:16 PM Jan Finis <[email protected]> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>>> Hi Gijs,
>> >>>>>>>>>>>
>> >>>>>>>>>>> Thank you for bringing up concrete points, I'm happy to discuss them in detail.
>> >>>>>>>>>>>
>> >>>>>>>>>>>> NaNs are less common in the SQL world than in the DataFrame world where NaNs were used for a long time to represent missing values.
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> You could transcode between NULL and NaN before reading and writing to Parquet. You basically mention yourself that NaNs were used for missing values, i.e., what is commonly a NULL, which wasn't available.
>> >>>>>>>>>>> So, semantically, transcoding to NULL would even be the sane thing to do. Yes, that will cost you some cycles, but should be a rather lightweight operation in comparison to most other operations, so I would argue that it won't totally ruin your performance. Similarly, why should Parquet play along with a "hack" that was done in other frameworks due to shortcomings of those frameworks? So from a philosophical point of view, I think supporting NaNs better is the wrong thing to do. Rather, we should be a forcing function to align others to better behavior, so applying a bit of force might in the long run make people use NULLs also in DataFrames.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Of course, your argument also goes into the direction of pragmatism: If a large part of the data science world uses NaNs to encode missing values, then maybe Parquet should accept this de-facto standard rather than fighting it. That is indeed a valid point. The weight of it is debatable and my personal conclusion is that it's still not worth it, as you can transcode between NULLs and NaNs, but I do agree with its validity.
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>> Since the proposal phrases it as a goal to work "regardless of how they order NaN w.r.t. other values" this statement feels out-of-place to me.
>> >>>>>>>>>>>> Most hardware and most people don't care about total ordering and needing to take it into account while filtering using statistics seems like preferring the special case instead of the common case. Almost no one filters for specific NaN value bit-patterns. SQL engines that don't have IEEE total ordering as their default ordering for floats will also need to do more special handling for this.
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> I disagree with the conclusion this statement draws. The current behavior, and nan_counts without total ordering, pose a real problem here, even for engines that don't care about bit patterns. I do agree that most database engines, including the one I'm working on, do not care about bit patterns and/or sign bits. However, how can our database engine know whether the writer of a Parquet file saw it the same way? It can't. Therefore, it cannot know whether a writer, for example, ordered NaNs before or after all other numbers, or maybe ordered them by sign bit.
>> >>>>>>>>>>> So, if our database engine now sees a float column in sorting columns, it cannot apply any optimization without a lot of special casing, as it doesn't know whether NaNs will be before all other values, after all other values, or maybe both, depending on sign bit. It could apply contrived logic that tries to infer where NaNs were placed from the NaN counts of the first and last page, but doing so will be a lot of ugly code that also feels to be in the wrong place. I.e., I don't want to need to load pages or the page index, just to reason about a sort order.
>> >>>>>>>>>>>
>> >>>>>>>>>>>> SQL engines that don't have IEEE total ordering as their default ordering for floats will also need to do more special handling for this.
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> This code, which I would indeed need to write for our engine, is comparatively trivial. Simply choose the largest possible bit pattern as comparison for upper bounds filtering for NaN, and the smallest possible bit pattern for lower bounds. It's not more than a few lines of code that check whether a filter is NaN and then replace its value with the highest/lowest NaN bit pattern.
>> >>>>>>>>>>> It is similarly trivial to the special casing I need to do with nan_counts, and it is way more trivial than the extra code I would need to write for sorting columns, as depicted above.
>> >>>>>>>>>>>
>> >>>>>>>>>>>> From a Polars perspective, having a `nan_count` and defining what happens to the `min` and `max` statistics when a page contains only NaNs is enough to allow for all predicate filtering. I think, but correct me if I am wrong, this is also enough for all SQL engines that don't use total ordering.
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> It's not fully enough, as depicted above. Sorting columns would still not work properly.
>> >>>>>>>>>>>
>> >>>>>>>>>>>> As for ways forward, I propose merging the `nan_count` and `sort ordering` proposals into one to make one proposal
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> Note that the initial reason for proposing IEEE total order was that people in the discussion threads found nan_counts to be too complex and too much of an undeserving special case (re-read the discussion in the initial PR <https://github.com/apache/parquet-format/pull/196> to see the rationales). So merging both together would go totally against the spirit of why IEEE total order was proposed.
>> >>>>>>>>>>> While it has further upsides, the main reason was indeed to *not have* nan_counts. If now the proposal would even go to positive and negative nan counts (i.e., even more complexity), this would go 180 degrees into the opposite direction of why people wanted total order in the first place.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Cheers,
>> >>>>>>>>>>> Jan
>> >>>>>>>>>>>
>> >>>>>>>>>>> On Thu, Jul 31, 2025 at 11:23 PM Gijs Burghoorn <[email protected]> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>>> Hello Jan and others,
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> First, let me preface by saying I am quite new here. So I apologize if there is some other better way to bring up these concerns. I understand it is very annoying to come in at the 11th hour and start bringing up a bunch of concerns, but I would also like this to be done right. A colleague of mine brought up some concerns and alternative approaches in the GitHub thread; I will file some of the concerns here as a response.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>> Treating NaNs so specially is giving them attention they don't deserve. Most data sets do not contain NaNs. If a use case really requires them and needs filtering to ignore them, they can store NULL instead, or encode them differently. I would prefer the average case over the special case here.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> NaNs are less common in the SQL world than in the DataFrame world where NaNs were used for a long time to represent missing values. They still exist with different canonical representations and different sign bits. I agree it might not be correct semantically, but sadly that is the world we deal with. NumPy and Numba do not have missing data functionality, people use NaNs there, and people definitely use that in their analytical dataflows. Another point that was brought up in the GH discussion was "what about infinity? You could argue that having infinity in statistics is similarly unuseful as it's too wide of a bound". I would argue that infinity is very different as there is no discussion on what the ordering or pattern of infinity is. Everyone agrees that `min(1.0, inf, -inf) == -inf` and each infinity only has a single bit pattern.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>> It gives a defined order to every bit pattern and thus yields a total order, mathematically speaking, which has value by itself.
>> >>>>>>>>>>>>> With NaN counts, it was still undefined how different bit patterns of NaNs were supposed to be ordered, whether NaN was allowed to have a sign bit, etc., risking that different engines could come to different results while filtering or sorting values within a file.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Since the proposal phrases it as a goal to work "regardless of how they order NaN w.r.t. other values" this statement feels out-of-place to me. Most hardware and most people don't care about total ordering and needing to take it into account while filtering using statistics seems like preferring the special case instead of the common case. Almost no one filters for specific NaN value bit-patterns. SQL engines that don't have IEEE total ordering as their default ordering for floats will also need to do more special handling for this.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> I also agree with my colleague that doing an approach that is 50% of the way there will make the barrier to improving it to what it actually should be later on much higher.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> As for ways forward, I propose merging the `nan_count` and `sort ordering` proposals into one to make one proposal, as they are linked together, and moving forward with one without knowing what will happen to the other seems unwise. From a Polars perspective, having a `nan_count` and defining what happens to the `min` and `max` statistics when a page contains only NaNs is enough to allow for all predicate filtering. I think, but correct me if I am wrong, this is also enough for all SQL engines that don't use total ordering. But if you want to be impartial to the engine's floating-point ordering and allow engines with total ordering to do inequality filters when `nan_count > 0` you would need a `positive_nan_count` and a `negative_nan_count`. I understand the downside with Thrift complexity, but introducing another sort order is also adding complexity just in a different place.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> I would really like to see this move forward, so I hope these concerns help move it forward towards a solution that works for everyone.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Kind regards,
>> >>>>>>>>>>>> Gijs
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> On Thu, Jul 31, 2025 at 7:46 PM Andrew Lamb <[email protected]> wrote:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>> I would also be in favor of starting a vote
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> On Thu, Jul 31, 2025 at 11:23 AM Jan Finis <[email protected]> wrote:
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>> As the author of both the IEEE754 total order PR <https://github.com/apache/parquet-format/pull/221> and the earlier PR that basically proposed `nan_count` <https://github.com/apache/parquet-format/pull/196>, my current vote would be for IEEE754 total order.
>> >>>>>>>>>>>>>> Consequently, I would like to request a formal vote for the PR introducing IEEE754 total order (https://github.com/apache/parquet-format/pull/221), if that is possible.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> My Rationales:
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> - It's conceptually simpler. It's easier to explain. It's based on an IEEE-standardized order predicate.
>> >>>>>>>>>>>>>> - There are already multiple implementations showing feasibility. This will likely make the adoption quicker.
>> >>>>>>>>>>>>>> - It gives a defined order to every bit pattern and thus yields a total order, mathematically speaking, which has value by itself.
>> >>>>>>>>>>>>>>   With NaN counts, it was still undefined how different bit
>> >>>>>>>>>>>>>>   patterns of NaNs were supposed to be ordered, whether NaN
>> >>>>>>>>>>>>>>   was allowed to have a sign bit, etc., risking that
>> >>>>>>>>>>>>>>   different engines could come to different results while
>> >>>>>>>>>>>>>>   filtering or sorting values within a file.
>> >>>>>>>>>>>>>> - It also solves sort order completely. With nan_counts
>> >>>>>>>>>>>>>>   only, it is still undefined whether NaNs should be sorted
>> >>>>>>>>>>>>>>   before or after all values (or both, depending on sign
>> >>>>>>>>>>>>>>   bit), so any file including NaNs could not really leverage
>> >>>>>>>>>>>>>>   sort order without being ambiguous.
>> >>>>>>>>>>>>>> - It's less complex in thrift. Having fields that only apply
>> >>>>>>>>>>>>>>   to a handful of data types is somehow weird. If every type
>> >>>>>>>>>>>>>>   did this, we would have a plethora of non-generic fields
>> >>>>>>>>>>>>>>   in thrift.
>> >>>>>>>>>>>>>> - Treating NaNs so specially is giving them attention they
>> >>>>>>>>>>>>>>   don't deserve. Most data sets do not contain NaNs. If a
>> >>>>>>>>>>>>>>   use case really requires them and needs filtering to
>> >>>>>>>>>>>>>>   ignore them, they can store NULL instead, or encode them
>> >>>>>>>>>>>>>>   differently. I would prefer the average case over the
>> >>>>>>>>>>>>>>   special case here.
>> >>>>>>>>>>>>>> - The majority of the people discussing this so far seem to
>> >>>>>>>>>>>>>>   favor total order.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Cheers,
>> >>>>>>>>>>>>>> Jan
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> On Sat, Jul 26, 2025 at 5:38 PM Gang Wu <[email protected]>
>> >>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Hi all,
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> As this discussion has been open for more than two years,
>> >>>>>>>>>>>>>>> I'd like to bump up this thread again to update the
>> >>>>>>>>>>>>>>> progress and collect feedback.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> *Background*
>> >>>>>>>>>>>>>>> • Today Parquet's min/max stats and page index omit NaNs
>> >>>>>>>>>>>>>>>   entirely.
>> >>>>>>>>>>>>>>> • Engines can't safely prune floating-point values because
>> >>>>>>>>>>>>>>>   they know nothing about NaNs.
>> >>>>>>>>>>>>>>> • Column index is disabled if any page contains only NaNs.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> There are two active proposals as below:
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> *Proposal A - IEEE754TotalOrder* (from the PR [1])
>> >>>>>>>>>>>>>>> • Define a new ColumnOrder to include +0, -0 and all NaN
>> >>>>>>>>>>>>>>>   bit-patterns.
>> >>>>>>>>>>>>>>> • Stats and column index store NaNs if they appear.
>> >>>>>>>>>>>>>>> • Three PoC impls are ready: arrow-rs [2], duckdb [3] and
>> >>>>>>>>>>>>>>>   parquet-java [4].
>> >>>>>>>>>>>>>>> • For more context of this approach, please refer to
>> >>>>>>>>>>>>>>>   discussion in [5].
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> *Proposal B - add nan_count* (from a comment [6] to [1])
>> >>>>>>>>>>>>>>> • Add `nan_count` to stats and a `nan_counts` list to
>> >>>>>>>>>>>>>>>   column index.
>> >>>>>>>>>>>>>>> • For all-NaNs cases, write NaN to min/max and use
>> >>>>>>>>>>>>>>>   nan_count to distinguish.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Both solutions have pros and cons but are way better than
>> >>>>>>>>>>>>>>> the status quo today.
>> >>>>>>>>>>>>>>> Please share your thoughts on the two proposals above, or
>> >>>>>>>>>>>>>>> maybe come up with better alternatives. We need consensus
>> >>>>>>>>>>>>>>> on one proposal and move forward.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> [1] https://github.com/apache/parquet-format/pull/221
>> >>>>>>>>>>>>>>> [2] https://github.com/apache/arrow-rs/pull/7408
>> >>>>>>>>>>>>>>> [3] https://github.com/duckdb/duckdb/compare/main...Mytherin:duckdb:ieeeorder
>> >>>>>>>>>>>>>>> [4] https://github.com/apache/parquet-java/pull/3191
>> >>>>>>>>>>>>>>> [5] https://github.com/apache/parquet-format/pull/196
>> >>>>>>>>>>>>>>> [6] https://github.com/apache/parquet-format/pull/221#issuecomment-2931376077
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Best,
>> >>>>>>>>>>>>>>> Gang
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> On Tue, Mar 28, 2023 at 4:22 PM Jan Finis <[email protected]>
>> >>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Dear contributors,
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> My PR has now gathered comments for a week and the gist
>> >>>>>>>>>>>>>>>> of all open issues is the question of how to encode
>> >>>>>>>>>>>>>>>> pages/column chunks that contain only NaNs. There are
>> >>>>>>>>>>>>>>>> different suggestions and I don't see one common favorite
>> >>>>>>>>>>>>>>>> yet.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> I have outlined three alternatives of how we can handle
>> >>>>>>>>>>>>>>>> these and I want us to reach a conclusion here, so I can
>> >>>>>>>>>>>>>>>> update my PR accordingly and move on with it. As this is
>> >>>>>>>>>>>>>>>> my first contribution to parquet, I don't know the
>> >>>>>>>>>>>>>>>> decision processes here. Do we vote? Is there a single
>> >>>>>>>>>>>>>>>> decision maker or a group of decision makers? *Please let
>> >>>>>>>>>>>>>>>> me know how to come to a conclusion here; what are the
>> >>>>>>>>>>>>>>>> next steps?*
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> For reference, here are the three alternatives I pointed
>> >>>>>>>>>>>>>>>> out. You can find a detailed description of their PROs
>> >>>>>>>>>>>>>>>> and CONs in my comment:
>> >>>>>>>>>>>>>>>> https://github.com/apache/parquet-format/pull/196#issuecomment-1486416762
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> 1. My initial proposal, i.e., encoding only-NaN pages by
>> >>>>>>>>>>>>>>>>    min=max=NaN.
>> >>>>>>>>>>>>>>>> 2. Adding `num_values` to the ColumnIndex, to make it
>> >>>>>>>>>>>>>>>>    symmetric with Statistics in pages & `ColumnMetaData`
>> >>>>>>>>>>>>>>>>    and to enable the computation
>> >>>>>>>>>>>>>>>>    `num_values - null_count - nan_count == 0`
>> >>>>>>>>>>>>>>>> 3. Adding a `nan_pages` bool list to the column index,
>> >>>>>>>>>>>>>>>>    which indicates whether a page contains only NaNs
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Cheers
>> >>>>>>>>>>>>>>>> Jan Finis
>> >>>>>>>>>>>>>>>>
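
For readers following the thread, here is a minimal Python sketch of the two mechanisms discussed above. It is not taken from any of the linked PRs. The helper names (`from_bits`, `total_order_key`, `has_only_nans_or_nulls`) are hypothetical, and the key function uses the common sign-magnitude bit trick as one way to realize the IEEE 754 totalOrder predicate for doubles. The second function simply evaluates the `num_values - null_count - nan_count == 0` expression from alternative 2.

```python
import math
import struct


def from_bits(bits: int) -> float:
    """Build a double from a raw 64-bit pattern (used here to get a negative NaN)."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def total_order_key(x: float) -> int:
    """Map a double to an integer whose natural order follows IEEE 754 totalOrder.

    Assumption: this is the usual bit trick, not code from the spec PR.
    """
    # Reinterpret the 64 bits of the double as a signed integer.
    bits = struct.unpack("<q", struct.pack("<d", x))[0]
    # Negative floats are sign-magnitude: flip the 63 magnitude bits so that
    # a larger magnitude sorts lower. Non-negative floats already sort correctly.
    if bits < 0:
        bits ^= 0x7FFFFFFFFFFFFFFF
    return bits


def has_only_nans_or_nulls(num_values: int, null_count: int, nan_count: int) -> bool:
    """True when a page holds no value other than NULL or NaN."""
    return num_values - null_count - nan_count == 0


POS_NAN = from_bits(0x7FF8000000000000)  # quiet NaN, sign bit clear
NEG_NAN = from_bits(0xFFF8000000000000)  # quiet NaN, sign bit set

values = [1.0, POS_NAN, -0.0, NEG_NAN, 0.0, -math.inf, math.inf, -1.0]
# Under total order: -NaN < -inf < -1.0 < -0.0 < +0.0 < 1.0 < +inf < +NaN
ordered = sorted(values, key=total_order_key)
```

With such a key, every bit pattern has a defined position, which is the property Proposal A relies on. A reader of the `nan_count` variant would instead consult the counts before trusting min/max for pruning.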
