Thanks Jan for your endless effort on this! I'm in favor of simplicity and
generality. I think we have already debated `nan_count` at length in [1], and
[2] reflects those discussions. Therefore I am inclined to start a vote on
[2] unless there is a significantly better proposal.

I would encourage everyone interested in this discussion to attend the
scheduled sync on Aug 6th (details below) so that we can spread the word to
the broader community. If we can reach consensus on [2], I can help start the
vote and move forward.

*Apache Parquet Community Sync*
*Wednesday, August 6 · 10:00 – 11:00am*
*Time zone: America/Los_Angeles*
*Google Meet joining info*
*Video call link: https://meet.google.com/bhe-rvan-qjk*

[1] https://github.com/apache/parquet-format/pull/196
[2] https://github.com/apache/parquet-format/pull/221

Best,
Gang

On Fri, Aug 1, 2025 at 6:16 PM Jan Finis wrote:

> Hi Gijs,
>
> Thank you for bringing up concrete points, I'm happy to discuss them in
> detail.
>
> > NaNs are less common in the SQL world than in the DataFrame world where
> > NaNs were used for a long time to represent missing values.
>
> You could transcode between NULL and NaN when reading from and writing to
> Parquet. You basically mention yourself that NaNs were used for missing
> values, i.e., for what is commonly a NULL, which wasn't available. So,
> semantically, transcoding to NULL would even be the sane thing to do. Yes,
> that will cost you some cycles, but it should be a rather lightweight
> operation in comparison to most other operations, so I would argue that it
> won't totally ruin your performance. Similarly, why should Parquet play
> along with a "hack" that was done in other frameworks due to shortcomings
> of those frameworks? So from a philosophical point of view, I think
> supporting NaNs better is the wrong thing to do. Rather, we should be a
> forcing function to align others to better behavior, so applying a bit of
> force might in the long run make people use NULLs in DataFrames as well.
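
The transcoding described above can be sketched in a few lines. This is only a minimal illustration on plain Python lists, not any engine's actual read or write path; the helper names `nan_to_null` and `null_to_nan` are made up for this example, and `None` stands in for NULL.

```python
import math
from typing import List, Optional


def nan_to_null(values: List[Optional[float]]) -> List[Optional[float]]:
    """Replace every NaN with None (NULL) before writing; other values pass through."""
    return [None if v is not None and math.isnan(v) else v for v in values]


def null_to_nan(values: List[Optional[float]]) -> List[float]:
    """Replace every None (NULL) with NaN after reading."""
    return [math.nan if v is None else v for v in values]


column = [1.5, math.nan, -2.0, None]
written = nan_to_null(column)      # what would be stored in the file
restored = null_to_nan(written)    # what a NaN-based DataFrame would see
```

Note that the round trip is lossy when a column holds both genuine NULLs and NaNs: after `nan_to_null` the two can no longer be told apart.
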
>
> Of course, your argument also goes in the direction of pragmatism: If a
> large part of the data science world uses NaNs to encode missing values,
> then maybe Parquet should accept this de facto standard rather than
> fighting it. That is indeed a valid point. Its weight is debatable, and my
> personal conclusion is that it's still not worth it, as you can transcode
> between NULLs and NaNs, but I do agree with its validity.
>
> > Since the proposal phrases it as a goal to work "regardless of how they
> > order NaN w.r.t. other values" this statement feels out-of-place to me.
> > Most hardware and most people don't care about total ordering and needing
> > to take it into account while filtering using statistics seems like
> > preferring the special case instead of the common case. Almost no one
> > filters for specific NaN value bit-patterns. SQL engines that don't have
> > IEEE total ordering as their default ordering for floats will also need
> > to do more special handling for this.
>
> I disagree with the conclusion this statement draws. The current behavior,
> and nan_counts without total ordering, pose a real problem here, even for
> engines that don't care about bit patterns. I do agree that most database
> engines, including the one I'm working on, do not care about bit patterns
> and/or sign bits. However, how can our database engine know whether the
> writer of a Parquet file saw it the same way? It can't. Therefore, it
> cannot know whether a writer, for example, ordered NaNs before or after all
> other numbers, or maybe ordered them by sign bit. So, if our database
> engine now sees a float column in sorting columns, it cannot apply any
> optimization without a lot of special casing, as it doesn't know whether
> NaNs will be before all other values, after all other values, or maybe
> both, depending on sign bit.
> It could apply contrived logic that tries to infer where NaNs were placed
> from the NaN counts of the first and last page, but doing so would take a
> lot of ugly code that also feels like it is in the wrong place. I.e., I
> don't want to need to load pages or the page index just to reason about a
> sort order.
>
> > SQL engines that don't have
> > IEEE total ordering as their default ordering for floats will also need
> > to do more special handling for this.
>
> This code, which I would indeed need to write for our engine, is
> comparatively trivial. Simply choose the largest possible NaN bit pattern
> as the comparison value for upper-bound filtering on NaN, and the smallest
> possible NaN bit pattern for lower bounds. It's not more than a few lines
> of code that check whether a filter value is NaN and then replace it with
> the highest/lowest NaN bit pattern. It is similarly trivial to the special
> casing I need to do with nan_counts, and it is way more trivial than the
> extra code I would need to write for sorting columns, as described above.
>
> > From a Polars perspective, having a `nan_count` and defining what
> > happens to the `min` and `max` statistics when a page contains only NaNs
> > is enough to allow for all predicate filtering. I think, but correct me
> > if I am wrong, this is also enough for all SQL engines that don't use
> > total ordering.
>
> It's not fully enough, as described above. Sorting columns would still not
> work properly.
>
> > As for ways forward, I propose merging the `nan_count` and `sort ordering`
> > proposals into one to make one proposal
>
> Note that the initial reason for proposing IEEE total order was that people
> in the discussion threads found nan_counts to be too complex and too much
> of an undeserving special case (re-read the discussion in the initial PR
> <https://github.com/apache/parquet-format/pull/196> to see the
> rationales).
> So merging both together would go totally against the spirit of why IEEE
> total order was proposed.
> While it has further upsides, the main reason was indeed to *not have*
> nan_counts. If the proposal now even went as far as positive and negative
> NaN counts (i.e., even more complexity), this would go 180 degrees in the
> opposite direction of why people wanted total order in the first place.
>
> Cheers,
> Jan
>
> On Thu, Jul 31, 2025 at 23:23, Gijs Burghoorn wrote:
>
> > Hello Jan and others,
> >
> > First, let me preface by saying I am quite new here. So I apologize if
> > there is some other better way to bring up these concerns. I understand
> > it is very annoying to come in at the 11th hour and start bringing up a
> > bunch of concerns, but I would also like this to be done right. A
> > colleague of mine brought up some concerns and alternative approaches in
> > the GitHub thread; I will file some of the concerns here as a response.
> >
> > > Treating NaNs so specially is giving them attention they don't deserve.
> > > Most data sets do not contain NaNs. If a use case really requires them
> > > and needs filtering to ignore them, they can store NULL instead, or
> > > encode them differently. I would prefer the average case over the
> > > special case here.
> >
> > NaNs are less common in the SQL world than in the DataFrame world where
> > NaNs were used for a long time to represent missing values. They still
> > exist with different canonical representations and different sign bits. I
> > agree it might not be correct semantically, but sadly that is the world
> > we deal with. NumPy and Numba do not have missing data functionality,
> > people use NaNs there, and people definitely use that in their analytical
> > dataflows. Another point that was brought up in the GH discussion was
> > "what about infinity? You could argue that having infinity in statistics
> > is similarly unuseful as it's too wide of a bound".
> > I would argue that infinity is very different, as there is no discussion
> > on what the ordering or pattern of infinity is. Everyone agrees that
> > `min(1.0, inf, -inf) == -inf` and each infinity only has a single bit
> > pattern.
> >
> > > It gives a defined order to every bit pattern and thus yields a total
> > > order, mathematically speaking, which has value by itself. With NaN
> > > counts, it was still undefined how different bit patterns of NaNs were
> > > supposed to be ordered, whether NaN was allowed to have a sign bit,
> > > etc., risking that different engines could come to different results
> > > while filtering or sorting values within a file.
> >
> > Since the proposal phrases it as a goal to work "regardless of how they
> > order NaN w.r.t. other values" this statement feels out-of-place to me.
> > Most hardware and most people don't care about total ordering and needing
> > to take it into account while filtering using statistics seems like
> > preferring the special case instead of the common case. Almost no one
> > filters for specific NaN value bit-patterns. SQL engines that don't have
> > IEEE total ordering as their default ordering for floats will also need
> > to do more special handling for this.
> >
> > I also agree with my colleague that taking an approach that is 50% of the
> > way there will make the barrier to improving it later on, to what it
> > actually should be, much higher.
> >
> > As for ways forward, I propose merging the `nan_count` and `sort ordering`
> > proposals into one to make one proposal, as they are linked together, and
> > moving forward with one without knowing what will happen to the other
> > seems unwise. From a Polars perspective, having a `nan_count` and defining
> > what happens to the `min` and `max` statistics when a page contains only
> > NaNs is enough to allow for all predicate filtering.
> > I think, but correct me if I am wrong, this is also enough for all SQL
> > engines that don't use total ordering. But if you want to be impartial to
> > the engine's floating-point ordering and allow engines with total
> > ordering to do inequality filters when `nan_count > 0`, you would need a
> > `positive_nan_count` and a `negative_nan_count`. I understand the
> > downside with Thrift complexity, but introducing another sort order also
> > adds complexity, just in a different place.
> >
> > I would really like to see this move forward, so I hope these concerns
> > help move it towards a solution that works for everyone.
> >
> > Kind regards,
> > Gijs
> >
> > On Thu, Jul 31, 2025 at 7:46 PM Andrew Lamb wrote:
> >
> > > I would also be in favor of starting a vote
> > >
> > > On Thu, Jul 31, 2025 at 11:23 AM Jan Finis wrote:
> > >
> > > > As the author of both the IEEE754 total order
> > > > <https://github.com/apache/parquet-format/pull/221> PR and the
> > > > earlier PR that basically proposed `nan_count`
> > > > <https://github.com/apache/parquet-format/pull/196>, my current vote
> > > > would be for IEEE754 total order.
> > > > Consequently, I would like to request a formal vote for the PR
> > > > introducing IEEE754 total order
> > > > (https://github.com/apache/parquet-format/pull/221), if that is
> > > > possible.
> > > >
> > > > My rationales:
> > > >
> > > > - It's conceptually simpler. It's easier to explain. It's based on an
> > > >   IEEE-standardized order predicate.
> > > > - There are already multiple implementations showing feasibility.
> > > >   This will likely make the adoption quicker.
> > > > - It gives a defined order to every bit pattern and thus yields a
> > > >   total order, mathematically speaking, which has value by itself.
> > > >   With NaN counts, it was still undefined how different bit patterns
> > > >   of NaNs were supposed to be ordered, whether NaN was allowed to
> > > >   have a sign bit, etc., risking that different engines could come to
> > > >   different results while filtering or sorting values within a file.
> > > > - It also solves sort order completely. With nan_counts only, it is
> > > >   still undefined whether NaNs should be sorted before or after all
> > > >   values (or both, depending on sign bit), so any file including NaNs
> > > >   could not really leverage sort order without being ambiguous.
> > > > - It's less complex in Thrift. Having fields that only apply to a
> > > >   handful of data types is somewhat weird. If every type did this, we
> > > >   would have a plethora of non-generic fields in Thrift.
> > > > - Treating NaNs so specially is giving them attention they don't
> > > >   deserve. Most data sets do not contain NaNs. If a use case really
> > > >   requires them and needs filtering to ignore them, they can store
> > > >   NULL instead, or encode them differently. I would prefer the
> > > >   average case over the special case here.
> > > > - The majority of the people discussing this so far seem to favor
> > > >   total order.
> > > >
> > > > Cheers,
> > > > Jan
> > > >
> > > > On Sat, Jul 26, 2025 at 17:38, Gang Wu wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > As this discussion has been open for more than two years, I'd like
> > > > > to bump this thread again to give a progress update and collect
> > > > > feedback.
> > > > >
> > > > > *Background*
> > > > > • Today Parquet's min/max stats and page index omit NaNs entirely.
> > > > > • Engines can't safely prune floating-point values because they
> > > > >   know nothing about NaNs.
> > > > > • Column index is disabled if any page contains only NaNs.
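
The rationale quoted above rests on every float bit pattern having a defined position. As a rough sketch of what that means (not taken from any of the PoC implementations), the Python below maps a float64 to an integer key whose natural ordering matches the IEEE 754 totalOrder predicate. The helper names `total_order_key` and `from_bits` are made up for this example.

```python
import struct


def total_order_key(x: float) -> int:
    """Map a float64 to an integer whose ordering matches IEEE 754 totalOrder."""
    # Reinterpret the 8 bytes of the double as a signed 64-bit integer.
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # Negative floats sort in reverse of their raw bits, so flip the 63
    # non-sign bits; non-negative floats already sort correctly.
    if bits < 0:
        bits ^= 0x7FFF_FFFF_FFFF_FFFF
    return bits


def from_bits(pattern: int) -> float:
    """Build a float64 from an explicit 64-bit pattern (to control NaN sign)."""
    (value,) = struct.unpack("<d", struct.pack("<Q", pattern))
    return value


neg_nan = from_bits(0xFFF8_0000_0000_0000)  # quiet NaN with the sign bit set
pos_nan = from_bits(0x7FF8_0000_0000_0000)  # quiet NaN with the sign bit clear
values = [pos_nan, 1.0, 0.0, float("inf"), -0.0, float("-inf"), neg_nan, -1.0]
ordered = sorted(values, key=total_order_key)
# ordered: -NaN, -inf, -1.0, -0.0, +0.0, 1.0, +inf, +NaN
```

With such a key, NaNs with the sign bit set sort before `-inf`, NaNs with the sign bit clear sort after `+inf`, and `-0.0` sorts before `+0.0`, so a reader and a writer cannot disagree about where any value belongs.
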
> > > > >
> > > > > There are two active proposals, as below:
> > > > >
> > > > > *Proposal A - IEEE754TotalOrder* (from the PR [1])
> > > > > • Define a new ColumnOrder to include +0, -0 and all NaN
> > > > >   bit-patterns.
> > > > > • Stats and column index store NaNs if they appear.
> > > > > • Three PoC impls are ready: arrow-rs [2], duckdb [3] and
> > > > >   parquet-java [4].
> > > > > • For more context on this approach, please refer to the discussion
> > > > >   in [5].
> > > > >
> > > > > *Proposal B - add nan_count* (from a comment [6] to [1])
> > > > > • Add `nan_count` to stats and a `nan_counts` list to column index.
> > > > > • For all-NaN cases, write NaN to min/max and use nan_count to
> > > > >   distinguish.
> > > > >
> > > > > Both solutions have pros and cons but are way better than the
> > > > > status quo today.
> > > > > Please share your thoughts on the two proposals above, or maybe
> > > > > come up with better alternatives. We need consensus on one proposal
> > > > > to move forward.
> > > > >
> > > > > [1] https://github.com/apache/parquet-format/pull/221
> > > > > [2] https://github.com/apache/arrow-rs/pull/7408
> > > > > [3] https://github.com/duckdb/duckdb/compare/main...Mytherin:duckdb:ieeeorder
> > > > > [4] https://github.com/apache/parquet-java/pull/3191
> > > > > [5] https://github.com/apache/parquet-format/pull/196
> > > > > [6] https://github.com/apache/parquet-format/pull/221#issuecomment-2931376077
> > > > >
> > > > > Best,
> > > > > Gang
> > > > >
> > > > > On Tue, Mar 28, 2023 at 4:22 PM Jan Finis wrote:
> > > > >
> > > > > > Dear contributors,
> > > > > >
> > > > > > My PR has now gathered comments for a week and the gist of all
> > > > > > open issues is the question of how to encode pages/column chunks
> > > > > > that contain only NaNs.
> > > > > > There are different suggestions and I don't see one common
> > > > > > favorite yet.
> > > > > >
> > > > > > I have outlined three alternatives for how we can handle these,
> > > > > > and I want us to reach a conclusion here so I can update my PR
> > > > > > accordingly and move on with it. As this is my first contribution
> > > > > > to Parquet, I don't know the decision processes here. Do we vote?
> > > > > > Is there a single decision maker or a group of decision makers?
> > > > > > *Please let me know how to come to a conclusion here; what are
> > > > > > the next steps?*
> > > > > >
> > > > > > For reference, here are the three alternatives I pointed out. You
> > > > > > can find a detailed description of their pros and cons in my
> > > > > > comment:
> > > > > >
> > > > > > https://github.com/apache/parquet-format/pull/196#issuecomment-1486416762
> > > > > >
> > > > > > 1. My initial proposal, i.e., encoding only-NaN pages by
> > > > > >    min=max=NaN.
> > > > > > 2. Adding `num_values` to the ColumnIndex, to make it symmetric
> > > > > >    with Statistics in pages & `ColumnMetaData` and to enable the
> > > > > >    computation `num_values - null_count - nan_count == 0`
> > > > > > 3. Adding a `nan_pages` bool list to the column index, which
> > > > > >    indicates whether a page contains only NaNs
> > > > > >
> > > > > > Cheers
> > > > > > Jan Finis
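
The check behind alternative 2 in the list above could look roughly like the following Python sketch. `PageCounts` is a hypothetical stand-in for per-page counters, not the actual Thrift `ColumnIndex` layout. The extra guard against all-NULL pages is an assumption added here, since the bare expression `num_values - null_count - nan_count == 0` also holds for a page that contains only NULLs.

```python
from dataclasses import dataclass


@dataclass
class PageCounts:
    """Hypothetical per-page counters; not the actual Thrift ColumnIndex layout."""
    num_values: int  # assumed to count all values in the page, including NULLs
    null_count: int
    nan_count: int


def is_only_nan_page(page: PageCounts) -> bool:
    """True when the page has at least one non-NULL value and all of them are NaN."""
    non_null = page.num_values - page.null_count
    # A page without any non-NULL value is an all-NULL page, not an only-NaN
    # page (this guard is an assumption of the sketch, see above).
    return non_null > 0 and non_null - page.nan_count == 0
```

For instance, `is_only_nan_page(PageCounts(num_values=10, null_count=3, nan_count=7))` holds, while a page with `nan_count=6` still contains one ordinary value and could carry regular min/max bounds.
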
