Re: [DISCUSS] Clarifying NaN bit preservation in floating-point encodings

Gang Wu Sun, 21 Jun 2026 22:07:02 -0700

I checked Arrow C++ and arrow-rs. Please correct me if I am wrong.

For value preservation, both implementations appear to preserve the raw
FLOAT/DOUBLE bit pattern in Parquet value encodings. PLAIN and
BYTE_STREAM_SPLIT write the raw bytes. Persisted dictionary values are also
written as raw primitive bytes, and Bloom filters hash the raw value bytes.
arrow-rs is especially explicit here: its dictionary interner compares
values by bytes so different NaN payloads can be interned separately.
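
To make the distinction concrete, here is a minimal Java sketch. It is not
taken from Arrow C++, arrow-rs, or parquet-java; the class name, method names,
and the example payload value are made up for illustration. It contrasts a raw
bit view of a float with a canonicalizing view, and interning by boxed Java
equality with interning by raw bits.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration class; not part of any Parquet implementation.
final class NanBits {
    // A quiet NaN with a non-default payload (example value only).
    static final int PAYLOAD_NAN_BITS = 0x7fc00001;

    // Raw view: keeps the NaN sign and payload bits as stored.
    static int rawBits(float v) {
        return Float.floatToRawIntBits(v);
    }

    // Canonicalizing view: every NaN is reported as 0x7fc00000.
    static int canonicalBits(float v) {
        return Float.floatToIntBits(v);
    }

    // Interning through boxed Floats: Float.equals/hashCode treat all NaNs as equal.
    static int distinctByBoxedEquality(float... values) {
        Set<Float> seen = new HashSet<>();
        for (float v : values) {
            seen.add(v);
        }
        return seen.size();
    }

    // Interning by raw bits: different NaN payloads stay separate entries.
    static int distinctByRawBits(float... values) {
        Set<Integer> seen = new HashSet<>();
        for (float v : values) {
            seen.add(Float.floatToRawIntBits(v));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        float canonical = Float.NaN;
        float payload = Float.intBitsToFloat(PAYLOAD_NAN_BITS);
        System.out.println(Integer.toHexString(rawBits(payload)));
        System.out.println(Integer.toHexString(canonicalBits(payload)));
        System.out.println(distinctByBoxedEquality(canonical, payload));
        System.out.println(distinctByRawBits(canonical, payload));
    }
}
```

A byte-based interner corresponds to `distinctByRawBits`, while a dictionary
expanded into a boxed Java set behaves like `distinctByBoxedEquality`.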


For filter semantics, I did not find a common "all NaNs are equal" rule for
ordinary predicates. Arrow C++ compute comparisons use normal floating-point
operators, so `NaN == NaN` is false. The arrow-rs comparison kernels use
IEEE totalOrder / bit equality, so equality distinguishes NaN payload bits.
The parquet-rs RowFilter API just evaluates a user-provided Arrow predicate
after decoding; it does not add a separate Parquet-level NaN interpretation.
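
The three equality notions in play can be sketched in Java as follows. This
is only an illustration under my own naming (`NanCompare` and its methods are
hypothetical); it does not reproduce the actual kernels of any implementation.

```java
// Hypothetical illustration class; names are not from any library.
final class NanCompare {
    // IEEE 754 comparison operator: false whenever either operand is NaN.
    static boolean ieeeEquals(double a, double b) {
        return a == b;
    }

    // Java compare semantics: all NaNs are equal to each other
    // (and 0.0 is distinguished from -0.0).
    static boolean javaEquals(double a, double b) {
        return Double.compare(a, b) == 0;
    }

    // Bit equality: NaNs match only if sign and payload bits are identical.
    static boolean bitEquals(double a, double b) {
        return Double.doubleToRawLongBits(a) == Double.doubleToRawLongBits(b);
    }

    public static void main(String[] args) {
        double canonical = Double.NaN;
        // A quiet NaN with a non-default payload (example value only).
        double payload = Double.longBitsToDouble(0x7ff8000000000001L);
        System.out.println(ieeeEquals(canonical, canonical));
        System.out.println(javaEquals(canonical, payload));
        System.out.println(bitEquals(canonical, payload));
    }
}
```

Roughly, `ieeeEquals` matches the "normal floating-point operators" behavior,
`bitEquals` matches the bit-equality behavior, and `javaEquals` matches Java
comparison semantics.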

Both implementations treat min/max statistics conservatively for NaNs:
NaNs are skipped/ignored for floating-point min/max, and min/max should not
be used to prove whether NaN values are present.
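
A minimal sketch of what "skip NaNs for min/max" means, assuming a simple
accumulator over doubles. The class `NanAwareMinMax` and its `nanCount` field
are hypothetical and only for illustration; signed-zero handling is
deliberately left out.

```java
// Hypothetical accumulator; not the statistics code of any implementation.
final class NanAwareMinMax {
    boolean hasMinMax = false; // stays false if only NaNs (or nothing) were seen
    double min;
    double max;
    long nanCount = 0;         // tracked separately, never folded into min/max

    void update(double v) {
        if (Double.isNaN(v)) {
            nanCount++;        // NaNs are counted but do not affect min/max
            return;
        }
        if (!hasMinMax) {
            min = v;
            max = v;
            hasMinMax = true;
            return;
        }
        if (v < min) {
            min = v;
        }
        if (v > max) {
            max = v;
        }
    }

    static NanAwareMinMax of(double... values) {
        NanAwareMinMax stats = new NanAwareMinMax();
        for (double v : values) {
            stats.update(v);
        }
        return stats;
    }

    public static void main(String[] args) {
        NanAwareMinMax s = of(1.0, Double.NaN, -2.5);
        System.out.println(s.min + " " + s.max + " " + s.nanCount);
    }
}
```

Because NaNs never reach `min` or `max`, a reader looking only at those two
values cannot tell whether the chunk contains NaNs.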

So my reading is that preserving raw NaN bits in Parquet values is consistent
with other implementations, but ordinary filter semantics for NaNs are not
uniform across implementations and should be clarified separately. I would
not special-case `eq(col, NaN)` as "any NaN"; if we need that behavior, it
should be expressed as an explicit `isNaN` predicate.
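
As a rough sketch of that split, assuming a toy predicate API built on
`java.util.function.DoublePredicate` (the `NanPredicates` class and its
`eq`, `isNaN`, and `count` methods are hypothetical, not parquet-java's filter
API):

```java
import java.util.function.DoublePredicate;

// Hypothetical toy predicates; not an existing filter API.
final class NanPredicates {
    // Ordinary equality with IEEE semantics: never matches a NaN value.
    static DoublePredicate eq(double target) {
        return v -> v == target;
    }

    // Explicit NaN test: matches every NaN regardless of sign or payload.
    static DoublePredicate isNaN() {
        return v -> Double.isNaN(v);
    }

    // Counts the values in a decoded column that satisfy the predicate.
    static long count(double[] column, DoublePredicate predicate) {
        long matches = 0;
        for (double v : column) {
            if (predicate.test(v)) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Second NaN carries a non-default payload (example value only).
        double[] column = {1.0, Double.NaN, Double.longBitsToDouble(0x7ff8000000000001L)};
        System.out.println(count(column, eq(Double.NaN)));
        System.out.println(count(column, isNaN()));
    }
}
```

With this shape, "any NaN" is requested explicitly through `isNaN()` and
`eq` needs no NaN special case.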

On Thu, Jun 18, 2026 at 4:12 PM Gábor Szádovszky <[email protected]> wrote:
>
> Hi Gang,
>
> Thanks for raising this. I agree that Parquet writers should preserve the
> exact FLOAT and DOUBLE bit pattern supplied by the application, including
> NaN sign and payload bits.
>
> However, I think we also need to clarify filter semantics for NaN values.
>
> Today, parquet-java canonicalizes non-canonical NaN bit patterns for
> physical FLOAT and DOUBLE values. This affects PLAIN, BYTE_STREAM_SPLIT,
> and dictionary encoding, which are the FLOAT/DOUBLE value encodings
> parquet-java can write today. Record-level filters also use Java comparison
> semantics, so eq(col, NaN) treats all NaN values as equal rather than
> comparing raw bits. Min/max statistics are not used for NaN values with
> TYPE_DEFINED_ORDER.
>
> Bloom filters are different: parquet-java hashes FLOAT and DOUBLE values
> using raw bits. This means existing parquet-java files may already have
> inconsistent behavior for non-canonical NaNs: the data page may contain a
> canonical NaN while the Bloom filter was built from the original raw value.
>
> After this change, parquet-java may write NaN bit patterns that older
> parquet-java versions would have canonicalized before writing. That seems
> correct for value preservation, but the expected behavior of filters should
> be specified.
>
> I see two possible directions:
>
> A. Ordinary filters do not distinguish NaN bit patterns. We could add
> explicit isNaN / isNotNaN predicates if needed.
>
> B. Ordinary filters may distinguish NaN bit patterns, but then statistics,
> Bloom, dictionary, and record-level filters all need consistent behavior,
> possibly depending on TYPE_DEFINED_ORDER vs IEEE_754_TOTAL_ORDER.
>
> How do other implementations handle this today? Do they preserve NaN
> payload bits, and do their filters treat all NaNs as equal or compare raw
> bit patterns?
>
> Cheers,
> Gabor
>
> Gang Wu <[email protected]> wrote (on Thu, Jun 18, 2026, 6:04):
>
> > Hi all,
> >
> > While working on IEEE 754 total ordering and nan_count in parquet-java [1],
> > Gabor pointed out one area that seems underspecified in the format.
> >
> > The spec describes floating-point ordering and statistics behavior, but it
> > does not seem to clearly say whether FLOAT/DOUBLE value encodings should
> > preserve raw NaN sign/payload bits. In parquet-java today, PLAIN,
> > BYTE_STREAM_SPLIT, and dictionary encoding canonicalize NaN values when
> > writing encoded values. DictionaryFilter also expands FLOAT/DOUBLE
> > dictionaries into boxed Java sets, which collapse NaNs by Java equality/hash
> > semantics. Bloom filters are different: parquet-java already hashes
> > FLOAT/DOUBLE values using raw bits.
> >
> > The PR changes PLAIN, BYTE_STREAM_SPLIT, and dictionary encoding to preserve
> > raw NaN bits. My reasoning was that IEEE_754_TOTAL_ORDER distinguishes NaN
> > bit patterns, so preserving raw bits seems necessary if we want that order
> > to be meaningful for encoded values, not only for statistics/comparators.
> > However, this is a visible behavior change: dictionary encoding may persist
> > distinct NaN payloads as distinct dictionary values instead of one canonical
> > NaN.
> >
> > I think we should clarify a few questions:
> > - Should FLOAT/DOUBLE encodings preserve raw NaN bits?
> > - Should dictionary encoding preserve distinct NaN payloads as distinct
> >   dictionary values?
> > - Should this depend on TYPE_DEFINED_ORDER vs IEEE_754_TOTAL_ORDER?
> > - What should dictionary filters and bloom filters assume for NaN values?
> >
> > My inclination is that FLOAT/DOUBLE value encodings should always preserve
> > raw NaN bits. So my PR should be regarded as a bug fix. But since this
> > changes parquet-java behavior, I wanted to discuss it explicitly.
> >
> > What do others think?
> >
> > [1] https://github.com/apache/parquet-java/pull/3393
> >
> > Best,
> > Gang
> >
