Hi Gang,

Thanks for raising this. I agree that Parquet writers should preserve the
exact FLOAT and DOUBLE bit pattern supplied by the application, including
NaN sign and payload bits.

However, I think we also need to clarify filter semantics for NaN values.

Today, parquet-java canonicalizes non-canonical NaN bit patterns for
physical FLOAT and DOUBLE values. This affects PLAIN, BYTE_STREAM_SPLIT,
and dictionary encoding, which are the FLOAT/DOUBLE value encodings
parquet-java can write today. Record-level filters also use Java comparison
semantics, so eq(col, NaN) treats all NaN values as equal rather than
comparing raw bits. Min/max statistics are not used for NaN values with
TYPE_DEFINED_ORDER.
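
To make the difference between Java comparison semantics and raw-bit
comparison concrete, here is a minimal sketch using only the JDK. The class
name, method names, and the two NaN bit patterns are hypothetical and chosen
for illustration; this is not parquet-java code.

```java
import java.util.HashSet;
import java.util.Set;

final class NanEqualityDemo {
    // Two quiet NaNs that differ only in payload (illustrative bit patterns).
    static final float CANONICAL_NAN = Float.intBitsToFloat(0x7fc00000);
    static final float PAYLOAD_NAN = Float.intBitsToFloat(0x7fc00001);

    // Java comparison semantics: Float.compare treats all NaNs as equal.
    static boolean equalByJavaComparison(float a, float b) {
        return Float.compare(a, b) == 0;
    }

    // Raw-bit semantics: sign and payload bits take part in the comparison.
    static boolean equalByRawBits(float a, float b) {
        return Float.floatToRawIntBits(a) == Float.floatToRawIntBits(b);
    }

    // Boxed Float equals/hashCode are based on floatToIntBits, which
    // canonicalizes NaN, so a HashSet collapses distinct NaN payloads.
    static int distinctBoxedValues(float... values) {
        Set<Float> boxed = new HashSet<>();
        for (float v : values) {
            boxed.add(v);
        }
        return boxed.size();
    }

    public static void main(String[] args) {
        System.out.println(equalByJavaComparison(CANONICAL_NAN, PAYLOAD_NAN));
        System.out.println(equalByRawBits(CANONICAL_NAN, PAYLOAD_NAN));
        System.out.println(distinctBoxedValues(CANONICAL_NAN, PAYLOAD_NAN));
    }
}
```

The same collapse applies to any filter that expands values into boxed Java
sets, which is the DictionaryFilter behavior described in the quoted message.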

Bloom filters are different: parquet-java hashes FLOAT and DOUBLE values
using raw bits. This means existing parquet-java files may already have
inconsistent behavior for non-canonical NaNs: the data page may contain a
canonical NaN while the Bloom filter was built from the original raw value.
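
A rough sketch of that mismatch, under loud assumptions: the Bloom filter is
modeled as an exact set of raw bits (a real Bloom filter hashes the bits and
can also return false positives), and the write path is reduced to
Float.floatToIntBits, which canonicalizes NaN. All names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

final class BloomMismatchSketch {
    // Assumed write path: NaN is canonicalized before the value is encoded.
    static int bitsWrittenToPage(float v) {
        return Float.floatToIntBits(v);
    }

    // Assumed Bloom path: the original raw bits are hashed.
    static int bitsHashedForBloom(float v) {
        return Float.floatToRawIntBits(v);
    }

    // Writes one value, then probes the stand-in filter with the value
    // that a reader would get back from the data page.
    static boolean probeFindsStoredValue(float original) {
        Set<Integer> bloomStandIn = new HashSet<>();
        bloomStandIn.add(bitsHashedForBloom(original));
        float stored = Float.intBitsToFloat(bitsWrittenToPage(original));
        return bloomStandIn.contains(bitsHashedForBloom(stored));
    }

    public static void main(String[] args) {
        float payloadNaN = Float.intBitsToFloat(0x7fc00001);
        System.out.println(probeFindsStoredValue(1.5f));
        System.out.println(probeFindsStoredValue(payloadNaN));
    }
}
```

In this model an ordinary value is found, while a NaN with a non-canonical
payload is not, because the filter and the data page disagree on its bits.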

After this change, parquet-java may write NaN bit patterns that older
parquet-java versions would have canonicalized before writing. That seems
correct for value preservation, but the expected behavior of filters should
be specified.

I see two possible directions:

A. Ordinary filters do not distinguish NaN bit patterns. We could add
explicit isNaN / isNotNaN predicates if needed.

B. Ordinary filters may distinguish NaN bit patterns, but then statistics,
Bloom filters, dictionary filters, and record-level filters all need
consistent behavior, possibly depending on TYPE_DEFINED_ORDER vs
IEEE_754_TOTAL_ORDER.
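
As a rough illustration of the two directions: under A, a filter only needs
a NaN test that ignores sign and payload; under B, it needs a comparison that
distinguishes bit patterns, for example IEEE 754 totalOrder. The sketch below
uses a common bit trick for totalOrder on FLOAT. The class and method names
are hypothetical and this is not an existing parquet-java API.

```java
final class NanFilterDirections {
    // Direction A: one predicate that matches every NaN bit pattern.
    static boolean isNaN(float v) {
        return Float.isNaN(v);
    }

    // Direction B: compare by IEEE 754 totalOrder, which orders
    // -NaN < -Infinity < ... < -0.0 < +0.0 < ... < +Infinity < +NaN
    // and distinguishes NaN payloads.
    static int totalOrderCompare(float a, float b) {
        return Integer.compare(totalOrderKey(a), totalOrderKey(b));
    }

    // Maps raw bits to a signed int whose natural order is totalOrder:
    // for negative values, flip all bits except the sign bit.
    static int totalOrderKey(float v) {
        int bits = Float.floatToRawIntBits(v);
        return bits ^ ((bits >> 31) & 0x7fffffff);
    }

    public static void main(String[] args) {
        float nanA = Float.intBitsToFloat(0x7fc00000);
        float nanB = Float.intBitsToFloat(0x7fc00001);
        System.out.println(isNaN(nanA) && isNaN(nanB));
        System.out.println(totalOrderCompare(nanA, nanB));
        System.out.println(totalOrderCompare(-0.0f, 0.0f));
    }
}
```

Under B, every filter path (statistics, Bloom, dictionary, record-level)
would have to agree on a comparison like this one for the results to be
consistent.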

How do other implementations handle this today? Do they preserve NaN
payload bits, and do their filters treat all NaNs as equal or compare raw
bit patterns?

Cheers,
Gabor

Gang Wu <[email protected]> wrote (on Thu, Jun 18, 2026, 6:04):

> Hi all,
>
> While working on IEEE 754 total ordering and nan_count in parquet-java [1],
> Gabor pointed out one area that seems underspecified in the format.
>
> The spec describes floating-point ordering and statistics behavior, but it
> does not seem to clearly say whether FLOAT/DOUBLE value encodings should
> preserve raw NaN sign/payload bits. In parquet-java today, PLAIN,
> BYTE_STREAM_SPLIT, and dictionary encoding canonicalize NaN values when
> writing encoded values. DictionaryFilter also expands FLOAT/DOUBLE
> dictionaries into boxed Java sets, which collapse NaNs by Java
> equality/hash semantics. Bloom filters are different: parquet-java already
> hashes FLOAT/DOUBLE values using raw bits.
>
> The PR changes PLAIN, BYTE_STREAM_SPLIT, and dictionary encoding to
> preserve raw NaN bits. My reasoning was that IEEE_754_TOTAL_ORDER
> distinguishes NaN bit patterns, so preserving raw bits seems necessary if
> we want that order to be meaningful for encoded values, not only for
> statistics/comparators. However, this is a visible behavior change:
> dictionary encoding may persist distinct NaN payloads as distinct
> dictionary values instead of one canonical NaN.
>
> I think we should clarify a few questions:
> - Should FLOAT/DOUBLE encodings preserve raw NaN bits?
> - Should dictionary encoding preserve distinct NaN payloads as distinct
>   dictionary values?
> - Should this depend on TYPE_DEFINED_ORDER vs IEEE_754_TOTAL_ORDER?
> - What should dictionary filters and bloom filters assume for NaN values?
>
> My inclination is that FLOAT/DOUBLE value encodings should always preserve
> raw NaN bits. So my PR should be regarded as a bug fix. But since this
> changes parquet-java behavior, I wanted to discuss it explicitly.
>
> What do others think?
>
> [1] https://github.com/apache/parquet-java/pull/3393
>
> Best,
> Gang
>
