Hi Gang,

Thanks for raising this. I agree that Parquet writers should preserve the exact FLOAT and DOUBLE bit pattern supplied by the application, including NaN sign and payload bits.
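
To make "preserving the bit pattern" concrete, here is a minimal Java sketch of the difference between a raw and a canonicalizing round trip. It is not parquet-java code: the class name, the method names, and the example NaN value 0xFFC00001 are all hypothetical and only serve as an illustration.

```java
// Illustration only; NaNBitsSketch and RAW_NAN are hypothetical names,
// not part of parquet-java.
class NaNBitsSketch {
    // A quiet NaN with the sign bit set and a non-zero payload (assumed example value).
    static final int RAW_NAN = 0xFFC00001;

    // Round trip that keeps sign and payload bits of a NaN.
    static int rawRoundTrip(int bits) {
        return Float.floatToRawIntBits(Float.intBitsToFloat(bits));
    }

    // Round trip that collapses every NaN to the canonical pattern 0x7fc00000.
    static int canonicalRoundTrip(int bits) {
        return Float.floatToIntBits(Float.intBitsToFloat(bits));
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(rawRoundTrip(RAW_NAN)));
        System.out.println(Integer.toHexString(canonicalRoundTrip(RAW_NAN)));
    }
}
```

On typical 64-bit JVMs the raw round trip keeps the bits of a quiet NaN, although the Javadoc for Float.intBitsToFloat notes that some platforms may not preserve every NaN pattern.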

However, I think we also need to clarify filter semantics for NaN values.

Today, parquet-java canonicalizes non-canonical NaN bit patterns for physical FLOAT and DOUBLE values. This affects PLAIN, BYTE_STREAM_SPLIT, and dictionary encoding, which are the FLOAT/DOUBLE value encodings parquet-java can write today. Record-level filters also use Java comparison semantics, so eq(col, NaN) treats all NaN values as equal rather than comparing raw bits. Min/max statistics are not used for NaN values with TYPE_DEFINED_ORDER.

Bloom filters are different: parquet-java hashes FLOAT and DOUBLE values using raw bits. This means existing parquet-java files may already have inconsistent behavior for non-canonical NaNs: the data page may contain a canonical NaN while the Bloom filter was built from the original raw value.

After this change, parquet-java may write NaN bit patterns that older parquet-java versions would have canonicalized before writing. That seems correct for value preservation, but the expected behavior of filters should be specified.

I see two possible directions:

A. Ordinary filters do not distinguish NaN bit patterns. We could add explicit isNaN / isNotNaN predicates if needed.

B. Ordinary filters may distinguish NaN bit patterns, but then statistics, Bloom, dictionary, and record-level filters all need consistent behavior, possibly depending on TYPE_DEFINED_ORDER vs IEEE_754_TOTAL_ORDER.

How do other implementations handle this today? Do they preserve NaN payload bits, and do their filters treat all NaNs as equal or compare raw bit patterns?

Cheers,
Gabor

Gang Wu <[email protected]> wrote (on Thu, Jun 18, 2026, 6:04):

> Hi all,
>
> While working on IEEE 754 total ordering and nan_count in parquet-java [1],
> Gabor pointed out one area that seems underspecified in the format.
>
> The spec describes floating-point ordering and statistics behavior, but it
> does not seem to clearly say whether FLOAT/DOUBLE value encodings should
> preserve raw NaN sign/payload bits. In parquet-java today, PLAIN,
> BYTE_STREAM_SPLIT, and dictionary encoding canonicalize NaN values when
> writing encoded values. DictionaryFilter also expands FLOAT/DOUBLE
> dictionaries into boxed Java sets, which collapse NaNs by Java
> equality/hash semantics. Bloom filters are different: parquet-java already
> hashes FLOAT/DOUBLE values using raw bits.
>
> The PR changes PLAIN, BYTE_STREAM_SPLIT, and dictionary encoding to preserve
> raw NaN bits. My reasoning was that IEEE_754_TOTAL_ORDER distinguishes NaN
> bit patterns, so preserving raw bits seems necessary if we want that order
> to be meaningful for encoded values, not only for statistics/comparators.
> However, this is a visible behavior change: dictionary encoding may persist
> distinct NaN payloads as distinct dictionary values instead of one canonical
> NaN.
>
> I think we should clarify a few questions:
> - Should FLOAT/DOUBLE encodings preserve raw NaN bits?
> - Should dictionary encoding preserve distinct NaN payloads as distinct
>   dictionary values?
> - Should this depend on TYPE_DEFINED_ORDER vs IEEE_754_TOTAL_ORDER?
> - What should dictionary filters and bloom filters assume for NaN values?
>
> My inclination is that FLOAT/DOUBLE value encodings should always preserve
> raw NaN bits. So my PR should be regarded as a bug fix. But since this
> changes parquet-java behavior, I wanted to discuss it explicitly.
>
> What do others think?
>
> [1] https://github.com/apache/parquet-java/pull/3393
>
> Best,
> Gang
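
As a footnote to the point about boxed Java sets versus raw-bit hashing, the following minimal Java sketch shows why the two can disagree about how many distinct NaN values exist. It uses only the standard library and does not call any parquet-java API; the class, method, and constant names are hypothetical, and the payload value is an assumed example.

```java
import java.util.HashSet;
import java.util.Set;

// Illustration only; NaNEqualitySketch and its members are hypothetical names.
class NaNEqualitySketch {
    static final double CANONICAL = Double.NaN;
    // A quiet NaN with a non-zero payload (assumed example value).
    static final double PAYLOAD = Double.longBitsToDouble(0x7ff8000000000001L);

    // Boxed Double equality and hashing treat every NaN as the same value.
    static int boxedDistinctCount(double... values) {
        Set<Double> set = new HashSet<>();
        for (double v : values) {
            set.add(v);
        }
        return set.size();
    }

    // Keying on raw bits distinguishes NaN payloads.
    static int rawDistinctCount(double... values) {
        Set<Long> set = new HashSet<>();
        for (double v : values) {
            set.add(Double.doubleToRawLongBits(v));
        }
        return set.size();
    }

    public static void main(String[] args) {
        System.out.println(boxedDistinctCount(CANONICAL, PAYLOAD));
        System.out.println(rawDistinctCount(CANONICAL, PAYLOAD));
        System.out.println(Double.compare(CANONICAL, PAYLOAD));
    }
}
```

A structure keyed the first way and one keyed the second way can therefore give different answers for the same column once non-canonical NaNs are written, which is the inconsistency discussed above.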
