Hi all,

While working on IEEE 754 total ordering and nan_count in parquet-java [1], Gabor pointed out one area that seems underspecified in the format.

The spec describes floating-point ordering and statistics behavior, but it does not seem to clearly say whether FLOAT/DOUBLE value encodings should preserve raw NaN sign/payload bits.

In parquet-java today, PLAIN, BYTE_STREAM_SPLIT, and dictionary encoding canonicalize NaN values when writing encoded values. DictionaryFilter also expands FLOAT/DOUBLE dictionaries into boxed Java sets, which collapse NaNs by Java equality/hash semantics. Bloom filters are different: parquet-java already hashes FLOAT/DOUBLE values using raw bits.

The PR changes PLAIN, BYTE_STREAM_SPLIT, and dictionary encoding to preserve raw NaN bits. My reasoning was that IEEE_754_TOTAL_ORDER distinguishes NaN bit patterns, so preserving raw bits seems necessary if we want that order to be meaningful for encoded values, not only for statistics/comparators.

However, this is a visible behavior change: dictionary encoding may persist distinct NaN payloads as distinct dictionary values instead of one canonical NaN.

I think we should clarify a few questions:

- Should FLOAT/DOUBLE encodings preserve raw NaN bits?
- Should dictionary encoding preserve distinct NaN payloads as distinct dictionary values?
- Should this depend on TYPE_DEFINED_ORDER vs IEEE_754_TOTAL_ORDER?
- What should dictionary filters and bloom filters assume for NaN values?

My inclination is that FLOAT/DOUBLE value encodings should always preserve raw NaN bits. So my PR should be regarded as a bug fix. But since this changes parquet-java behavior, I wanted to discuss it explicitly.

What do others think?

[1] https://github.com/apache/parquet-java/pull/3393

Best,
Gang
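
P.S. To make the distinction concrete, here is a minimal sketch that uses only JDK methods. It is not parquet-java code: the class name NanBitsDemo, its helper methods, and the two NaN bit patterns are made up for illustration, and I am assuming that "canonicalize" behaves like Float.floatToIntBits while "raw bits" corresponds to Float.floatToRawIntBits. It shows two NaNs with different sign/payload collapsing to one value under the canonicalizing view and under boxed Set<Float> semantics, while staying distinct when keyed by raw bits.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class for illustration; not part of parquet-java.
class NanBitsDemo {
    // Two quiet NaN bit patterns that differ in sign and payload (arbitrary choices).
    // Quiet NaNs are used because the JDK docs note that some platforms may
    // alter signaling NaN bits when converting int bits to float.
    static final int NAN_A = 0x7fc00001;
    static final int NAN_B = 0xffc00002;

    // Canonicalizing view: Float.floatToIntBits maps every NaN to 0x7fc00000.
    static int canonicalBits(float f) {
        return Float.floatToIntBits(f);
    }

    // Raw view: Float.floatToRawIntBits keeps the NaN sign and payload bits.
    static int rawBits(float f) {
        return Float.floatToRawIntBits(f);
    }

    // Distinct count after boxing into a Set<Float>. Float.equals and
    // Float.hashCode are defined via floatToIntBits, so all NaNs collapse.
    static int distinctBoxed(float... values) {
        Set<Float> set = new HashSet<>();
        for (float v : values) {
            set.add(v);
        }
        return set.size();
    }

    // Distinct count when values are keyed by their raw bit patterns instead.
    static int distinctByRawBits(float... values) {
        Set<Integer> set = new HashSet<>();
        for (float v : values) {
            set.add(Float.floatToRawIntBits(v));
        }
        return set.size();
    }

    public static void main(String[] args) {
        float a = Float.intBitsToFloat(NAN_A);
        float b = Float.intBitsToFloat(NAN_B);
        System.out.printf("canonical: %08x %08x%n", canonicalBits(a), canonicalBits(b));
        System.out.printf("raw:       %08x %08x%n", rawBits(a), rawBits(b));
        System.out.println("distinct (boxed Set<Float>): " + distinctBoxed(a, b));
        System.out.println("distinct (raw bits):         " + distinctByRawBits(a, b));
    }
}
```

The same distinction applies to DOUBLE via Double.doubleToLongBits and Double.doubleToRawLongBits. If encodings preserve raw bits, a dictionary filter that wants to agree with them would presumably need to key on raw bits as well, which is part of what the questions above are asking.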
