Hi all,

While working on IEEE 754 total ordering and nan_count in parquet-java [1],
Gabor pointed out one area that seems underspecified in the format.

The spec describes floating-point ordering and statistics behavior, but it does
not seem to clearly say whether FLOAT/DOUBLE value encodings should preserve raw
NaN sign/payload bits. In parquet-java today, PLAIN, BYTE_STREAM_SPLIT, and
dictionary encoding canonicalize NaN values when writing encoded values.
DictionaryFilter also expands FLOAT/DOUBLE dictionaries into boxed Java sets,
which collapse NaNs by Java equality/hash semantics. Bloom filters are
different: parquet-java already hashes FLOAT/DOUBLE values using raw bits.
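To make the difference concrete, here is a minimal standalone sketch (not
parquet-java code; the class name and the two NaN bit patterns are made up for
illustration). It uses only the standard Float methods: floatToIntBits
collapses every NaN to 0x7fc00000, floatToRawIntBits keeps the sign/payload
bits, and a boxed Set<Float> follows the collapsing behavior because
Float.equals/hashCode are defined in terms of floatToIntBits. It assumes quiet
NaNs, which survive intBitsToFloat on common JVMs.

```java
import java.util.HashSet;
import java.util.Set;

final class NanBitsDemo {
    // Two quiet NaNs that differ in sign and payload (illustrative values).
    static final float NAN_A = Float.intBitsToFloat(0x7fc00001);
    static final float NAN_B = Float.intBitsToFloat(0xffc00002);

    // Canonicalizing view: every NaN maps to the same bit pattern.
    static int canonicalBits(float v) {
        return Float.floatToIntBits(v);
    }

    // Raw view: sign and payload bits are kept as stored.
    static int rawBits(float v) {
        return Float.floatToRawIntBits(v);
    }

    // Boxed floats use Float.equals/hashCode, so all NaNs collapse into one entry.
    static int boxedSetSize() {
        Set<Float> values = new HashSet<>();
        values.add(NAN_A);
        values.add(NAN_B);
        return values.size();
    }

    // Keying by raw bits keeps the two NaNs apart.
    static int rawBitsSetSize() {
        Set<Integer> values = new HashSet<>();
        values.add(rawBits(NAN_A));
        values.add(rawBits(NAN_B));
        return values.size();
    }

    public static void main(String[] args) {
        System.out.printf("canonical: %08x %08x%n", canonicalBits(NAN_A), canonicalBits(NAN_B));
        System.out.printf("raw:       %08x %08x%n", rawBits(NAN_A), rawBits(NAN_B));
        System.out.println("boxed set size:    " + boxedSetSize());
        System.out.println("raw-bits set size: " + rawBitsSetSize());
    }
}
```

The same distinction applies to double via Double.doubleToLongBits and
Double.doubleToRawLongBits.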

The PR changes PLAIN, BYTE_STREAM_SPLIT, and dictionary encoding to preserve raw
NaN bits. My reasoning was that IEEE_754_TOTAL_ORDER distinguishes NaN bit
patterns, so preserving raw bits seems necessary if we want that order to be
meaningful for encoded values, not only for statistics/comparators. However,
this is a visible behavior change: dictionary encoding may persist distinct NaN
payloads as distinct dictionary values instead of one canonical NaN.
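As a rough illustration of that visible change, the sketch below builds a toy
dictionary keyed either by canonical or by raw bits. It is a hypothetical
helper, not the parquet-java dictionary writer; the class name, method names,
and sample values are assumptions for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class NanDictionarySketch {
    // Assigns dictionary ids in first-seen order and returns the number of entries.
    // preserveRawBits=false mimics canonicalizing NaNs before the lookup;
    // preserveRawBits=true keeps each distinct NaN bit pattern as its own entry.
    static int dictionarySize(float[] values, boolean preserveRawBits) {
        Map<Integer, Integer> ids = new LinkedHashMap<>();
        for (float v : values) {
            int key = preserveRawBits
                    ? Float.floatToRawIntBits(v)
                    : Float.floatToIntBits(v);
            ids.putIfAbsent(key, ids.size());
        }
        return ids.size();
    }

    // One ordinary value (repeated) plus three quiet NaNs with different sign/payload.
    static float[] sample() {
        return new float[] {
            1.0f,
            Float.intBitsToFloat(0x7fc00000),
            Float.intBitsToFloat(0x7fc00001),
            Float.intBitsToFloat(0xffc00000),
            1.0f
        };
    }

    public static void main(String[] args) {
        System.out.println("canonicalized entries: " + dictionarySize(sample(), false));
        System.out.println("raw-bit entries:       " + dictionarySize(sample(), true));
    }
}
```

With canonicalization the three NaNs share one entry; with raw bits each
pattern gets its own, so a column with many distinct payloads would see a
larger dictionary.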

I think we should clarify a few questions:
- Should FLOAT/DOUBLE encodings preserve raw NaN bits?
- Should dictionary encoding preserve distinct NaN payloads as distinct
  dictionary values?
- Should this depend on TYPE_DEFINED_ORDER vs IEEE_754_TOTAL_ORDER?
- What should dictionary filters and bloom filters assume for NaN values?
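For context on why the ordering question matters, here is a small sketch of an
IEEE 754 totalOrder-style comparison for floats, using the common bit trick of
flipping the low 31 bits of negative values and then comparing as signed
integers. This is my own illustration under that assumption, not the
comparator from the PR; the class and method names are made up.

```java
final class TotalOrderSketch {
    // Maps raw float bits to an int whose signed order matches totalOrder:
    // -NaN < -Infinity < ... < -0.0 < +0.0 < ... < +Infinity < +NaN.
    static int key(float v) {
        int bits = Float.floatToRawIntBits(v);
        // For negative values (sign bit set) flip the low 31 bits; leave others unchanged.
        return bits ^ ((bits >> 31) & 0x7fffffff);
    }

    static int compare(float a, float b) {
        return Integer.compare(key(a), key(b));
    }

    public static void main(String[] args) {
        float nanLow = Float.intBitsToFloat(0x7fc00000);
        float nanHigh = Float.intBitsToFloat(0x7fc00001);
        // Float.compare treats all NaNs as equal; the total order does not.
        System.out.println("Float.compare:  " + Float.compare(nanLow, nanHigh));
        System.out.println("total order:    " + compare(nanLow, nanHigh));
        System.out.println("-0.0 vs +0.0:   " + compare(-0.0f, 0.0f));
    }
}
```

Because the comparison sees sign and payload bits, two NaNs that were
canonicalized on write can no longer be told apart on read, which is the
reason I tie the ordering question to the encoding question.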

My inclination is that FLOAT/DOUBLE value encodings should always preserve raw
NaN bits, so I regard my PR as a bug fix. But since it changes existing
parquet-java behavior, I wanted to discuss it explicitly.

What do others think?

[1] https://github.com/apache/parquet-java/pull/3393

Best,
Gang
