[ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611356#comment-17611356
 ] 

Micah Kornfield commented on PARQUET-1222:
------------------------------------------

I'd propose the following "fix":
- Add a new optional bool field, "contains_nan", to the Statistics struct.  
When it is unset, I think we should specify that the comparison semantics for 
-0.0/+0.0 and NaN are not well defined and that implementations have taken 
different routes.
- When it is set, true means the column contains at least one NaN and false 
means no NaNs are present.  Further, when set, it implies the following 
ordering: NaNs are never included in the Min/Max statistics in the struct, and 
-0.0 and +0.0 are considered two distinct values, ordered according to sign 
(-0.0 before +0.0).
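
To make the proposed semantics concrete, here is a rough Python sketch of how 
a writer might compute the statistics when contains_nan is set.  FloatStats and 
compute_stats are hypothetical names, not part of parquet-format, and leaving 
min/max unset for an all-NaN column is an assumption the proposal itself does 
not settle.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class FloatStats:
    # Hypothetical stand-in for the Statistics struct; only the
    # contains_nan field comes from the proposal.
    min: Optional[float]
    max: Optional[float]
    contains_nan: bool


def _sign_aware_key(x: float):
    # Order by numeric value; break the -0.0 == +0.0 tie by sign so
    # that -0.0 sorts before +0.0.
    return (x, math.copysign(1.0, x))


def compute_stats(values: Iterable[float]) -> FloatStats:
    contains_nan = False
    lo = hi = None
    for v in values:
        if math.isnan(v):
            # Record the NaN but keep it out of min/max.
            contains_nan = True
            continue
        if lo is None or _sign_aware_key(v) < _sign_aware_key(lo):
            lo = v
        if hi is None or _sign_aware_key(v) > _sign_aware_key(hi):
            hi = v
    # Assumption: min/max stay unset (None) if every value was NaN.
    return FloatStats(min=lo, max=hi, contains_nan=contains_nan)
```

With this sketch, a column holding only 0.0 and -0.0 would get min = -0.0 and 
max = +0.0, so a reader filtering on either zero cannot wrongly skip the page.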

Thoughts?  Should I bring this up on the mailing list?

> Specify a well-defined sorting order for float and double types
> ---------------------------------------------------------------
>
>                 Key: PARQUET-1222
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1222
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Zoltan Ivanfi
>            Priority: Critical
>
> Currently parquet-format specifies the sort order for floating point numbers 
> as follows:
> {code:java}
>    *   FLOAT - signed comparison of the represented value
>    *   DOUBLE - signed comparison of the represented value
> {code}
> The problem is that the comparison of floating point numbers is only a 
> partial ordering with strange behaviour in specific corner cases. For 
> example, according to IEEE 754, -0 is neither less nor more than +0 and 
> comparing NaN to anything always returns false. This ordering is not suitable 
> for statistics. Additionally, the Java implementation already uses a 
> different (total) ordering that handles these cases correctly but differently 
> than the C++ implementations, which leads to interoperability problems.
> TypeDefinedOrder for doubles and floats should be deprecated and a new 
> TotalFloatingPointOrder should be introduced. The default for writing doubles 
> and floats would be the new TotalFloatingPointOrder. This ordering should be 
> effective and easy to implement in all programming languages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
