Dear Parquet and Impala Developers,

We have exposed min/max statistics to extensive compatibility testing and
found troubling inconsistencies regarding float and double values. Under
certain (fortunately rather extreme) circumstances, this can lead to
predicate pushdown incorrectly discarding row groups that contain matching
rows.

The root of the problem seems to be that Impala (and probably parquet-cpp
as well) uses C++ comparison operators for floating point numbers and those
do not provide a total ordering. This is actually in line with IEEE 754,
according to which -0 is neither less nor more than +0 and comparing NaN to
anything always returns false. This, however is not suitable for statistics
and can lead to serious consequences that you can read more about in
IMPALA-6527 <https://issues.apache.org/jira/browse/IMPALA-6527>.

The IEEE 754 standard and the Java API, on the other hand, both provide a
total ordering, but I'm not sure whether the two are the same. The java
implementation looks relatively simple - both easy to understand and
effective to execute. You can check it here
<http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/lang/Double.java#l999>.
The IEEE 754 total ordering, on the other hand looks rather complicated to
the extent that I can not decide whether the Java implementation adheres to
it. I couldn't find the whole standard online, but I found an excerpt about
the totalOrder predicate here
<https://github.com/rust-lang/rust/issues/5585>. Additionally, I have also
found that IEEE 754-2008 defines min and max operations as described here
<https://en.wikipedia.org/wiki/IEEE_754_revision#min_and_max> that
strangely *do not* adhere to a total ordering.

I checked the specification in parquet-format but all I could find about
floating point numbers is the following:

   *   FLOAT - signed comparison of the represented value
   *   DOUBLE - signed comparison of the represented value

I suggest extending the specification to explicitly require implementations
to follow a specific comparison logic for these types. The candidates are:

   - The Java implementation
   
<http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/lang/Double.java#l999>
   which looks easy and efficient to implement in any language.
   - The IEEE 754 totalOrder <https://github.com/rust-lang/rust/issues/5585>
   predicate which honestly looks rather scary.
   - The IEEE 754-2008 min and max
   <https://en.wikipedia.org/wiki/IEEE_754_revision#min_and_max> operations
   which may be hard to use for comparison.

I'm curious to hear your opinions.

Thanks,

Zoltan

Reply via email to