Dear Parquet and Impala Developers, We have exposed min/max statistics to extensive compatibility testing and found troubling inconsistencies regarding float and double values. Under certain (fortunately rather extreme) circumstances, this can lead to predicate pushdown incorrectly discarding row groups that contain matching rows.
The root of the problem seems to be that Impala (and probably parquet-cpp as well) uses C++ comparison operators for floating point numbers and those do not provide a total ordering. This is actually in line with IEEE 754, according to which -0 is neither less nor more than +0 and comparing NaN to anything always returns false. This, however is not suitable for statistics and can lead to serious consequences that you can read more about in IMPALA-6527 <https://issues.apache.org/jira/browse/IMPALA-6527>. The IEEE 754 standard and the Java API, on the other hand, both provide a total ordering, but I'm not sure whether the two are the same. The java implementation looks relatively simple - both easy to understand and effective to execute. You can check it here <http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/lang/Double.java#l999>. The IEEE 754 total ordering, on the other hand looks rather complicated to the extent that I can not decide whether the Java implementation adheres to it. I couldn't find the whole standard online, but I found an excerpt about the totalOrder predicate here <https://github.com/rust-lang/rust/issues/5585>. Additionally, I have also found that IEEE 754-2008 defines min and max operations as described here <https://en.wikipedia.org/wiki/IEEE_754_revision#min_and_max> that strangely *do not* adhere to a total ordering. I checked the specification in parquet-format but all I could find about floating point numbers is the following: * FLOAT - signed comparison of the represented value * DOUBLE - signed comparison of the represented value I suggest extending the specification to explicitly require implementations to follow a specific comparison logic for these types. The candidates are: - The Java implementation <http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/lang/Double.java#l999> which looks easy and efficient to implement in any language. - The IEEE 754 totalOrder <https://github.com/rust-lang/rust/issues/5585> predicate which honestly looks rather scary. - The IEEE 754-2008 min and max <https://en.wikipedia.org/wiki/IEEE_754_revision#min_and_max> operations which may be hard to use for comparison. I'm curious to hear your opinions. Thanks, Zoltan
