We could also consider treating NaN similar to NULL and having a separate piece of information with a count of NaN values (or just a bit indicating presence/absence of NaN). I'm not sure if that is easier or harder to implement than a total order.
On Thu, Feb 15, 2018 at 9:12 AM, Laszlo Gaal <laszlo.g...@cloudera.com> wrote: > To supply some context: Impala has had a number of issues > <https://issues.apache.org/jira/issues/?jql=project% > 3Dimpala%20and%20summary%20%20~%20NaN> > around NaN/infinity: > > The closest precedent related to the current issue seems to be IMPALA-6295 > <https://issues.apache.org/jira/browse/IMPALA-6295>: "Inconsistent > handling > of 'nan' and 'inf' with min/max analytic fns": the discussion there offers > notable points on: > 1. How Impala handles similar problems in different (but related) areas, > 2. How other database products (Hive, PostgeSQL, etc.) handle similar > issues around NaNs/infinity (or infinities, in the case of IEEE-754). > > Thanks, > > - LaszloG > > > On Thu, Feb 15, 2018 at 5:10 PM, Zoltan Ivanfi <z...@cloudera.com> wrote: > > > Dear Parquet and Impala Developers, > > > > We have exposed min/max statistics to extensive compatibility testing and > > found troubling inconsistencies regarding float and double values. Under > > certain (fortunately rather extreme) circumstances, this can lead to > > predicate pushdown incorrectly discarding row groups that contain > matching > > rows. > > > > The root of the problem seems to be that Impala (and probably parquet-cpp > > as well) uses C++ comparison operators for floating point numbers and > those > > do not provide a total ordering. This is actually in line with IEEE 754, > > according to which -0 is neither less nor more than +0 and comparing NaN > to > > anything always returns false. This, however is not suitable for > statistics > > and can lead to serious consequences that you can read more about in > > IMPALA-6527 <https://issues.apache.org/jira/browse/IMPALA-6527>. > > > > The IEEE 754 standard and the Java API, on the other hand, both provide a > > total ordering, but I'm not sure whether the two are the same. The java > > implementation looks relatively simple - both easy to understand and > > effective to execute. You can check it here > > <http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/ > > classes/java/lang/Double.java#l999>. > > The IEEE 754 total ordering, on the other hand looks rather complicated > to > > the extent that I can not decide whether the Java implementation adheres > to > > it. I couldn't find the whole standard online, but I found an excerpt > about > > the totalOrder predicate here > > <https://github.com/rust-lang/rust/issues/5585>. Additionally, I have > also > > found that IEEE 754-2008 defines min and max operations as described here > > <https://en.wikipedia.org/wiki/IEEE_754_revision#min_and_max> that > > strangely *do not* adhere to a total ordering. > > > > I checked the specification in parquet-format but all I could find about > > floating point numbers is the following: > > > > * FLOAT - signed comparison of the represented value > > * DOUBLE - signed comparison of the represented value > > > > I suggest extending the specification to explicitly require > implementations > > to follow a specific comparison logic for these types. The candidates > are: > > > > - The Java implementation > > <http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/ > 687fd7c7986d/src/share/ > > classes/java/lang/Double.java#l999> > > which looks easy and efficient to implement in any language. > > - The IEEE 754 totalOrder <https://github.com/rust-lang/ > > rust/issues/5585> > > predicate which honestly looks rather scary. > > - The IEEE 754-2008 min and max > > <https://en.wikipedia.org/wiki/IEEE_754_revision#min_and_max> > > operations > > which may be hard to use for comparison. > > > > I'm curious to hear your opinions. > > > > Thanks, > > > > Zoltan > > >