On Mon, Feb 19, 2018 at 8:04 AM, Zoltan Ivanfi <z...@cloudera.com> wrote:
> Hi,
>
> Tim, I added your suggestion to introduce a new ColumnOrder to PARQUET-1222
> <https://issues.apache.org/jira/browse/PARQUET-1222> as the preferred
> solution.
>
> Alex, not writing min/max if there is a NaN is indeed a feasible quick-fix,
> but I think it would be better to just ignore NaNs for the purposes of
> min/max stats. For reading, we can ignore stats that contain a NaN. We also
> shouldn't use stats when looking for a NaN. -0 and +0 will still be
> problematic, though.
>

I don't think ignoring NaNs is correct. Consider a predicate
<col> != <constant> that would evaluate to true against NaN. We cannot
reliably use stats for such a predicate.

>
> Jim, fmax is indeed very close to IEEE-754's maxNum, but -0 and +0 are
> implementation-dependent, as Zoltan Borok-Nagy pointed out to me: "This
> function is not required to be sensitive to the sign of zero, although some
> implementations additionally enforce that if one argument is +0 and the
> other is -0, then +0 is returned." [1
> <http://en.cppreference.com/w/c/numeric/math/fmax>]
>
> Br,
>
> Zoltan
>
> On Fri, Feb 16, 2018 at 6:57 PM Jim Apple <jbap...@cloudera.com> wrote:
>
> > On Fri, Feb 16, 2018 at 9:44 AM, Zoltan Borok-Nagy
> > <borokna...@cloudera.com> wrote:
> > > I would just like to mention that the fmax() / fmin() functions in the
> > > C/C++ Math library follow the aforementioned IEEE 754-2008 min and max
> > > specification:
> > > http://en.cppreference.com/w/c/numeric/math/fmax
> > >
> > > I think this behavior is also the most intuitive and useful with regard
> > > to statistics. If we want to select the max value, I think it's
> > > reasonable to ignore nulls and not-numbers.
> >
> > It should be noted that this is different than the total ordering
> > predicate. With that predicate, -NaN < -inf < negative numbers < -0.0
> > < +0.0 < positive numbers < +inf < +NaN
> >
> > fmax appears to be closest to IEEE-754's maxNum, but not quite
> > matching for some corner cases (-0.0, signalling NaN), but I'm not
> > 100% sure on that.
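
To make the != case above concrete, here is a rough Python sketch of how
NaN-ignoring stats could mislead a reader. The helper names
(stats_ignoring_nan, can_skip_not_equal) are made up for illustration and
are not taken from any Parquet or Impala code; the pruning rule is just an
assumed naive one, namely "skip the page if min == max == constant".

```python
import math


def stats_ignoring_nan(values):
    """Hypothetical writer: compute (min, max) over the non-NaN values only."""
    non_nan = [v for v in values if not math.isnan(v)]
    return (min(non_nan), max(non_nan)) if non_nan else None


def can_skip_not_equal(stats, constant):
    """Assumed naive pruning rule for `col != constant`.

    If min == max == constant, every (non-NaN) value equals the constant,
    so the reader concludes that no row can satisfy the predicate.
    """
    return stats is not None and stats[0] == constant and stats[1] == constant


page = [1.0, float("nan"), 1.0]

stats = stats_ignoring_nan(page)           # the NaN leaves no trace in the stats
skipped = can_skip_not_equal(stats, 1.0)   # reader decides to skip the page

# Evaluating the predicate row by row: NaN != 1.0 is true, so one row matches.
matching = [v for v in page if v != 1.0]
```

Here the stats come out as (1.0, 1.0), so the reader skips the page, yet the
NaN row does satisfy col != 1.0 and would be silently dropped. A reader could
only use such stats for != if it also knew that the page contains no NaN.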