Today, Impala does not evaluate "<col> != <constant>" against stats, but as
Zoltan pointed out, there is a reasonable way to do that. It does not work
if the stats ignore NaNs, though, so we need to be careful.
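
To make the condition concrete, here is a minimal sketch of the skip check,
written in Python only for brevity (Impala itself is C++). RowGroupStats,
may_contain_nan and can_skip_for_not_equal are hypothetical names used for
illustration, not Impala or parquet-mr APIs, and the sketch assumes a writer
that leaves NaNs out of min/max.

```python
import math
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical per-row-group column stats (not a real Impala/parquet-mr type)."""
    min_value: float
    max_value: float
    # Assumption: True when the writer may have dropped NaNs while computing min/max.
    may_contain_nan: bool


def can_skip_for_not_equal(stats: RowGroupStats, constant: float) -> bool:
    """Return True only if no row in the row group can satisfy `col != constant`."""
    # `col != NaN` is true for every non-null value, and NaN stats tell us nothing.
    if math.isnan(constant) or math.isnan(stats.min_value) or math.isnan(stats.max_value):
        return False
    # A hidden NaN would satisfy `col != constant`, so the stats cannot prove emptiness.
    if stats.may_contain_nan:
        return False
    # min == max == constant: every non-null value equals the constant.
    return stats.min_value == constant and stats.max_value == constant
```

With min = max = 5.0 and no possibility of NaNs, the row group can be skipped
for "<col> != 5.0". As soon as NaNs may have been left out of the stats, the
same row group has to be read.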

On Tue, Feb 20, 2018 at 9:24 AM, Zoltan Ivanfi <z...@cloudera.com> wrote:

> In parquet-mr, if you are looking for a value that is not equal to some
> reference value r and stats are min = r and max = r then that row group is
> discarded, because there can not be any other values in that row group.
>
> On Tue, Feb 20, 2018 at 6:21 PM Jim Apple <jbap...@cloudera.com> wrote:
>
> > For that predicate in particular, does Impala use stats already?
> >
> > Let's say a column contains only the intuitive notion of floats: no
> > NaNs, no infs, no -0.0. If we are filtering for $COL != a and the
> > row-group stats are b <= $COL <= c, where a < b, we can know that the
> > whole row group can be included. The addition of NaNs doesn't change
> > that.
> >
> > OTOH, if b <= a <= c, then we have to check the whole row group, and
> > the addition of NaNs doesn't change that.
> >
> > On Tue, Feb 20, 2018 at 9:14 AM, Alexander Behm <alex.b...@cloudera.com>
> > wrote:
> > > On Mon, Feb 19, 2018 at 8:04 AM, Zoltan Ivanfi <z...@cloudera.com>
> > > wrote:
> > >
> > >> Hi,
> > >>
> > >> Tim, I added your suggestion to introduce a new ColumnOrder to
> > >> PARQUET-1222 <https://issues.apache.org/jira/browse/PARQUET-1222> as the
> > >> preferred solution.
> > >>
> > >> Alex, not writing min/max if there is a NaN is indeed a feasible
> > >> quick-fix, but I think it would be better to just ignore NaN-s for the
> > >> purposes of min/max stats. For reading, we can ignore stats that contain
> > >> a NaN. We also shouldn't use stats when looking for a NaN. -0 and +0 will
> > >> still be problematic, though.
> > >>
> > >
> > > I don't think ignoring NaNs is correct. Consider a predicate <col> !=
> > > <constant> that would evaluate to true against NaN. We cannot reliably
> > > use stats for such a predicate.
> > >
> > >
> > >>
> > >> Jim, fmax is indeed very close to IEEE-754's maxNum, but -0 and +0 are
> > >> implementation-dependent, as Zoltan Borok-Nagy pointed out to me: "This
> > >> function is not required to be sensitive to the sign of zero, although
> > >> some implementations additionally enforce that if one argument is +0 and
> > >> the other is -0, then +0 is returned." [1
> > >> <http://en.cppreference.com/w/c/numeric/math/fmax>]
> > >>
> > >> Br,
> > >>
> > >> Zoltan
> > >>
> > >>
> > >>
> > >> On Fri, Feb 16, 2018 at 6:57 PM Jim Apple <jbap...@cloudera.com>
> > >> wrote:
> > >>
> > >> > On Fri, Feb 16, 2018 at 9:44 AM, Zoltan Borok-Nagy
> > >> > <borokna...@cloudera.com> wrote:
> > >> > > I would just like to mention that the fmax() / fmin() functions in
> > >> > > the C/C++ math library follow the aforementioned IEEE 754-2008 min
> > >> > > and max specification:
> > >> > > http://en.cppreference.com/w/c/numeric/math/fmax
> > >> > >
> > >> > > I think this behavior is also the most intuitive and useful with
> > >> > > regard to statistics. If we want to select the max value, I think
> > >> > > it's reasonable to ignore nulls and not-numbers.
> > >> >
> > >> > It should be noted that this is different than the total ordering
> > >> > predicate. With that predicate, -NaN < -inf < negative numbers < -0.0
> > >> > < +0.0 < positive numbers < +inf < +NaN
> > >> >
> > >> > fmax appears to be closest to IEEE-754's maxNum, but not quite
> > >> > matching for some corner cases (-0.0, signalling NaN), but I'm not
> > >> > 100% sure on that.
> > >> >
> > >>
> >
>
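
For reference, the two orderings discussed in the quoted thread can be
sketched as follows. This is only an illustration in Python: total_order_key
and max_ignoring_nan are hypothetical helper names, the key uses the common
bit-flipping trick and assumes IEEE-754 binary64 doubles, and
max_ignoring_nan only mimics the NaN handling described for fmax (it makes no
promise about which zero it returns when both -0.0 and +0.0 are present).

```python
import math
import struct


def total_order_key(x: float) -> int:
    """Map a double to an integer whose order matches the total ordering:
    -NaN < -inf < negatives < -0.0 < +0.0 < positives < +inf < +NaN."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    if bits >> 63:
        # Sign bit set: flip every bit so larger magnitudes sort lower.
        return bits ^ 0xFFFFFFFFFFFFFFFF
    # Sign bit clear: set it so all of these sort above the negatives.
    return bits | (1 << 63)


def max_ignoring_nan(values):
    """Max over the non-NaN values; NaN only if there is nothing else."""
    non_nan = [v for v in values if not math.isnan(v)]
    return max(non_nan) if non_nan else math.nan
```

Sorting with key=total_order_key separates -0.0 from +0.0 and places the NaNs
at the two ends, whereas max_ignoring_nan never reports a NaN unless every
input is a NaN.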
