I would just like to mention that the fmax() / fmin() functions in C/C++
Math library follow the aforementioned IEEE 754-2008 min and max
specification:
http://en.cppreference.com/w/c/numeric/math/fmax

I think this behavior is also the most intuitive and useful regarding to
statistics. If we want to select the max value, I think it's reasonable to
ignore nulls and not-numbers.
I think it serves well the common case as well. E.g. there are a lot of
numbers, and some NaNs, I don't know if we want to ruin our upper bound by
choosing NaN as max.

And if there's still a NaN by an old writer, we can treat is as Inf.

Just my two cents.


On Fri, Feb 16, 2018 at 6:38 PM, Tim Armstrong <[email protected]>
wrote:

> The reader still can't correctly interpret those stats without knowing
> about the behaviour of that specific writer though, because it can't assume
> the absence of NaNs unless it knows that they are reading a file written by
> a writer that drops stats when it sees NaNs.
>
> It *could* fix the behaviour of some naive readers that don't correctly
> handle the current ambiguity in the specification, but I think those need
> to be fixed anyway because they will return wrong results for existing
> files.
>
> In the process of fixing the readers, you could then modify the readers so
> that they are aware of this special writer that drops stats with NaNs and
> knows that it is safe to use them, but I think those kind of shared
> reader-writer assumptions are essentially like having an unofficial
> extension of the Parquet spec.
>
> On Fri, Feb 16, 2018 at 9:20 AM, Lars Volker <[email protected]> wrote:
>
> > Yeah, I missed that. We set it per column, so all other types could keep
> > TypeDefinedOrder and floats could have something like
> NanAwareDoubleOrder.
> >
> > On Fri, Feb 16, 2018 at 9:18 AM, Tim Armstrong <[email protected]>
> > wrote:
> >
> > > We wouldn't need to rev the whole TypeDefinedOrder thing right?
> Couldn't
> > we
> > > just define a special order for floats? Essentially it would be a tag
> for
> > > writers to say "hey I know about this total order thing".
> > >
> > > On Fri, Feb 16, 2018 at 9:14 AM, Lars Volker <[email protected]> wrote:
> > >
> > > > I think one idea behind the column order fields was that if a reader
> > does
> > > > not recognize a value there, it needs to ignore the stats. If I
> > remember
> > > > correctly, that was intended to allow us to add new orderings for
> > > > collations, but it also seems useful to address gaps in the spec or
> > known
> > > > broken readers. In this case we would need to deprecate the default
> > > > "TypeDefinedOrder" and replace it with something like
> > > > "TypeDefinedOrderWithCorrectOrderingForDoubles". We could also count
> > up,
> > > > like TypeDefinedOrderV2 and so on.
> > > >
> > > > An alternative would be to list all writers that are known to have
> > > written
> > > > incorrect stats. However that will not prevent old implementations to
> > > > misinterpret correct stats - which I think was the main reason why we
> > > added
> > > > new stats fields.
> > > >
> > > >
> > > >
> > > > On Fri, Feb 16, 2018 at 9:03 AM, Alexander Behm <
> > [email protected]>
> > > > wrote:
> > > >
> > > > > I hope the common cases is that data files do not contain these
> > special
> > > > > float values. As the simplest solution, how about writers refrain
> > from
> > > > > populating the stats if a special value is encountered?
> > > > >
> > > > > That fix does not preclude a more thorough solution in the future,
> > but
> > > it
> > > > > addresses the common case quickly.
> > > > >
> > > > > For existing data files we could check the writer version ignore
> > > filters
> > > > on
> > > > > float/double. I don't know whether min/max filtering is common on
> > > > > float/double, but I suspect it's not.
> > > > >
> > > > > On Fri, Feb 16, 2018 at 8:38 AM, Tim Armstrong <
> > > [email protected]>
> > > > > wrote:
> > > > >
> > > > > > There is an extensibility mechanism with the ColumnOrder union -
> I
> > > > think
> > > > > > that was meant to avoid the need to add new stat fields?
> > > > > >
> > > > > > Given that the bug was in the Parquet spec, we'll need to make a
> > spec
> > > > > > change anyway, so we could add a new ColumnOrder -
> > > > > FloatingPointTotalOrder?
> > > > > > at the same time as fixing the gap in the spec.
> > > > > >
> > > > > > It could make sense to declare that the default ordering for
> > > > > floats/doubles
> > > > > > is not NaN-aware (i.e. the reader should assume that NaN was
> > > > arbitrarily
> > > > > > ordered) and readers should either implement the required logic
> to
> > > > handle
> > > > > > that correctly (I had some ideas here:
> > > > > > https://issues.apache.org/jira/browse/IMPALA-6527?
> > > > > > focusedCommentId=16366106&page=com.atlassian.jira.
> > > > > > plugin.system.issuetabpanels%3Acomment-tabpanel#comment-
> 16366106)
> > > > > > or ignore the stats.
> > > > > >
> > > > > > On Fri, Feb 16, 2018 at 8:15 AM, Jim Apple <[email protected]
> >
> > > > wrote:
> > > > > >
> > > > > > > > We could have a similar problem
> > > > > > > > with not finding +0.0 values because a -0.0 is written to the
> > > > > max_value
> > > > > > > > field by some component that considers them the same.
> > > > > > >
> > > > > > > My hope is that the filtering would behave sanely, since -0.0
> ==
> > > +0.0
> > > > > > > under the real-number-inspired ordering, which is distinguished
> > > from
> > > > > > > total Ordering, and which is also what you get when you use the
> > > > > > > default C/C++ operators <, >, <=, ==, and so on.
> > > > > > >
> > > > > > > You can distinguish between -0.0 and +0.0 without using total
> > > > ordering
> > > > > > > by taking their reciprocal: 1.0/-0.0 is -inf. There are some
> > other
> > > > > > > ways to distinguish, I suspect, but that's the simplest one I
> > > recall
> > > > > > > at the moment.
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Reply via email to