Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Tim Armstrong Fri, 16 Feb 2018 09:19:13 -0800

We wouldn't need to rev the whole TypeDefinedOrder thing right? Couldn't we
just define a special order for floats? Essentially it would be a tag for
writers to say "hey I know about this total order thing".


On Fri, Feb 16, 2018 at 9:14 AM, Lars Volker <[email protected]> wrote:

> I think one idea behind the column order fields was that if a reader does
> not recognize a value there, it needs to ignore the stats. If I remember
> correctly, that was intended to allow us to add new orderings for
> collations, but it also seems useful to address gaps in the spec or known
> broken readers. In this case we would need to deprecate the default
> "TypeDefinedOrder" and replace it with something like
> "TypeDefinedOrderWithCorrectOrderingForDoubles". We could also count up,
> like TypeDefinedOrderV2 and so on.
>
> An alternative would be to list all writers that are known to have written
> incorrect stats. However that will not prevent old implementations to
> misinterpret correct stats - which I think was the main reason why we added
> new stats fields.
>
>
>
> On Fri, Feb 16, 2018 at 9:03 AM, Alexander Behm <[email protected]>
> wrote:
>
> > I hope the common cases is that data files do not contain these special
> > float values. As the simplest solution, how about writers refrain from
> > populating the stats if a special value is encountered?
> >
> > That fix does not preclude a more thorough solution in the future, but it
> > addresses the common case quickly.
> >
> > For existing data files we could check the writer version ignore filters
> on
> > float/double. I don't know whether min/max filtering is common on
> > float/double, but I suspect it's not.
> >
> > On Fri, Feb 16, 2018 at 8:38 AM, Tim Armstrong <[email protected]>
> > wrote:
> >
> > > There is an extensibility mechanism with the ColumnOrder union - I
> think
> > > that was meant to avoid the need to add new stat fields?
> > >
> > > Given that the bug was in the Parquet spec, we'll need to make a spec
> > > change anyway, so we could add a new ColumnOrder -
> > FloatingPointTotalOrder?
> > > at the same time as fixing the gap in the spec.
> > >
> > > It could make sense to declare that the default ordering for
> > floats/doubles
> > > is not NaN-aware (i.e. the reader should assume that NaN was
> arbitrarily
> > > ordered) and readers should either implement the required logic to
> handle
> > > that correctly (I had some ideas here:
> > > https://issues.apache.org/jira/browse/IMPALA-6527?
> > > focusedCommentId=16366106&page=com.atlassian.jira.
> > > plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16366106)
> > > or ignore the stats.
> > >
> > > On Fri, Feb 16, 2018 at 8:15 AM, Jim Apple <[email protected]>
> wrote:
> > >
> > > > > We could have a similar problem
> > > > > with not finding +0.0 values because a -0.0 is written to the
> > max_value
> > > > > field by some component that considers them the same.
> > > >
> > > > My hope is that the filtering would behave sanely, since -0.0 == +0.0
> > > > under the real-number-inspired ordering, which is distinguished from
> > > > total Ordering, and which is also what you get when you use the
> > > > default C/C++ operators <, >, <=, ==, and so on.
> > > >
> > > > You can distinguish between -0.0 and +0.0 without using total
> ordering
> > > > by taking their reciprocal: 1.0/-0.0 is -inf. There are some other
> > > > ways to distinguish, I suspect, but that's the simplest one I recall
> > > > at the moment.
> > > >
> > >
> >
>

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Reply via email to