Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Tim Armstrong Fri, 16 Feb 2018 09:16:38 -0800

I don't see a major benefit to a temporary solution. The files are already
out there and we need to implement a fix on the read path regardless. If we
keep writing the stats there's at least some information contained in the
stats that readers can make use of, if they want to implement the required
logic.


Dropping stats if a NaN is encountered also doesn't really address the
other side of the problem - an absence of a NaN in the stats doesn't imply
an absence of a NaN in the data, so the reader can't do anything useful
with the stats anyway unless it's NaN-aware.
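
To make that second point concrete, here is a minimal Python sketch of one
way it can happen. The `naive_min` loop is a hypothetical writer, not code
from any actual Parquet implementation; it only assumes the writer tracks
the minimum with an ordinary `<` comparison.

```python
import math


def naive_min(values):
    """Running minimum as a hypothetical writer might compute it."""
    current = values[0]
    for v in values[1:]:
        # Every ordered comparison with NaN is False, so a NaN replaces
        # `current`, and the next ordinary value then replaces the NaN.
        current = current if current < v else v
    return current


data = [1.0, float("nan"), 5.0]
stat_min = naive_min(data)

# The stat holds no NaN, yet it is above a value that is really in the data.
print(stat_min, math.isnan(stat_min), min(v for v in data if not math.isnan(v)))
```

A reader that trusted this minimum would wrongly skip the data when looking
for 1.0, even though the stat itself contains no NaN.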

On Fri, Feb 16, 2018 at 9:03 AM, Alexander Behm <alex.b...@cloudera.com>
wrote:

> I hope the common case is that data files do not contain these special
> float values. As the simplest solution, how about writers refrain from
> populating the stats if a special value is encountered?
>
> That fix does not preclude a more thorough solution in the future, but it
> addresses the common case quickly.
>
> For existing data files we could check the writer version and ignore
> filters on float/double. I don't know whether min/max filtering is common
> on float/double, but I suspect it's not.
>
> On Fri, Feb 16, 2018 at 8:38 AM, Tim Armstrong <tarmstr...@cloudera.com>
> wrote:
>
> > There is an extensibility mechanism with the ColumnOrder union - I think
> > that was meant to avoid the need to add new stat fields?
> >
> > Given that the bug was in the Parquet spec, we'll need to make a spec
> > change anyway, so we could add a new ColumnOrder - FloatingPointTotalOrder?
> > at the same time as fixing the gap in the spec.
> >
> > It could make sense to declare that the default ordering for floats/doubles
> > is not NaN-aware (i.e. the reader should assume that NaN was arbitrarily
> > ordered) and readers should either implement the required logic to handle
> > that correctly (I had some ideas here:
> > https://issues.apache.org/jira/browse/IMPALA-6527?focusedCommentId=16366106&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16366106)
> > or ignore the stats.
> >
> > On Fri, Feb 16, 2018 at 8:15 AM, Jim Apple <jbap...@cloudera.com> wrote:
> >
> > > > We could have a similar problem
> > > > with not finding +0.0 values because a -0.0 is written to the max_value
> > > > field by some component that considers them the same.
> > >
> > > My hope is that the filtering would behave sanely, since -0.0 == +0.0
> > > under the real-number-inspired ordering, which is distinguished from
> > > total ordering, and which is also what you get when you use the
> > > default C/C++ operators <, >, <=, ==, and so on.
> > >
> > > You can distinguish between -0.0 and +0.0 without using total ordering
> > > by taking their reciprocal: 1.0/-0.0 is -inf. There are some other
> > > ways to distinguish, I suspect, but that's the simplest one I recall
> > > at the moment.
> > >
> >
>
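
As an illustration of the "implement the required logic ... or ignore the
stats" choice quoted above, here is a minimal Python sketch of a cautious
equality filter. It is not the logic from the linked JIRA comment; the
function name `may_contain` and its behaviour are assumptions made for this
example. It also assumes that a non-NaN bound can still be trusted, which
may not hold for files whose writer ordered NaN arbitrarily; in that case
ignoring the stats altogether is the only safe choice.

```python
import math


def may_contain(min_value, max_value, target):
    """Return False only when the stats rule out a value equal to `target`."""
    # Stats built with ordinary comparisons say nothing about NaN values,
    # so a search for NaN can never be pruned.
    if math.isnan(target):
        return True
    # A NaN bound carries no information; treat that side as unbounded.
    lower_ok = math.isnan(min_value) or min_value <= target
    upper_ok = math.isnan(max_value) or target <= max_value
    return lower_ok and upper_ok


nan = float("nan")
print(may_contain(1.0, 3.0, 2.0))   # inside the range
print(may_contain(1.0, 3.0, 4.0))   # above the max
print(may_contain(nan, 3.0, 0.5))   # min unusable, max still applied
```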

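The signed-zero behaviour described in the oldest message of the thread can
be checked with a short Python sketch. Python raises ZeroDivisionError for
1.0 / -0.0, so the reciprocal trick is not usable as written there; this
sketch swaps in `math.copysign`, which reads the sign bit instead. The
helper name `is_negative_zero` is made up for the example.

```python
import math

neg_zero = -0.0
pos_zero = 0.0

# Ordinary IEEE 754 comparison treats the two zeros as equal ...
zeros_equal = neg_zero == pos_zero
# ... so a max_value of -0.0 does not exclude a search for +0.0.
pos_zero_within_max = pos_zero <= neg_zero


def is_negative_zero(x):
    """True only for -0.0; copysign copies the sign bit of x onto 1.0."""
    return x == 0.0 and math.copysign(1.0, x) < 0.0


print(zeros_equal, pos_zero_within_max)
print(is_negative_zero(neg_zero), is_negative_zero(pos_zero))
```

A filter built on `<=` and `==` therefore keeps working across the two
zeros; only code that compares bit patterns or uses a total order would
tell them apart.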