Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-16 Thread Jim Apple
On Fri, Feb 16, 2018 at 9:44 AM, Zoltan Borok-Nagy wrote:
> I would just like to mention that the fmax() / fmin() functions in C/C++
> Math library follow the aforementioned IEEE 754-2008 min and max
> specification:
> http://en.cppreference.com/w/c/numeric/math/fmax
>
> I think this behavior is also the most intuitive and useful with regard to
> statistics. If we want to select the max value, I think it's reasonable to
> ignore nulls and non-numbers.

It should be noted that this is different from the total ordering
predicate. With that predicate, -NaN < -inf < negative numbers < -0.0
< +0.0 < positive numbers < +inf < +NaN.

fmax appears to be closest to IEEE-754's maxNum, though it may not quite
match in some corner cases (-0.0, signalling NaN); I'm not 100% sure on
that.
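
For concreteness, here's a minimal sketch of one way to get that total order
in C++ by comparing bit patterns (just an illustration of the predicate
above, not an ordering mandated by the Parquet spec):

    #include <cstdint>
    #include <cstring>

    // Map a double to a uint64_t key whose unsigned ordering matches
    // -NaN < -inf < negatives < -0.0 < +0.0 < positives < +inf < +NaN.
    uint64_t TotalOrderKey(double d) {
      uint64_t bits;
      std::memcpy(&bits, &d, sizeof bits);          // raw IEEE 754 bits
      return (bits & 0x8000000000000000ULL)
                 ? ~bits                            // negative: flip all bits
                 : (bits | 0x8000000000000000ULL);  // positive: set the sign bit
    }

    bool TotalOrderLess(double a, double b) {
      return TotalOrderKey(a) < TotalOrderKey(b);
    }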


Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-16 Thread Alexander Behm
On Fri, Feb 16, 2018 at 9:38 AM, Tim Armstrong 
wrote:

> The reader still can't correctly interpret those stats without knowing
> about the behaviour of that specific writer, though, because it can't assume
> the absence of NaNs unless it knows that it is reading a file written by
> a writer that drops stats when it sees NaNs.
>
> It *could* fix the behaviour of some naive readers that don't correctly
> handle the current ambiguity in the specification, but I think those need
> to be fixed anyway because they will return wrong results for existing
> files.
>
> In the process of fixing the readers, you could then modify the readers so
> that they are aware of this special writer that drops stats with NaNs and
> know that it is safe to use them, but I think those kinds of shared
> reader-writer assumptions are essentially like having an unofficial
> extension of the Parquet spec.
>

Good point. I agree that summarizes the issues.

It basically treats this issue like a bug in the writer (which is arguable). I
don't think we should require a rev of the Parquet spec to address bugs in
writers.
Working around bugs will always require checking the writer version in all
readers.

I'm certainly on board with the cleaner solution of adjusting the spec - but
writers will always have bugs.


Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-16 Thread Zoltan Borok-Nagy
I would just like to mention that the fmax() / fmin() functions in C/C++
Math library follow the aforementioned IEEE 754-2008 min and max
specification:
http://en.cppreference.com/w/c/numeric/math/fmax

I think this behavior is also the most intuitive and useful with regard to
statistics. If we want to select the max value, I think it's reasonable to
ignore nulls and non-numbers.
I think it serves the common case well, too. E.g. if there are a lot of
numbers and some NaNs, I don't think we want to ruin our upper bound by
choosing NaN as the max.
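
A quick self-contained illustration of that behavior, assuming a C++11
compiler whose fmax/fmin implement IEEE 754-2008 maxNum/minNum:

    #include <cassert>
    #include <cmath>
    #include <limits>

    int main() {
      const double nan = std::numeric_limits<double>::quiet_NaN();
      // If exactly one argument is NaN, the non-NaN argument is returned.
      assert(std::fmax(nan, 1.0) == 1.0);
      assert(std::fmin(2.5, nan) == 2.5);
      // Only when both arguments are NaN does the result stay NaN.
      assert(std::isnan(std::fmax(nan, nan)));
      return 0;
    }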

And if there's still a NaN written by an old writer, we can treat it as Inf.

Just my two cents.


Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-16 Thread Tim Armstrong
The reader still can't correctly interpret those stats without knowing
about the behaviour of that specific writer, though, because it can't assume
the absence of NaNs unless it knows that it is reading a file written by
a writer that drops stats when it sees NaNs.

It *could* fix the behaviour of some naive readers that don't correctly
handle the current ambiguity in the specification, but I think those need
to be fixed anyway because they will return wrong results for existing
files.

In the process of fixing the readers, you could then modify the readers so
that they are aware of this special writer that drops stats with NaNs and
know that it is safe to use them, but I think those kinds of shared
reader-writer assumptions are essentially like having an unofficial
extension of the Parquet spec.


Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-16 Thread Lars Volker
Yeah, I missed that. We set it per column, so all other types could keep
TypeDefinedOrder and floats could have something like NanAwareDoubleOrder.

On Fri, Feb 16, 2018 at 9:18 AM, Tim Armstrong 
wrote:

> We wouldn't need to rev the whole TypeDefinedOrder thing right? Couldn't we
> just define a special order for floats? Essentially it would be a tag for
> writers to say "hey I know about this total order thing".


Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-16 Thread Alexander Behm
On Fri, Feb 16, 2018 at 9:15 AM, Tim Armstrong 
wrote:

> I don't see a major benefit to a temporary solution. The files are already
> out there and we need to implement a fix on the read path regardless. If we
> keep writing the stats there's at least some information contained in the
> stats that readers can make use of, if they want to implement the required
> logic.
>
> Dropping stats if a NaN is encountered also doesn't really address the
> other side of the problem - an absence of a NaN in the stats doesn't imply
> an absence of a NaN in the data, so the reader can't do anything useful
> with the stats anyway unless it's NaN-aware.
>

The writer solution is to only write stats if the data does not contain
special values (common case).


Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-16 Thread Tim Armstrong
We wouldn't need to rev the whole TypeDefinedOrder thing right? Couldn't we
just define a special order for floats? Essentially it would be a tag for
writers to say "hey I know about this total order thing".


Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-16 Thread Tim Armstrong
I don't see a major benefit to a temporary solution. The files are already
out there and we need to implement a fix on the read path regardless. If we
keep writing the stats there's at least some information contained in the
stats that readers can make use of, if they want to implement the required
logic.

Dropping stats if a NaN is encountered also doesn't really address the
other side of the problem - an absence of a NaN in the stats doesn't imply
an absence of a NaN in the data, so the reader can't do anything useful
with the stats anyway unless it's NaN-aware.


Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-16 Thread Lars Volker
I think one idea behind the column order fields was that if a reader does
not recognize a value there, it needs to ignore the stats. If I remember
correctly, that was intended to allow us to add new orderings for
collations, but it also seems useful to address gaps in the spec or known
broken readers. In this case we would need to deprecate the default
"TypeDefinedOrder" and replace it with something like
"TypeDefinedOrderWithCorrectOrderingForDoubles". We could also count up,
like TypeDefinedOrderV2 and so on.
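
As a rough sketch of that reader-side rule (the types and names below are
made up for illustration, not actual parquet-cpp or parquet-mr definitions):

    // Hypothetical sketch: ignore stats for any column order the reader
    // does not recognize.
    enum class ColumnOrderKind { TYPE_DEFINED_ORDER, TYPE_DEFINED_ORDER_V2, UNKNOWN };

    struct ColumnMeta {
      ColumnOrderKind order;
      bool has_min_max;
    };

    bool StatsUsable(const ColumnMeta& col) {
      switch (col.order) {
        case ColumnOrderKind::TYPE_DEFINED_ORDER:
        case ColumnOrderKind::TYPE_DEFINED_ORDER_V2:
          return col.has_min_max;  // known ordering: min/max are interpretable
        default:
          return false;            // unknown ordering: the stats must be ignored
      }
    }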

An alternative would be to list all writers that are known to have written
incorrect stats. However, that will not prevent old implementations from
misinterpreting correct stats - which I think was the main reason why we added
new stats fields.


Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-16 Thread Alexander Behm
I hope the common case is that data files do not contain these special
float values. As the simplest solution, how about writers refrain from
populating the stats if a special value is encountered?

That fix does not preclude a more thorough solution in the future, but it
addresses the common case quickly.
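
A sketch of what such writer-side suppression could look like (a hypothetical
helper, not Impala's or parquet-cpp's actual writer code):

    #include <algorithm>
    #include <cmath>
    #include <limits>

    // Track min/max over the column chunk, but suppress the stats entirely
    // if any NaN is seen, so no ambiguous bounds are ever written.
    struct FloatStatsBuilder {
      double min = std::numeric_limits<double>::infinity();
      double max = -std::numeric_limits<double>::infinity();
      bool suppress = false;

      void Update(double v) {
        if (std::isnan(v)) { suppress = true; return; }
        min = std::min(min, v);
        max = std::max(max, v);
      }

      bool ShouldWriteStats() const { return !suppress; }
    };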

For existing data files we could check the writer version and ignore filters on
float/double. I don't know whether min/max filtering is common on
float/double, but I suspect it's not.


Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-16 Thread Tim Armstrong
There is an extensibility mechanism with the ColumnOrder union - I think
that was meant to avoid the need to add new stat fields?

Given that the bug was in the Parquet spec, we'll need to make a spec
change anyway, so we could add a new ColumnOrder (FloatingPointTotalOrder?)
at the same time as fixing the gap in the spec.

It could make sense to declare that the default ordering for floats/doubles
is not NaN-aware (i.e. the reader should assume that NaN was arbitrarily
ordered) and that readers should either implement the required logic to handle
that correctly (I had some ideas here:
https://issues.apache.org/jira/browse/IMPALA-6527?focusedCommentId=16366106&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16366106)
or ignore the stats.
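
To make the reader-side option concrete, one conservative check could look
like the sketch below (my own illustration, not necessarily the approach in
the JIRA comment):

    #include <cmath>

    // Skip a row group for the predicate "lower <= col <= upper" only when
    // the stats are clearly usable. Stats produced by a NaN-unaware writer
    // may themselves be NaN; in that case they carry no information, so we
    // never skip. For a plain range predicate, rows containing NaN cannot
    // match anyway (every comparison against NaN is false), so pruning on
    // the non-NaN range is arguably still safe.
    bool CanSkipRowGroup(double stat_min, double stat_max,
                         double lower, double upper) {
      if (std::isnan(stat_min) || std::isnan(stat_max)) return false;
      return stat_max < lower || stat_min > upper;
    }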


Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-16 Thread Jim Apple
> We could have a similar problem
> with not finding +0.0 values because a -0.0 is written to the max_value
> field by some component that considers them the same.

My hope is that the filtering would behave sanely, since -0.0 == +0.0
under the real-number-inspired ordering, which is distinct from the
total ordering, and which is also what you get when you use the
default C/C++ operators <, >, <=, ==, and so on.

You can distinguish between -0.0 and +0.0 without using total ordering
by taking their reciprocal: 1.0/-0.0 is -inf. There are some other
ways to distinguish, I suspect, but that's the simplest one I recall
at the moment.
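
A small self-contained illustration of both points, assuming IEC 559 /
IEEE 754 doubles:

    #include <cassert>
    #include <cmath>
    #include <limits>

    int main() {
      const double pos_zero = 0.0, neg_zero = -0.0;
      // Under the ordinary (non-total) ordering the two zeros compare equal.
      assert(pos_zero == neg_zero && !(neg_zero < pos_zero));
      // The reciprocal trick tells them apart: 1.0 / -0.0 is -infinity.
      assert(1.0 / neg_zero == -std::numeric_limits<double>::infinity());
      assert(1.0 / pos_zero == +std::numeric_limits<double>::infinity());
      // Inspecting the sign bit (std::signbit) is another way.
      assert(std::signbit(neg_zero) && !std::signbit(pos_zero));
      return 0;
    }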


Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-16 Thread Zoltan Ivanfi
Hi,

I don't think it would be worth keeping a separate NaN count, but we could
ignore NaNs when calculating min/max stats regardless. However, NaN is not
the only value preventing a total ordering. We could have a similar problem
with not finding +0.0 values because a -0.0 is written to the max_value
field by some component that considers them the same.
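
For reference, the separate-NaN-count idea (from Tim's message quoted below)
would amount to roughly this - the field names are made up, not part of the
Parquet format:

    #include <cstdint>

    struct FloatColumnStats {
      double min_value;   // min over non-NaN values only
      double max_value;   // max over non-NaN values only
      int64_t nan_count;  // or simply: bool has_nan;
    };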

One more thing to consider is how to deal with existing data when we update
the specs. One possibility is to specify legacy rules, like "if the stats
contain NaN and the file was written by Impala, it should be ignored", but
that would complicate the specs and be a burden on implementors. In fact,
min_value and max_value were introduced because we did not want to define
similar legacy rules for min and max. So should we deprecate min_value and
max_value as well and introduce yet_another_min and yet_another_max fields
instead (with nicer names, naturally)?

Br,

Zoltan

On Thu, Feb 15, 2018 at 8:01 PM Tim Armstrong 
wrote:

> We could also consider treating NaN similarly to NULL and having a separate
> piece of information with a count of NaN values (or just a bit indicating
> presence/absence of NaN). I'm not sure if that is easier or harder to
> implement than a total order.
>
> On Thu, Feb 15, 2018 at 9:12 AM, Laszlo Gaal 
> wrote:
>
> > To supply some context: Impala has had a number of issues
> > around NaN/infinity:
> >
> > The closest precedent related to the current issue seems to be IMPALA-6295
> > <https://issues.apache.org/jira/browse/IMPALA-6295>: "Inconsistent handling
> > of 'nan' and 'inf' with min/max analytic fns": the discussion there offers
> > notable points on:
> > 1. How Impala handles similar problems in different (but related) areas,
> > 2. How other database products (Hive, PostgreSQL, etc.) handle similar
> > issues around NaNs/infinity (or infinities, in the case of IEEE-754).
> >
> > Thanks,
> >
> > - LaszloG
> >
> >
> > On Thu, Feb 15, 2018 at 5:10 PM, Zoltan Ivanfi  wrote:
> >
> > > Dear Parquet and Impala Developers,
> > >
> > > We have exposed min/max statistics to extensive compatibility testing and
> > > found troubling inconsistencies regarding float and double values. Under
> > > certain (fortunately rather extreme) circumstances, this can lead to
> > > predicate pushdown incorrectly discarding row groups that contain matching
> > > rows.
> > >
> > > The root of the problem seems to be that Impala (and probably parquet-cpp
> > > as well) uses C++ comparison operators for floating point numbers and those
> > > do not provide a total ordering. This is actually in line with IEEE 754,
> > > according to which -0 is neither less nor more than +0 and comparing NaN to
> > > anything always returns false. This, however, is not suitable for statistics
> > > and can lead to serious consequences that you can read more about in
> > > IMPALA-6527 <https://issues.apache.org/jira/browse/IMPALA-6527>.
> > >
> > > The IEEE 754 standard and the Java API, on the other hand, both provide a
> > > total ordering, but I'm not sure whether the two are the same. The Java
> > > implementation looks relatively simple - both easy to understand and
> > > efficient to execute. You can check it here
> > > <.../687fd7c7986d/src/share/classes/java/lang/Double.java#l999>.
> > > The IEEE 754 total ordering, on the other hand, looks complicated enough
> > > that I cannot decide whether the Java implementation adheres to it. I
> > > couldn't find the whole standard online, but I found an excerpt about
> > > the totalOrder predicate. Additionally, I have also found that IEEE 754-2008
> > > defines min and max operations that strangely *do not* adhere to a total
> > > ordering.
> > >
> > > I checked the specification in parquet-format but all I could find about
> > > floating point numbers is the following:
> > >
> > >*   FLOAT - signed comparison of the represented value
> > >*   DOUBLE - signed comparison of the represented value
> > >
> > > I suggest extending the specification to explicitly require implementations
> > > to follow a specific comparison logic for these types. The candidates are:
> > >
> > >   - The Java implementation
> > >     <.../687fd7c7986d/src/share/classes/java/lang/Double.java#l999>,
> > >     which looks easy and efficient to implement in any language.
> > >   - The IEEE 754 totalOrder predicate <.../rust/issues/5585>, which
> > >     honestly looks rather scary.
> > >   - The IEEE 754-2008 min and max