On Fri, Feb 16, 2018 at 9:38 AM, Tim Armstrong <tarmstr...@cloudera.com>
wrote:

> The reader still can't correctly interpret those stats without knowing
> about the behaviour of that specific writer though, because it can't assume
> the absence of NaNs unless it knows that they are reading a file written by
> a writer that drops stats when it sees NaNs.
>
> It *could* fix the behaviour of some naive readers that don't correctly
> handle the current ambiguity in the specification, but I think those need
> to be fixed anyway because they will return wrong results for existing
> files.
>
> In the process of fixing the readers, you could then modify the readers so
> that they are aware of this special writer that drops stats with NaNs and
> knows that it is safe to use them, but I think those kind of shared
> reader-writer assumptions are essentially like having an unofficial
> extension of the Parquet spec.
>

Good point. I agree that summarizes the issues.

It is basically treating this issue like a bug in the writer (arguable). I
don't think we should require a rev in the Parquet spec to address bugs in
writers.
Working around bugs will always require checking the writer version in all
readers.

I'm certainly on board with the cleaner solution to adjust the spec - but
writers will always have bugs.

>
> On Fri, Feb 16, 2018 at 9:20 AM, Lars Volker <l...@cloudera.com> wrote:
>
> > Yeah, I missed that. We set it per column, so all other types could keep
> > TypeDefinedOrder and floats could have something like
> NanAwareDoubleOrder.
> >
> > On Fri, Feb 16, 2018 at 9:18 AM, Tim Armstrong <tarmstr...@cloudera.com>
> > wrote:
> >
> > > We wouldn't need to rev the whole TypeDefinedOrder thing right?
> Couldn't
> > we
> > > just define a special order for floats? Essentially it would be a tag
> for
> > > writers to say "hey I know about this total order thing".
> > >
> > > On Fri, Feb 16, 2018 at 9:14 AM, Lars Volker <l...@cloudera.com> wrote:
> > >
> > > > I think one idea behind the column order fields was that if a reader
> > does
> > > > not recognize a value there, it needs to ignore the stats. If I
> > remember
> > > > correctly, that was intended to allow us to add new orderings for
> > > > collations, but it also seems useful to address gaps in the spec or
> > known
> > > > broken readers. In this case we would need to deprecate the default
> > > > "TypeDefinedOrder" and replace it with something like
> > > > "TypeDefinedOrderWithCorrectOrderingForDoubles". We could also count
> > up,
> > > > like TypeDefinedOrderV2 and so on.
> > > >
> > > > An alternative would be to list all writers that are known to have
> > > written
> > > > incorrect stats. However that will not prevent old implementations to
> > > > misinterpret correct stats - which I think was the main reason why we
> > > added
> > > > new stats fields.
> > > >
> > > >
> > > >
> > > > On Fri, Feb 16, 2018 at 9:03 AM, Alexander Behm <
> > alex.b...@cloudera.com>
> > > > wrote:
> > > >
> > > > > I hope the common cases is that data files do not contain these
> > special
> > > > > float values. As the simplest solution, how about writers refrain
> > from
> > > > > populating the stats if a special value is encountered?
> > > > >
> > > > > That fix does not preclude a more thorough solution in the future,
> > but
> > > it
> > > > > addresses the common case quickly.
> > > > >
> > > > > For existing data files we could check the writer version ignore
> > > filters
> > > > on
> > > > > float/double. I don't know whether min/max filtering is common on
> > > > > float/double, but I suspect it's not.
> > > > >
> > > > > On Fri, Feb 16, 2018 at 8:38 AM, Tim Armstrong <
> > > tarmstr...@cloudera.com>
> > > > > wrote:
> > > > >
> > > > > > There is an extensibility mechanism with the ColumnOrder union -
> I
> > > > think
> > > > > > that was meant to avoid the need to add new stat fields?
> > > > > >
> > > > > > Given that the bug was in the Parquet spec, we'll need to make a
> > spec
> > > > > > change anyway, so we could add a new ColumnOrder -
> > > > > FloatingPointTotalOrder?
> > > > > > at the same time as fixing the gap in the spec.
> > > > > >
> > > > > > It could make sense to declare that the default ordering for
> > > > > floats/doubles
> > > > > > is not NaN-aware (i.e. the reader should assume that NaN was
> > > > arbitrarily
> > > > > > ordered) and readers should either implement the required logic
> to
> > > > handle
> > > > > > that correctly (I had some ideas here:
> > > > > > https://issues.apache.org/jira/browse/IMPALA-6527?
> > > > > > focusedCommentId=16366106&page=com.atlassian.jira.
> > > > > > plugin.system.issuetabpanels%3Acomment-tabpanel#comment-
> 16366106)
> > > > > > or ignore the stats.
> > > > > >
> > > > > > On Fri, Feb 16, 2018 at 8:15 AM, Jim Apple <jbap...@cloudera.com
> >
> > > > wrote:
> > > > > >
> > > > > > > > We could have a similar problem
> > > > > > > > with not finding +0.0 values because a -0.0 is written to the
> > > > > max_value
> > > > > > > > field by some component that considers them the same.
> > > > > > >
> > > > > > > My hope is that the filtering would behave sanely, since -0.0
> ==
> > > +0.0
> > > > > > > under the real-number-inspired ordering, which is distinguished
> > > from
> > > > > > > total Ordering, and which is also what you get when you use the
> > > > > > > default C/C++ operators <, >, <=, ==, and so on.
> > > > > > >
> > > > > > > You can distinguish between -0.0 and +0.0 without using total
> > > > ordering
> > > > > > > by taking their reciprocal: 1.0/-0.0 is -inf. There are some
> > other
> > > > > > > ways to distinguish, I suspect, but that's the simplest one I
> > > recall
> > > > > > > at the moment.
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Reply via email to