Yeah, I missed that. We set it per column, so all other types could keep TypeDefinedOrder and floats could have something like NanAwareDoubleOrder.
On Fri, Feb 16, 2018 at 9:18 AM, Tim Armstrong <tarmstr...@cloudera.com> wrote: > We wouldn't need to rev the whole TypeDefinedOrder thing right? Couldn't we > just define a special order for floats? Essentially it would be a tag for > writers to say "hey I know about this total order thing". > > On Fri, Feb 16, 2018 at 9:14 AM, Lars Volker <l...@cloudera.com> wrote: > > > I think one idea behind the column order fields was that if a reader does > > not recognize a value there, it needs to ignore the stats. If I remember > > correctly, that was intended to allow us to add new orderings for > > collations, but it also seems useful to address gaps in the spec or known > > broken readers. In this case we would need to deprecate the default > > "TypeDefinedOrder" and replace it with something like > > "TypeDefinedOrderWithCorrectOrderingForDoubles". We could also count up, > > like TypeDefinedOrderV2 and so on. > > > > An alternative would be to list all writers that are known to have > written > > incorrect stats. However that will not prevent old implementations to > > misinterpret correct stats - which I think was the main reason why we > added > > new stats fields. > > > > > > > > On Fri, Feb 16, 2018 at 9:03 AM, Alexander Behm <alex.b...@cloudera.com> > > wrote: > > > > > I hope the common cases is that data files do not contain these special > > > float values. As the simplest solution, how about writers refrain from > > > populating the stats if a special value is encountered? > > > > > > That fix does not preclude a more thorough solution in the future, but > it > > > addresses the common case quickly. > > > > > > For existing data files we could check the writer version ignore > filters > > on > > > float/double. I don't know whether min/max filtering is common on > > > float/double, but I suspect it's not. > > > > > > On Fri, Feb 16, 2018 at 8:38 AM, Tim Armstrong < > tarmstr...@cloudera.com> > > > wrote: > > > > > > > There is an extensibility mechanism with the ColumnOrder union - I > > think > > > > that was meant to avoid the need to add new stat fields? > > > > > > > > Given that the bug was in the Parquet spec, we'll need to make a spec > > > > change anyway, so we could add a new ColumnOrder - > > > FloatingPointTotalOrder? > > > > at the same time as fixing the gap in the spec. > > > > > > > > It could make sense to declare that the default ordering for > > > floats/doubles > > > > is not NaN-aware (i.e. the reader should assume that NaN was > > arbitrarily > > > > ordered) and readers should either implement the required logic to > > handle > > > > that correctly (I had some ideas here: > > > > https://issues.apache.org/jira/browse/IMPALA-6527? > > > > focusedCommentId=16366106&page=com.atlassian.jira. > > > > plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16366106) > > > > or ignore the stats. > > > > > > > > On Fri, Feb 16, 2018 at 8:15 AM, Jim Apple <jbap...@cloudera.com> > > wrote: > > > > > > > > > > We could have a similar problem > > > > > > with not finding +0.0 values because a -0.0 is written to the > > > max_value > > > > > > field by some component that considers them the same. > > > > > > > > > > My hope is that the filtering would behave sanely, since -0.0 == > +0.0 > > > > > under the real-number-inspired ordering, which is distinguished > from > > > > > total Ordering, and which is also what you get when you use the > > > > > default C/C++ operators <, >, <=, ==, and so on. > > > > > > > > > > You can distinguish between -0.0 and +0.0 without using total > > ordering > > > > > by taking their reciprocal: 1.0/-0.0 is -inf. There are some other > > > > > ways to distinguish, I suspect, but that's the simplest one I > recall > > > > > at the moment. > > > > > > > > > > > > > > >