I agree that users should have the capability to determine their own way,
but with Arrow going more into the direction of providing compute
building blocks (kernels), a choice must be made, e.g., when it comes to
sorting, computing a mean, etc.

Certainly -Inf and Inf are easy to put in the row, but NaN is the odd one
out. At the very left, right, between -0 and 0? There are many
possibilities, and I'm sure one could argue in depth about what makes most
sense. I think as long as the total order is explicit, users can work with
it.

I'm not sure how easy it will be to give the user choice wrt. to the
ordering. If the API allows for specifying a custom comparison function,
then the flexibility is there. Would that be possible with Gandiva?

On Wed, Sep 2, 2020 at 11:50 AM Antoine Pitrou <anto...@python.org> wrote:

>
> Well, Inf and -Inf are already ordered ;-)  Nan is, as usual, a can of
> worms.  Ordering probably doesn't belong in the Arrow spec (which is
> only concerned with representing data, not processing it).
>
> In any case, I agree that it makes sense to handle all NaNs as equal
> when implementing comparison-based functions: we're trying to do that on
> the C++ side.  It also makes sense to choose a well-defined ordering for
> them (for example "consider all NaNs smaller than non-NaNs", or larger).
>
> In some cases, you may also want to let the user alter the behaviour.
> For example, when summing an array, you may want a NaN in the input to
> force the result to NaN (as the IEEE spec would say), or you may want
> NaNs to be ignored.  NumPy has two functions (`sum` and `nansum`) for
> these two different behaviours.
>
> Regards
>
> Antoine.
>
>
> Le 02/09/2020 à 11:40, Matthias Vallentin a écrit :
> > Would it perhaps make sense to define the total order for non-numbers
> (NaN,
> > Inf, -Inf) globally (i.e., in the spec or in Arrow itself) so that the
> > behavior is the same across all languages?
> >
> > On Fri, Aug 28, 2020 at 7:42 PM Andy Grove <andygrov...@gmail.com>
> wrote:
> >
> >> Hi Jörn,
> >>
> >> I agree with your concerns about NaN. There was a discussion about this
> in
> >> https://github.com/apache/arrow/pull/7193
> >>
> >> I will try and make time this weekend to look at the current
> implementation
> >> and your suggestions around DictionaryArray.
> >>
> >> Hopefully, other contributors that are more familiar with that code can
> >> respond as well.
> >>
> >> Thanks,
> >>
> >> Andy.
> >>
> >>
> >>
> >> On Fri, Aug 28, 2020 at 8:34 AM Jörn Horstmann <
> >> joern.horstm...@signavio.com>
> >> wrote:
> >>
> >>> I ran into a few issues with the rust sort kernels and would like to
> >>> discuss possible solutions.
> >>>
> >>> 1. When sorting by multiple columns (lexsort_to_indices) the Float32
> >>> and Float64 data types are not supported because the implementation
> >>> relies on the OrdArray trait. This trait is not implemented because
> >>> f64/f32 only implements PartialOrd. The sort function for a single
> >>> column (sort_to_indices) has some special logic which looks like it
> >>> wants to treats NaN the same as null, but I'm also not convinced this
> >>> is the correct way. For example postgres does the following
> >>> (
> https://www.postgresql.org/docs/12/datatype-numeric.html#DATATYPE-FLOAT
> >> )
> >>>
> >>> "In order to allow floating-point values to be sorted and used in
> >>> tree-based indexes, PostgreSQL treats NaN values as equal, and greater
> >>> than all non-NaN values."
> >>>
> >>> I propose to do the same in an OrdArray impl for
> >>> Float64Array/Float32Array and then simplifying the sort_to_indices
> >>> function accordingly.
> >>>
> >>> 2. Sorting for dictionary encoded strings. The problem here is that
> >>> DictionaryArray does not have a generic parameter for the value type
> >>> so it is not currently possible to only implement OrdArray for string
> >>> dictionaries. Again for the single column case, the value data type
> >>> could be checked and a sort could be implemented by looking up each
> >>> key in the dictionary. An optimization could be to check the is_sorted
> >>> flag of DictionaryArray (which does not seem to be used really) and
> >>> then directly sort by the keys. For the general case I see roughly to
> >>> options
> >>>
> >>> - Somehow implement an OrdArray view of the dictionary array. This
> >>> could be easier if OrdArray did not extend Array but was a completely
> >>> separate trait.
> >>> - Change the lexicographic sort impl to not use dynamic calls but
> >>> instead sort multiple times. So for a query `ORDER BY a, b`, first
> >>> sort by b and afterwards sort again by a. With a stable sort
> >>> implementation this should result in the same ordering. I'm curious
> >>> about the performance, it could avoid dynamic method calls for each
> >>> comparison, but it would process the indices vector multiple times.
> >>>
> >>> Looking forward to any other suggestions or feedback.
> >>>
> >>> --
> >>> Jörn Horstmann | Senior Backend Engineer
> >>>
> >>> www.signavio.com
> >>> Kurfürstenstraße 111, 10787 Berlin, Germany
> >>>
> >>
> >
>

Reply via email to