Well, Inf and -Inf are already ordered ;-)  Nan is, as usual, a can of
worms.  Ordering probably doesn't belong in the Arrow spec (which is
only concerned with representing data, not processing it).

In any case, I agree that it makes sense to handle all NaNs as equal
when implementing comparison-based functions: we're trying to do that on
the C++ side.  It also makes sense to choose a well-defined ordering for
them (for example "consider all NaNs smaller than non-NaNs", or larger).

In some cases, you may also want to let the user alter the behaviour.
For example, when summing an array, you may want a NaN in the input to
force the result to NaN (as the IEEE spec would say), or you may want
NaNs to be ignored.  NumPy has two functions (`sum` and `nansum`) for
these two different behaviours.

Regards

Antoine.


Le 02/09/2020 à 11:40, Matthias Vallentin a écrit :
> Would it perhaps make sense to define the total order for non-numbers (NaN,
> Inf, -Inf) globally (i.e., in the spec or in Arrow itself) so that the
> behavior is the same across all languages?
> 
> On Fri, Aug 28, 2020 at 7:42 PM Andy Grove <andygrov...@gmail.com> wrote:
> 
>> Hi Jörn,
>>
>> I agree with your concerns about NaN. There was a discussion about this in
>> https://github.com/apache/arrow/pull/7193
>>
>> I will try and make time this weekend to look at the current implementation
>> and your suggestions around DictionaryArray.
>>
>> Hopefully, other contributors that are more familiar with that code can
>> respond as well.
>>
>> Thanks,
>>
>> Andy.
>>
>>
>>
>> On Fri, Aug 28, 2020 at 8:34 AM Jörn Horstmann <
>> joern.horstm...@signavio.com>
>> wrote:
>>
>>> I ran into a few issues with the rust sort kernels and would like to
>>> discuss possible solutions.
>>>
>>> 1. When sorting by multiple columns (lexsort_to_indices) the Float32
>>> and Float64 data types are not supported because the implementation
>>> relies on the OrdArray trait. This trait is not implemented because
>>> f64/f32 only implements PartialOrd. The sort function for a single
>>> column (sort_to_indices) has some special logic which looks like it
>>> wants to treats NaN the same as null, but I'm also not convinced this
>>> is the correct way. For example postgres does the following
>>> (https://www.postgresql.org/docs/12/datatype-numeric.html#DATATYPE-FLOAT
>> )
>>>
>>> "In order to allow floating-point values to be sorted and used in
>>> tree-based indexes, PostgreSQL treats NaN values as equal, and greater
>>> than all non-NaN values."
>>>
>>> I propose to do the same in an OrdArray impl for
>>> Float64Array/Float32Array and then simplifying the sort_to_indices
>>> function accordingly.
>>>
>>> 2. Sorting for dictionary encoded strings. The problem here is that
>>> DictionaryArray does not have a generic parameter for the value type
>>> so it is not currently possible to only implement OrdArray for string
>>> dictionaries. Again for the single column case, the value data type
>>> could be checked and a sort could be implemented by looking up each
>>> key in the dictionary. An optimization could be to check the is_sorted
>>> flag of DictionaryArray (which does not seem to be used really) and
>>> then directly sort by the keys. For the general case I see roughly to
>>> options
>>>
>>> - Somehow implement an OrdArray view of the dictionary array. This
>>> could be easier if OrdArray did not extend Array but was a completely
>>> separate trait.
>>> - Change the lexicographic sort impl to not use dynamic calls but
>>> instead sort multiple times. So for a query `ORDER BY a, b`, first
>>> sort by b and afterwards sort again by a. With a stable sort
>>> implementation this should result in the same ordering. I'm curious
>>> about the performance, it could avoid dynamic method calls for each
>>> comparison, but it would process the indices vector multiple times.
>>>
>>> Looking forward to any other suggestions or feedback.
>>>
>>> --
>>> Jörn Horstmann | Senior Backend Engineer
>>>
>>> www.signavio.com
>>> Kurfürstenstraße 111, 10787 Berlin, Germany
>>>
>>
> 

Reply via email to