Well, Inf and -Inf are already ordered ;-) Nan is, as usual, a can of worms. Ordering probably doesn't belong in the Arrow spec (which is only concerned with representing data, not processing it).
In any case, I agree that it makes sense to handle all NaNs as equal when implementing comparison-based functions: we're trying to do that on the C++ side. It also makes sense to choose a well-defined ordering for them (for example "consider all NaNs smaller than non-NaNs", or larger). In some cases, you may also want to let the user alter the behaviour. For example, when summing an array, you may want a NaN in the input to force the result to NaN (as the IEEE spec would say), or you may want NaNs to be ignored. NumPy has two functions (`sum` and `nansum`) for these two different behaviours. Regards Antoine. Le 02/09/2020 à 11:40, Matthias Vallentin a écrit : > Would it perhaps make sense to define the total order for non-numbers (NaN, > Inf, -Inf) globally (i.e., in the spec or in Arrow itself) so that the > behavior is the same across all languages? > > On Fri, Aug 28, 2020 at 7:42 PM Andy Grove <andygrov...@gmail.com> wrote: > >> Hi Jörn, >> >> I agree with your concerns about NaN. There was a discussion about this in >> https://github.com/apache/arrow/pull/7193 >> >> I will try and make time this weekend to look at the current implementation >> and your suggestions around DictionaryArray. >> >> Hopefully, other contributors that are more familiar with that code can >> respond as well. >> >> Thanks, >> >> Andy. >> >> >> >> On Fri, Aug 28, 2020 at 8:34 AM Jörn Horstmann < >> joern.horstm...@signavio.com> >> wrote: >> >>> I ran into a few issues with the rust sort kernels and would like to >>> discuss possible solutions. >>> >>> 1. When sorting by multiple columns (lexsort_to_indices) the Float32 >>> and Float64 data types are not supported because the implementation >>> relies on the OrdArray trait. This trait is not implemented because >>> f64/f32 only implements PartialOrd. The sort function for a single >>> column (sort_to_indices) has some special logic which looks like it >>> wants to treats NaN the same as null, but I'm also not convinced this >>> is the correct way. For example postgres does the following >>> (https://www.postgresql.org/docs/12/datatype-numeric.html#DATATYPE-FLOAT >> ) >>> >>> "In order to allow floating-point values to be sorted and used in >>> tree-based indexes, PostgreSQL treats NaN values as equal, and greater >>> than all non-NaN values." >>> >>> I propose to do the same in an OrdArray impl for >>> Float64Array/Float32Array and then simplifying the sort_to_indices >>> function accordingly. >>> >>> 2. Sorting for dictionary encoded strings. The problem here is that >>> DictionaryArray does not have a generic parameter for the value type >>> so it is not currently possible to only implement OrdArray for string >>> dictionaries. Again for the single column case, the value data type >>> could be checked and a sort could be implemented by looking up each >>> key in the dictionary. An optimization could be to check the is_sorted >>> flag of DictionaryArray (which does not seem to be used really) and >>> then directly sort by the keys. For the general case I see roughly to >>> options >>> >>> - Somehow implement an OrdArray view of the dictionary array. This >>> could be easier if OrdArray did not extend Array but was a completely >>> separate trait. >>> - Change the lexicographic sort impl to not use dynamic calls but >>> instead sort multiple times. So for a query `ORDER BY a, b`, first >>> sort by b and afterwards sort again by a. With a stable sort >>> implementation this should result in the same ordering. I'm curious >>> about the performance, it could avoid dynamic method calls for each >>> comparison, but it would process the indices vector multiple times. >>> >>> Looking forward to any other suggestions or feedback. >>> >>> -- >>> Jörn Horstmann | Senior Backend Engineer >>> >>> www.signavio.com >>> Kurfürstenstraße 111, 10787 Berlin, Germany >>> >> >