Hi Jörn,

I agree with your concerns about NaN. There was a discussion about this in
https://github.com/apache/arrow/pull/7193

I will try and make time this weekend to look at the current implementation
and your suggestions around DictionaryArray.

Hopefully, other contributors that are more familiar with that code can
respond as well.

Thanks,

Andy.



On Fri, Aug 28, 2020 at 8:34 AM Jörn Horstmann <joern.horstm...@signavio.com>
wrote:

> I ran into a few issues with the rust sort kernels and would like to
> discuss possible solutions.
>
> 1. When sorting by multiple columns (lexsort_to_indices) the Float32
> and Float64 data types are not supported because the implementation
> relies on the OrdArray trait. This trait is not implemented because
> f64/f32 only implements PartialOrd. The sort function for a single
> column (sort_to_indices) has some special logic which looks like it
> wants to treats NaN the same as null, but I'm also not convinced this
> is the correct way. For example postgres does the following
> (https://www.postgresql.org/docs/12/datatype-numeric.html#DATATYPE-FLOAT)
>
> "In order to allow floating-point values to be sorted and used in
> tree-based indexes, PostgreSQL treats NaN values as equal, and greater
> than all non-NaN values."
>
> I propose to do the same in an OrdArray impl for
> Float64Array/Float32Array and then simplifying the sort_to_indices
> function accordingly.
>
> 2. Sorting for dictionary encoded strings. The problem here is that
> DictionaryArray does not have a generic parameter for the value type
> so it is not currently possible to only implement OrdArray for string
> dictionaries. Again for the single column case, the value data type
> could be checked and a sort could be implemented by looking up each
> key in the dictionary. An optimization could be to check the is_sorted
> flag of DictionaryArray (which does not seem to be used really) and
> then directly sort by the keys. For the general case I see roughly to
> options
>
> - Somehow implement an OrdArray view of the dictionary array. This
> could be easier if OrdArray did not extend Array but was a completely
> separate trait.
> - Change the lexicographic sort impl to not use dynamic calls but
> instead sort multiple times. So for a query `ORDER BY a, b`, first
> sort by b and afterwards sort again by a. With a stable sort
> implementation this should result in the same ordering. I'm curious
> about the performance, it could avoid dynamic method calls for each
> comparison, but it would process the indices vector multiple times.
>
> Looking forward to any other suggestions or feedback.
>
> --
> Jörn Horstmann | Senior Backend Engineer
>
> www.signavio.com
> Kurfürstenstraße 111, 10787 Berlin, Germany
>

Reply via email to