Hi,
There are a couple of options here. To sort the rows of a RecordBatch by
a single column, the fastest approach is to call sort_to_indices [1] on
that column and then use the take [2] kernel on each column to reorder
its rows accordingly. These kernels have specialized implementations for
each of the array types, which makes this fairly performant.
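To illustrate the shape of that approach, here is a minimal sketch in plain Rust with no arrow dependency. The helpers `argsort` and `gather` are hypothetical stand-ins I made up for this example; they only mimic what sort_to_indices and take do conceptually, and are not the arrow functions or their signatures.

```rust
/// Hypothetical stand-in for a "sort to indices" kernel: returns the
/// permutation that puts `keys` in ascending order (stable sort).
fn argsort<T: Ord>(keys: &[T]) -> Vec<usize> {
    let mut indices: Vec<usize> = (0..keys.len()).collect();
    indices.sort_by(|&a, &b| keys[a].cmp(&keys[b]));
    indices
}

/// Hypothetical stand-in for a "take" kernel: picks `values[i]` for
/// every `i` in `indices`, in that order.
fn gather<T: Clone>(values: &[T], indices: &[usize]) -> Vec<T> {
    indices.iter().map(|&i| values[i].clone()).collect()
}

fn main() {
    // Two "columns" of the same length; `ids` is the sort key.
    let ids = vec![3, 1, 2];
    let names = vec!["c", "a", "b"];

    // Compute the permutation once, then apply it to every column.
    let indices = argsort(&ids);
    let sorted_ids = gather(&ids, &indices);
    let sorted_names = gather(&names, &indices);

    println!("{:?} {:?}", sorted_ids, sorted_names);
}
```

The point is that the permutation is computed once from the key column and then reused for every other column, so the rows stay aligned.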
If you want to sort lexicographically by multiple columns, there are the
lexsort [3] and lexsort_to_indices [4] kernels, which return the sorted
arrays and the sorting indices respectively. However, for multi-column
sorts without a limit it will likely be faster to convert the columns to
the row format [5], and then either use the rows to perform a lexsort to
indices [6], or sort the rows in place and convert them back to arrow
arrays [7]. There is more information about the row format in the blog
series on the topic [8], which may be of interest.
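To give a rough idea of why the row format helps, here is a simplified sketch in plain Rust with no arrow dependency. It assumes two non-null columns (an i32 and a u16) and a made-up encoding that is not the actual arrow-row layout, which also handles nulls, variable-length types and sort options. The function names are hypothetical.

```rust
/// Encode an i32 so that unsigned bytewise comparison of the result
/// matches signed numeric comparison: flip the sign bit, then write
/// the value big-endian.
fn encode_i32(v: i32) -> [u8; 4] {
    ((v as u32) ^ 0x8000_0000).to_be_bytes()
}

/// Build one byte string per row by concatenating the encoded columns.
/// Comparing two rows as bytes is then a lexicographic comparison by
/// (a, b).
fn to_rows(a: &[i32], b: &[u16]) -> Vec<Vec<u8>> {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| {
            let mut row = Vec::with_capacity(6);
            row.extend_from_slice(&encode_i32(x));
            row.extend_from_slice(&y.to_be_bytes());
            row
        })
        .collect()
}

/// Hypothetical "lexsort to indices": sort row indices by comparing
/// the encoded rows as plain byte slices.
fn lexsort_indices_via_rows(a: &[i32], b: &[u16]) -> Vec<usize> {
    let rows = to_rows(a, b);
    let mut indices: Vec<usize> = (0..rows.len()).collect();
    indices.sort_by(|&i, &j| rows[i].cmp(&rows[j]));
    indices
}

fn main() {
    let a = vec![1, -5, 1, -5];
    let b = vec![2u16, 9, 1, 3];
    println!("{:?}", lexsort_indices_via_rows(&a, &b));
}
```

Once the rows are encoded, each comparison is a single byte-slice comparison instead of a per-column dispatch, which is the main reason this tends to win for multi-column sorts.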
Kind Regards,
Raphael
[1]: https://docs.rs/arrow-ord/latest/arrow_ord/sort/fn.sort_to_indices.html
[2]: https://docs.rs/arrow-select/latest/arrow_select/take/fn.take.html
[3]: https://docs.rs/arrow-ord/latest/arrow_ord/sort/fn.lexsort.html
[4]: https://docs.rs/arrow-ord/latest/arrow_ord/sort/fn.lexsort_to_indices.html
[5]: https://docs.rs/arrow-row/latest/arrow_row
[6]: https://docs.rs/arrow-row/latest/arrow_row/#lexsort
[7]: https://docs.rs/arrow-row/latest/arrow_row/struct.RowConverter.html#method.convert_rows
[8]: https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/
On 26/02/2023 11:55, Olo Sawyerr wrote:
Hi there,
Hope you're well.
I'm trying to sort a RecordBatch by multiple columns and it's not
obvious how to achieve this. The kernels::sort() only takes a single
ArrayRef.
Any ideas pls?
Regards,
Olo