[I] perf: sort by DictionaryArray first faster [arrow-rs]

via GitHub Tue, 11 Mar 2025 08:37:52 -0700


discord9 opened a new issue, #7267:
URL: https://github.com/apache/arrow-rs/issues/7267


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   Can we sort by `DictionaryArray`  first and then some column be faster? i.e: 
sort by a, b, c columns in which a
   is a dictionary Array.
   
   <!--
   A clear and concise description of what the problem is. Ex. I'm always 
frustrated when [...] 
   (This section helps Arrow developers understand the context and *why* for 
this feature, in addition to  the *what*)
   -->
   For now, if there is multiple sort expr, arrow will fallback to use normal 
lexsort. But if the first expr is a dictionary array, and we sort it by first 
expr first then do lexsort with rest of sort expr, it can be much faster.
   Also the condition to trigger this optimize can vary more, i.e. only when 
the dictionary array have small enough distinct values to keys ratio, or sort 
columns count is large enough
   
   Here I provide a simple bench report to show the idea is useful, the 
first_sort use this optimize, and all sort simply run lexsort_to_indices. Can 
observer up to 56% speedup when distinct number is smaller
   ```bash
   all_sort_100_distinct   time:   [68.235 µs 68.577 µs 68.958 µs]
   Found 2 outliers among 100 measurements (2.00%)
     2 (2.00%) high mild
   
   first_sort_100_distinct time:   [40.449 µs 40.633 µs 40.841 µs]
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high mild
   
   all_sort_1000_distinct  time:   [65.892 µs 66.193 µs 66.538 µs]
   Found 4 outliers among 100 measurements (4.00%)
     1 (1.00%) high mild
     3 (3.00%) high severe
   
   first_sort_1000_distinct
                           time:   [72.761 µs 73.024 µs 73.323 µs]
   Found 3 outliers among 100 measurements (3.00%)
     2 (2.00%) high mild
     1 (1.00%) high severe
   
   all_sort_10_distinct    time:   [78.826 µs 79.160 µs 79.510 µs]
   
   first_sort_10_distinct  time:   [34.844 µs 34.997 µs 35.157 µs]
   Found 2 outliers among 100 measurements (2.00%)
     2 (2.00%) high mild
   ```
    The bench program is as following:
   ```rust
   use arrow::{
       array::{DictionaryArray, StringArray, UInt32Array},
       compute::{LexicographicalComparator, SortColumn,  lexsort_to_indices},
       datatypes::UInt32Type,
   };
   use criterion::{Criterion, criterion_group, criterion_main};
   use rand::prelude::*;
   use std::{hint::black_box, sync::Arc};
   
   fn gen_uint32_array(len: usize, rng: &mut impl Rng) -> UInt32Array {
       let mut values = vec![0u32; len];
       values.iter_mut().for_each(|v| *v = rng.random());
       UInt32Array::from(values)
   }
   
   fn gen_str_array(
       len: usize,
       distinct_cnt: usize,
       rng: &mut impl Rng,
   ) -> DictionaryArray<UInt32Type> {
       let mut values = vec![0u32; len];
       let mut distinct: Vec<String> = vec![];
       for _ in 0..distinct_cnt {
           distinct.push(format!("{}", rng.random::<u32>()));
       }
       // fill values with random distinct strings
       values.iter_mut().for_each(|v| *v = rng.random::<u32>() % distinct_cnt 
as u32);
   
       let keys = UInt32Array::from(values);
       let values = Arc::new(StringArray::from(distinct)) as _;
       DictionaryArray::try_new(keys, values).unwrap()
   }
   
   fn gen_sort_columns(len: usize, distinct_cnt: usize, seed: usize) -> 
Vec<SortColumn> {
       let mut rng = SmallRng::seed_from_u64(seed as u64);
       let mut res = Vec::new();
       let first = gen_str_array(len, distinct_cnt, &mut rng);
       res.push(SortColumn {
           values: Arc::new(first),
           options: None,
       });
       for _ in 0..10 {
           let col = gen_uint32_array(len, &mut rng);
           res.push(SortColumn {
               values: Arc::new(col),
               options: None,
           });
       }
       res
   }
   
   /// first sort by first column, then others, should reduec a lot of 
comparator calls since it's mostly sorted
   fn first_sort(columns: &[SortColumn]) -> Vec<usize> {
       let first = lexsort_to_indices(&columns[0..1], None).unwrap();
       let lex_comp = 
LexicographicalComparator::try_new(&columns[1..]).unwrap();
       let mut first = first
           .values()
           .iter()
           .map(|v| *v as usize)
           .collect::<Vec<usize>>();
       first.sort_unstable_by(|a, b| lex_comp.compare(*a, *b));
       first
   }
   
   fn all_sort(columns: &[SortColumn]) -> Vec<usize> {
       lexsort_to_indices(columns, None)
           .unwrap()
           .values()
           .iter()
           .map(|v| *v as usize)
           .collect::<Vec<usize>>()
   }
   
   fn criterion_benchmark(c: &mut Criterion) {
       c.bench_function("all_sort_100_distinct", |b| {
           let columns = gen_sort_columns(1000, 100, 0);
           b.iter(|| all_sort(black_box(&columns)));
       });
   
       c.bench_function("first_sort_100_distinct", |b| {
           let columns = gen_sort_columns(1000, 100, 0);
           b.iter(|| first_sort(black_box(&columns)));
       });
   
       c.bench_function("all_sort_1000_distinct", |b| {
           let columns = gen_sort_columns(1000, 1000, 0);
           b.iter(|| all_sort(black_box(&columns)));
       });
   
       c.bench_function("first_sort_1000_distinct", |b| {
           let columns = gen_sort_columns(1000, 1000, 0);
           b.iter(|| first_sort(black_box(&columns)));
       });
   
       c.bench_function("all_sort_10_distinct", |b| {
           let columns = gen_sort_columns(1000, 10, 0);
           b.iter(|| all_sort(black_box(&columns)));
       });
   
       c.bench_function("first_sort_10_distinct", |b| {
           let columns = gen_sort_columns(1000, 10, 0);
           b.iter(|| first_sort(black_box(&columns)));
       });
   }
   
   criterion_group!(benches, criterion_benchmark);
   criterion_main!(benches);
   
   
   #[test]
   fn test_sort_correct(){
       let columns = gen_sort_columns(1000, 100, 0);
       let first = first_sort(&columns);
       let all = all_sort(&columns);
       assert_eq!(first, all);
   }
   ```
   
   **Describe the solution you'd like**
   <!--
   A clear and concise description of what you want to happen.
   -->
   
   **Describe alternatives you've considered**
   <!--
   A clear and concise description of any alternative solutions or features 
you've considered.
   -->
   
   **Additional context**
   <!--
   Add any other context or screenshots about the feature request here.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[I] perf: sort by DictionaryArray first faster [arrow-rs]

Reply via email to