rluvaton commented on issue #9083:
URL: https://github.com/apache/arrow-rs/issues/9083#issuecomment-3800851661

   @Dandandan @alamb created an run a benchmark, link for the code in the end.
   
   ----
   
   I created a benchmark with different implementations that all achieve the 
same goal, shuffle to partitions.
   
   the implementations are:
   
   1. take into builders - I implemented `take` that accept a sink to save the 
take data into to avoid reallocation or `concat`
   2. take and concat - using the existing take and then doing concat on each 
column
   3. interleave
   4. existing row format
   5. optimized row format
   
   For each implementation that is no row based I implemented 2 benchmarks one 
when we go for all batches of column1 than all batches of column2 and so on 
(ones that named _column wise_) and another implementation that goes per batch, 
run the `take`/`interleave` on each column in that batch and than go to the 
next batch.
   
   each benchmark get an input laid out so no pre processing on the input is 
needed before running (so `interleave` for example will already get the 
`Vec<(batch_index, row_index)>`).
   
   The benchmark is to shuffle 128 batches of size 8192 rows into 1500 
partitions. so shuffling ~1M rows. this also mean that each partition will only 
have ~700 rows in it.
   
   I excluded the logic for splitting into batches from the benchmark since all 
of it are running in
   
   I also preallocate for the take into builders and both row format enough 
data so no need to copy.
   
   these are the results (sorted from fastest to slowest) running on my local 
MacBook M3 run:
   
   Benchmark | Time (mean) | Time Range
   -- | -- | --
   shuffle/optimized_row_format_approach going partition wise | 2.6734 s | 
[2.6685 s - 2.6790 s]
   shuffle/optimized_row_format_approach | 2.6930 s | [2.6899 s - 2.6958 s]
   shuffle/row_format_approach going partition wise | 3.0178 s | [3.0075 s - 
3.0283 s]
   shuffle/row_format_approach | 3.0346 s | [3.0230 s - 3.0486 s]
   shuffle/interleave_column_wise_approach | 3.4028 s | [3.3219 s - 3.4863 s]
   shuffle/interleave_approach | 3.5126 s | [3.4128 s - 3.6161 s]
   shuffle/take_to_builders_column_wise_approach | 4.2012 s | [4.1025 s - 
4.3140 s]
   shuffle/take_to_builders_approach | 4.2263 s | [4.0417 s - 4.4349 s]
   shuffle/take_approach | 7.0364 s | [6.8709 s - 7.2323 s]
   shuffle/take_column_wise_approach | 8.0610 s | [8.0290 s - 8.0964 s]
   
   
   <details>
   <summary>Raw results</summary>
   
   ```
        Running benches/shuffle.rs 
(target/release/deps/shuffle-b6a3c1bf993fb91d)
   Gnuplot not found, using plotters backend
   Benchmarking shuffle/take_to_builders_approach: Warming up for 3.0000 s
   Warning: Unable to complete 10 samples in 5.0s. You may wish to increase 
target time to 40.8s.
   shuffle/take_to_builders_approach
                           time:   [4.0417 s 4.2263 s 4.4349 s]
   Benchmarking shuffle/take_to_builders_column_wise_approach: Warming up for 
3.0000 s
   Warning: Unable to complete 10 samples in 5.0s. You may wish to increase 
target time to 44.6s.
   shuffle/take_to_builders_column_wise_approach
                           time:   [4.1025 s 4.2012 s 4.3140 s]
   Found 3 outliers among 10 measurements (30.00%)
     1 (10.00%) low mild
     1 (10.00%) high mild
     1 (10.00%) high severe
   Benchmarking shuffle/take_approach: Warming up for 3.0000 s
   Warning: Unable to complete 10 samples in 5.0s. You may wish to increase 
target time to 107.1s.
   shuffle/take_approach   time:   [6.8709 s 7.0364 s 7.2323 s]
   Found 2 outliers among 10 measurements (20.00%)
     2 (20.00%) high mild
   Benchmarking shuffle/take_column_wise_approach: Warming up for 3.0000 s
   Warning: Unable to complete 10 samples in 5.0s. You may wish to increase 
target time to 80.8s.
   shuffle/take_column_wise_approach
                           time:   [8.0290 s 8.0610 s 8.0964 s]
   Benchmarking shuffle/interleave_approach: Warming up for 3.0000 s
   Warning: Unable to complete 10 samples in 5.0s. You may wish to increase 
target time to 40.0s.
   shuffle/interleave_approach
                           time:   [3.4128 s 3.5126 s 3.6161 s]
   Benchmarking shuffle/interleave_column_wise_approach: Warming up for 3.0000 s
   Warning: Unable to complete 10 samples in 5.0s. You may wish to increase 
target time to 34.8s.
   shuffle/interleave_column_wise_approach
                           time:   [3.3219 s 3.4028 s 3.4863 s]
   Benchmarking shuffle/row_format_approach: Warming up for 3.0000 s
   Warning: Unable to complete 10 samples in 5.0s. You may wish to increase 
target time to 32.3s.
   shuffle/row_format_approach
                           time:   [3.0230 s 3.0346 s 3.0486 s]
   Found 1 outliers among 10 measurements (10.00%)
     1 (10.00%) high mild
   Benchmarking shuffle/row_format_approach going partition wise: Warming up 
for 3.0000 s
   Warning: Unable to complete 10 samples in 5.0s. You may wish to increase 
target time to 30.3s.
   shuffle/row_format_approach going partition wise
                           time:   [3.0075 s 3.0178 s 3.0283 s]
   Benchmarking shuffle/optimized_row_format_approach: Warming up for 3.0000 s
   Warning: Unable to complete 10 samples in 5.0s. You may wish to increase 
target time to 27.0s.
   shuffle/optimized_row_format_approach
                           time:   [2.6899 s 2.6930 s 2.6958 s]
   Benchmarking shuffle/optimized_row_format_approach going partition wise: 
Warming up for 3.0000 s
   Warning: Unable to complete 10 samples in 5.0s. You may wish to increase 
target time to 26.7s.
   shuffle/optimized_row_format_approach going partition wise
                           time:   [2.6685 s 2.6734 s 2.6790 s]
   ```
   
   
   </details>
   
   
   
   [**Benchmark 
implementation:**](https://github.com/rluvaton/bench_arrow_shuffle_implementation)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to