rluvaton commented on issue #9083:
URL: https://github.com/apache/arrow-rs/issues/9083#issuecomment-3703789605

   
   ## Bottleneck 2: Branch misprediction for 50 non-nullable string columns with string lengths in 1..=31
   > Note: Using non-empty strings to avoid the empty-string special case and give the branch predictor the optimal case,
   > strings shorter than 32 bytes (less than the block size) so encoding takes the mini-block path,
   > and no nulls, to focus on a specific code path
   
   <details><summary>Benchmark code:</summary>
   <p>
   
   ```rust
   fn do_bench(c: &mut Criterion, name: &str, cols: Vec<ArrayRef>) {
     let fields: Vec<_> = cols
       .iter()
       .map(|x| SortField::new(x.data_type().clone()))
       .collect();
     let mut group = c.benchmark_group("row_format");
     group.throughput(criterion::Throughput::Elements(cols.len() as u64));
   
     group.bench_function(&format!("convert_columns {name}"), |b| {
       b.iter(|| {
         let converter = RowConverter::new(fields.clone()).unwrap();
         hint::black_box(converter.convert_columns(&cols).unwrap())
       });
     });
   }
   
   
   struct BinaryGenOptions {
     num_cols: usize,
     batch_size: usize,
     min_len: usize,
     max_len: usize,
   }
   
   fn run_bench_on_multi_column_with_same_string_data_type_with_no_nulls(
     c: &mut Criterion,
     opts: BinaryGenOptions
   ) {
     let mut seed = 0;
   
     let mut cols: Vec<ArrayRef> = vec![];
   
     for _ in 0..opts.num_cols {
       seed += 1;
       cols.push(
         Arc::new(create_string_array_with_len_range_and_prefix_and_seed::<i32>(
           opts.batch_size, 0.0, opts.min_len, opts.max_len, "", seed,
         )) as ArrayRef,
       );
     }
   
     do_bench(
       c,
       format!(
         "{} columns of StringArray str range: {}..={} batch_size: {} with no 
nulls",
         opts.num_cols,
         opts.min_len,
         opts.max_len,
         opts.batch_size
       )
         .as_str(),
       cols,
     );
   }
   
   fn row_bench(c: &mut Criterion) {
     run_bench_on_multi_column_with_same_string_data_type_with_no_nulls(
       c,
       BinaryGenOptions {
         batch_size: 8192,
         num_cols: 50,
         min_len: 1,
         max_len: 31,
       },
     );
   }
   
   
   criterion_group! {
       name = benches;
       config = Criterion::default().measurement_time(Duration::from_secs(10));
       targets = row_bench
   }
   ```
   
   </p>
   </details> 
   
   Results on main:
   ```bash
   $ cargo bench --features=test_utils --bench row_format --profile=profiling -- --save-baseline=main
       Finished `profiling` profile [optimized + debuginfo] target(s) in 4.74s
        Running benches/row_format.rs (target/profiling/deps/row_format-7489cf224545c7d6)
   Benchmarking row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...: Collecting 100 samples in estimated 10.012
   row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...
                           time:   [8.3129 ms 8.3163 ms 8.3200 ms]
                           thrpt:  [6.0096 Kelem/s 6.0123 Kelem/s 6.0148 Kelem/s]
   Found 8 outliers among 100 measurements (8.00%)
     5 (5.00%) high mild
     3 (3.00%) high severe
   ```
   
   > [!NOTE]  
   > TL;DR: the problem here is CPU branch misprediction: the CPU fails to predict the number of mini blocks in each value, because the string lengths are uniformly random.
   >
   > We can avoid mini blocks entirely (they were introduced to reduce space amplification for small strings)
   > by encoding each non-empty string as `<len><slice>`, where the length is stored in the smallest integer type that can represent it (explained in more detail below)
   
   
   <details><summary>Step by step finding the problem (<code>Branch_Mispredicts</code>):</summary>
   
   
   
   <details><summary>Running Top-Down analysis - Level 1</summary>
   <p>
   
   ```bash
   $ ~/pmu-tools/toplev.py --core S0-C0 -l1 --single-thread -v --no-desc taskset -c 0 target/profiling/deps/row_format-7489cf224545c7d6 --bench --warm-up-time 1 --noplot
   Benchmarking row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...: Collecting 100 samples in estimated 10.073
   row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...
                           time:   [8.3223 ms 8.3255 ms 8.3288 ms]
                           thrpt:  [6.0033 Kelem/s 6.0057 Kelem/s 6.0079 Kelem/s]
   Found 8 outliers among 100 measurements (8.00%)
     7 (7.00%) high mild
     1 (1.00%) high severe

   # 5.01-full-perf on Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz [clx/skylake]
   FE               Frontend_Bound   % Slots                       23.4
   BAD              Bad_Speculation  % Slots                       26.5   <==
   BE               Backend_Bound    % Slots                       11.8  <
   RET              Retiring         % Slots                       38.3  <
   MUX                               %                            100.00
   Run toplev --describe Bad_Speculation^ to get more information on bottleneck
   Add --run-sample to find locations
   Add --nodes '!+Bad_Speculation*/2,+MUX' for breakdown.
   ```
   
   </p>
   </details> 
   
   It looks like the problem is `Bad_Speculation`.
   
   
   <details><summary>Running Top-Down analysis - Level 2</summary>
   <p>
   
   ```bash
   $ ~/pmu-tools/toplev.py --core S0-C0 -l2 --single-thread -v --no-desc taskset -c 0 target/profiling/deps/row_format-7489cf224545c7d6 --bench --warm-up-time 1 --noplot
   Benchmarking row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...: Collecting 100 samples in estimated 10.133
   row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...
                           time:   [8.3452 ms 8.3481 ms 8.3513 ms]
                           thrpt:  [5.9871 Kelem/s 5.9894 Kelem/s 5.9914 Kelem/s]
                    change:
                           time:   [+0.2160% +0.2718% +0.3277%] (p = 0.00 < 0.05)
                           thrpt:  [−0.3267% −0.2710% −0.2156%]
                           Change within noise threshold.
   Found 8 outliers among 100 measurements (8.00%)
     4 (4.00%) high mild
     4 (4.00%) high severe

   # 5.01-full-perf on Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz [clx/skylake]
   FE               Frontend_Bound                      % Slots                    23.0    [14.0%]
   BAD              Bad_Speculation                     % Slots                    25.9    [14.0%]
   BE               Backend_Bound                       % Slots                    12.0  < [14.0%]
   RET              Retiring                            % Slots                    39.1  < [14.0%]
   FE               Frontend_Bound.Fetch_Latency        % Slots                     9.2  < [14.0%]
   FE               Frontend_Bound.Fetch_Bandwidth      % Slots                    13.8  < [14.0%]
   BAD              Bad_Speculation.Branch_Mispredicts  % Slots                    25.9    [14.0%]<==
   BAD              Bad_Speculation.Machine_Clears      % Slots                     0.0  < [14.0%]
   BE/Mem           Backend_Bound.Memory_Bound          % Slots                     4.0  < [14.0%]
   BE/Core          Backend_Bound.Core_Bound            % Slots                     8.1  < [14.0%]
   RET              Retiring.Light_Operations           % Slots                    36.9  < [14.0%]
   RET              Retiring.Heavy_Operations           % Slots                     2.2  < [14.0%]
   MUX                                                  %                          14.00
   Run toplev --describe Branch_Mispredicts^ to get more information on bottleneck
   Add --run-sample to find locations
   Add --nodes '!+Branch_Mispredicts*/3,+MUX' for breakdown.
   ```
   
   </p>
   </details> 
   
   It is `Branch_Mispredicts`.
   
   <details><summary>Running Top-Down analysis - Level 3</summary>
   <p>
   
   ```bash
   $ ~/pmu-tools/toplev.py --core S0-C0 -l3 --single-thread -v --no-desc taskset -c 0 target/profiling/deps/row_format-7489cf224545c7d6 --bench --warm-up-time 1 --noplot
   Benchmarking row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...: Collecting 100 samples in estimated 10.123
   row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...
                           time:   [8.3599 ms 8.3622 ms 8.3647 ms]
                           thrpt:  [5.9775 Kelem/s 5.9793 Kelem/s 5.9809 Kelem/s]
                    change:
                           time:   [+0.1217% +0.1692% +0.2131%] (p = 0.00 < 0.05)
                           thrpt:  [−0.2126% −0.1689% −0.1215%]
                           Change within noise threshold.
   Found 7 outliers among 100 measurements (7.00%)
     7 (7.00%) high mild

   # 5.01-full-perf on Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz [clx/skylake]
   FE               Frontend_Bound                                        % Slots                       23.0    [ 3.0%]
   BAD              Bad_Speculation.Branch_Mispredicts.Other_Mispredicts  % Slots                        0.3  < [ 3.0%]
   BAD              Bad_Speculation.Machine_Clears.Other_Nukes            % Slots                        0.0  <
   BE/Core          Backend_Bound.Core_Bound.Serializing_Operation        % Clocks                       0.0  < [ 3.0%]
   BAD              Bad_Speculation                                       % Slots                       26.0    [ 3.0%]
   BE               Backend_Bound                                         % Slots                       12.1  < [ 3.0%]
   RET              Retiring                                              % Slots                       38.8  < [ 3.0%]
   FE               Frontend_Bound.Fetch_Latency                          % Slots                        9.1  < [ 3.0%]
   FE               Frontend_Bound.Fetch_Bandwidth                        % Slots                       13.8  < [ 3.0%]
   BAD              Bad_Speculation.Branch_Mispredicts                    % Slots                       26.0    [ 3.0%]<==
   BAD              Bad_Speculation.Machine_Clears                        % Slots                        0.0  < [ 3.0%]
   BE/Mem           Backend_Bound.Memory_Bound                            % Slots                        4.0  < [ 3.0%]
   BE/Core          Backend_Bound.Core_Bound                              % Slots                        8.1  < [ 3.0%]
   RET              Retiring.Light_Operations                             % Slots                       36.6  < [ 3.0%]
   RET              Retiring.Heavy_Operations                             % Slots                        2.3  < [ 3.0%]
   FE               Frontend_Bound.Fetch_Latency.ICache_Misses            % Clocks                       0.1  < [ 3.0%]
   FE               Frontend_Bound.Fetch_Latency.ITLB_Misses              % Clocks                       0.0  < [ 3.0%]
   FE               Frontend_Bound.Fetch_Latency.Branch_Resteers          % Clocks                       6.4  < [ 3.0%]
   FE               Frontend_Bound.Fetch_Latency.MS_Switches              % Clocks_est                   0.3  < [ 3.0%]
   FE               Frontend_Bound.Fetch_Latency.LCP                      % Clocks                       0.0  < [ 3.0%]
   FE               Frontend_Bound.Fetch_Latency.DSB_Switches             % Clocks                       8.3  < [ 3.0%]
   FE               Frontend_Bound.Fetch_Bandwidth.MITE                   % Slots_est                    9.3  < [ 3.0%]
   FE               Frontend_Bound.Fetch_Bandwidth.DSB                    % Slots_est                    6.9  < [ 3.0%]
   BE/Mem           Backend_Bound.Memory_Bound.L1_Bound                   % Stalls                       9.1  < [ 3.0%]
   BE/Mem           Backend_Bound.Memory_Bound.L2_Bound                   % Stalls                       0.4  < [ 3.0%]
   BE/Mem           Backend_Bound.Memory_Bound.L3_Bound                   % Stalls                       1.3  < [ 3.0%]
   BE/Mem           Backend_Bound.Memory_Bound.DRAM_Bound                 % Stalls                       0.7  < [ 3.0%]
   BE/Mem           Backend_Bound.Memory_Bound.Store_Bound                % Stalls                       4.0  < [ 3.0%]
   BE/Core          Backend_Bound.Core_Bound.Divider                      % Clocks                       0.0  < [ 3.0%]
   BE/Core          Backend_Bound.Core_Bound.Ports_Utilization            % Clocks                      15.6  < [ 3.0%]
   RET              Retiring.Light_Operations.FP_Arith                    % Uops                         0.5  < [ 3.0%]
   RET              Retiring.Light_Operations.Memory_Operations           % Slots                       13.9  < [ 3.0%]
   RET              Retiring.Light_Operations.Fused_Instructions          % Slots                        5.9  < [ 3.0%]
   RET              Retiring.Light_Operations.Non_Fused_Branches          % Slots                        1.8  < [ 3.0%]
   RET              Retiring.Light_Operations.Other_Light_Ops             % Slots                       14.5  < [ 3.0%]
   RET              Retiring.Heavy_Operations.Few_Uops_Instructions       % Slots                        2.1  < [ 3.0%]
   RET              Retiring.Heavy_Operations.Microcode_Sequencer         % Slots                        0.1  < [ 3.0%]
   MUX                                                                    %                              3.00
   Run toplev --describe Branch_Mispredicts^ to get more information on bottleneck
   Add --run-sample to find locations
   Add --nodes '!+Branch_Mispredicts*/3,+MUX' for breakdown.
   ```
   
   </p>
   </details> 
   
   
   
   <details><summary>Recording the mispredictions and all branches using perf</summary>
   <p>
   
   
   Running perf record to see where most of the mispredictions happen, counting both events (mispredictions, `all_branches`):
   ```bash
   $ perf record -e '{br_misp_retired.all_branches,br_inst_retired.all_branches}:pp' -g target/profiling/deps/row_format-7489cf224545c7d6 --bench --warm-up-time 1 --noplot
   ```
   
   Looking at the report:
   ```bash
   $ perf report -g --group --percent-limit 0.01 --no-children
   ```
   
   </p>
   </details> 
   
   
   The most expensive function in terms of mispredictions is `arrow_row::variable::encode_one`:
   ```bash
   +   67.00%  43.93%  row_format-7489  row_format-7489cf224545c7d6  [.] arrow_row::variable::encode_one
   ```
   
   
   <details><summary>annotated info</summary>
   <p>
   
   ```asm
   Percent        │     arrow_row::variable::encode_blocks:
                  │       mov     %ecx,%r14d
                  │       and     $0x7,%r14d
                  │     <core::slice::iter::ChunksExact<T> as core::iter::traits::iterator::Iterator>::size_hint:
                  │       mov     %rcx,%rax
                  │       shr     $0x3,%rax
                  │     core::cmp::Ord::min:
                  │       cmp     %rax,%r9
                  │       cmovb   %r9,%rax
                  │     <core::iter::adapters::zip::Zip<A,B> as core::iter::adapters::zip::ZipImpl<A,B>>::next:
      3.08   3.11 │       test    %rax,%rax
     26.75   0.78 │     ↓ je      aa
                  │     arrow_row::variable::encode_blocks:
                  │     let remainder = chunks.remainder();
                  │     for (input, output) in chunks.clone().zip(to_write.chunks_exact_mut(SIZE + 1)) {
                  │     let input: &[u8; SIZE] = input.try_into().unwrap();
                  │     let out_block: &mut [u8; SIZE] = (&mut output[..SIZE]).try_into().unwrap();
                  │
                  │     *out_block = *input;
                  │       mov     (%rdx),%r9
                  │       mov     %r9,0x1(%rdi)
                  │
                  │     // Indicate that there are further blocks to follow
                  │     output[SIZE] = BLOCK_CONTINUATION;
                  │       movb    $0xff,0x9(%rdi)
                  │     <core::iter::adapters::zip::Zip<A,B> as core::iter::adapters::zip::ZipImpl<A,B>>::next:
      6.92   5.40 │    ┌──cmp     $0x1,%rax
     26.92   2.92 │    ├──je      aa
                  │    │arrow_row::variable::encode_blocks:
                  │    │*out_block = *input;
                  │    │  mov     0x8(%rdx),%r9
                  │    │  mov     %r9,0xa(%rdi)
                  │    │output[SIZE] = BLOCK_CONTINUATION;
                  │    │  movb    $0xff,0x12(%rdi)
                  │    │<core::iter::adapters::zip::Zip<A,B> as core::iter::adapters::zip::ZipImpl<A,B>>::next:
     13.33   3.66 │    │  cmp     $0x2,%rax
     16.83   3.57 │    │↓ je      aa
                  │    │arrow_row::variable::encode_blocks:
                  │    │*out_block = *input;
                  │    │  mov     0x10(%rdx),%r9
                  │    │  mov     %r9,0x13(%rdi)
                  │    │output[SIZE] = BLOCK_CONTINUATION;
                  │    │  movb    $0xff,0x1b(%rdi)
                  │    │<core::iter::adapters::zip::Zip<A,B> as core::iter::adapters::zip::ZipImpl<A,B>>::next:
                  │    │  cmp     $0x3,%rax
      0.00   5.24 │    │↓ je      aa
                  │    │arrow_row::variable::encode_blocks:
                  │    │*out_block = *input;
                  │    │  mov     0x18(%rdx),%rax
                  │    │  mov     %rax,0x1c(%rdi)
                  │    │output[SIZE] = BLOCK_CONTINUATION;
                  │    │  movb    $0xff,0x24(%rdi)
                  │    │arrow_row::variable::encode_one:
                  │    │1 + encode_blocks::<MINI_BLOCK_SIZE>(&mut out[1..], val)
   ```
   
   </p>
   </details> 
   
   
   
   </details> 
   
   
   We see that the misprediction is on the iterator over the blocks in the `variable::encode_blocks` function:
   ```rust
   for (input, output) in chunks.clone().zip(to_write.chunks_exact_mut(SIZE + 1)) {
   ```
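
   For context, a simplified, self-contained sketch of what this block encoding does (paraphrased, not the exact arrow-row source; names like `BLOCK_CONTINUATION` match the annotated code above, the rest is my reconstruction). Each `SIZE`-byte block is followed by a continuation byte, and the data-dependent number of blocks per value is exactly the trip count the predictor cannot guess:

   ```rust
   const BLOCK_CONTINUATION: u8 = 0xff;

   /// Encode `val` as a sequence of SIZE-byte blocks, each followed by a
   /// continuation byte: 0xFF if another block follows, otherwise the number of
   /// valid bytes in the final, zero-padded block. Assumes `val` is non-empty
   /// (the empty case is handled earlier via a sentinel byte).
   fn encode_blocks<const SIZE: usize>(out: &mut [u8], val: &[u8]) -> usize {
       let block_count = val.len().div_ceil(SIZE);
       let end_offset = block_count * (SIZE + 1);
       let to_write = &mut out[..end_offset];

       let chunks = val.chunks_exact(SIZE);
       let remainder = chunks.remainder();
       // The trip count here is `val.len() / SIZE`; with uniformly random
       // lengths the branch predictor cannot guess it, hence the mispredictions.
       for (input, output) in chunks.clone().zip(to_write.chunks_exact_mut(SIZE + 1)) {
           output[..SIZE].copy_from_slice(input);
           // Indicate that there are further blocks to follow
           output[SIZE] = BLOCK_CONTINUATION;
       }

       if remainder.is_empty() {
           // The last full block terminates the value
           to_write[end_offset - 1] = SIZE as u8;
       } else {
           // Zero-pad the final partial block and record its valid length
           let last = &mut to_write[(block_count - 1) * (SIZE + 1)..];
           last[..SIZE].fill(0);
           last[..remainder.len()].copy_from_slice(remainder);
           last[SIZE] = remainder.len() as u8;
       }
       end_offset
   }
   ```
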
   
   We can change the code to avoid the blocks entirely
   (blocks are a way to represent a byte slice of unknown size without a length prefix, so that sorting works across slices of different lengths)
   by encoding the length at the start (in the smallest integer type possible, to avoid wasting space on small strings)
   and then writing the slice as-is:
   
   <details><summary>updated code:</summary>
   <p>
   
   ```rust
   /// Faster alternative to `encode_blocks` that encodes the length prefix and then
   /// copies the data as-is
   #[inline]
   fn fast_encode_bytes(out: &mut [u8], val: &[u8], opts: SortOptions) -> usize {
       // Write `2_u8` to demarcate as non-empty, non-null string
       out[0] = NON_EMPTY_SENTINEL;

       // TODO - in desc we should encode max minus the length so the order will be
       // different (longer strings sort before shorter ones)
       let start_data_offset = {
           let len = val.len();

           match get_number_of_bits_needed_to_encode(len) {
               0 => {
                   out[0] = EMPTY_SENTINEL;
                   return 1;
               }
               1 => {
                   out[0] |= LENGTH_TYPE_U8;

                   // encode length
                   let start_data_offset = 1 + size_of::<u8>();
                   unsafe { out.get_unchecked_mut(1..start_data_offset) }
                       .copy_from_slice(&(len as u8).to_be_bytes());

                   start_data_offset
               }
               2 => {
                   out[0] |= LENGTH_TYPE_U16;

                   // encode length
                   let start_data_offset = 1 + size_of::<u16>();
                   unsafe { out.get_unchecked_mut(1..start_data_offset) }
                       .copy_from_slice(&(len as u16).to_be_bytes());

                   start_data_offset
               }
               4 => {
                   out[0] |= LENGTH_TYPE_U32;

                   // encode length
                   let start_data_offset = 1 + size_of::<u32>();
                   unsafe { out.get_unchecked_mut(1..start_data_offset) }
                       .copy_from_slice(&(len as u32).to_be_bytes());

                   start_data_offset
               }
               8 => {
                   out[0] |= LENGTH_TYPE_U64;

                   // encode length
                   let start_data_offset = 1 + size_of::<u64>();
                   unsafe { out.get_unchecked_mut(1..start_data_offset) }
                       .copy_from_slice(&(len as u64).to_be_bytes());

                   start_data_offset
               }
               bits_required => {
                   unreachable!("invalid length {len}. number of bits required {bits_required}");
               }
           }
       };

       let len = start_data_offset + val.len();
       out[start_data_offset..len].copy_from_slice(val);

       len
   }
   ```
   
   </p>
   </details> 
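
   Note that `get_number_of_bits_needed_to_encode` is not included in the snippet above; judging from the match arms it feeds (0, 1, 2, 4, 8), it appears to return the number of *bytes* needed, rounded up to a power-of-two integer width. A hypothetical reconstruction, under that assumption:

   ```rust
   // Hypothetical reconstruction of the helper used above (the snippet omits it).
   // Despite the name, the match arms suggest it returns the number of bytes
   // needed to encode `len`, rounded up to the next power-of-two integer width
   // (u8/u16/u32/u64), with 0 reserved for an empty value.
   #[inline]
   fn get_number_of_bits_needed_to_encode(len: usize) -> usize {
       if len == 0 {
           return 0;
       }
       // Minimum number of bytes that can hold `len`, then round up to 1/2/4/8
       let bytes = (usize::BITS - len.leading_zeros()).div_ceil(8);
       bytes.next_power_of_two() as usize
   }
   ```
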
   
   
   This gives a ~33% improvement in time (~50% in throughput):
   ```bash
   $ cargo bench --features=test_utils --bench row_format --profile=profiling -- --baseline=main
       Finished `profiling` profile [optimized + debuginfo] target(s) in 4.69s
        Running benches/row_format.rs (target/profiling/deps/row_format-7489cf224545c7d6)
   Benchmarking row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...: Collecting 100 samples in estimated 10.059
   row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...
                           time:   [5.5475 ms 5.5514 ms 5.5556 ms]
                           thrpt:  [9.0000 Kelem/s 9.0067 Kelem/s 9.0131 Kelem/s]
                    change:
                           time:   [−33.301% −33.246% −33.187%] (p = 0.00 < 0.05)
                           thrpt:  [+49.672% +49.804% +49.926%]
                           Performance has improved.
   ```
   
   The length-type tag could alternatively be stored in the metadata at the beginning of the row, if that turns out to perform better (I don't think it will).
   
   
   <details><summary>Running Top-Down analysis again on the updated code:</summary>
   <p>
   
   
   Running toplev again at level 1:
   ```bash
   $ ~/pmu-tools/toplev.py --core S0-C0 -l1 --single-thread -v --no-desc taskset -c 0 target/profiling/deps/row_format-7489cf224545c7d6 --bench --warm-up-time 1 --noplot
   Benchmarking row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...: Collecting 100 samples in estimated 10.018
   row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...
                           time:   [5.5393 ms 5.5431 ms 5.5469 ms]
                           thrpt:  [9.0140 Kelem/s 9.0203 Kelem/s 9.0264 Kelem/s]
                    change:
                           time:   [−34.268% −34.217% −34.168%] (p = 0.00 < 0.05)
                           thrpt:  [+51.901% +52.016% +52.134%]
                           Performance has improved.

   # 5.01-full-perf on Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz [clx/skylake]
   FE               Frontend_Bound   % Slots                       20.7
   BAD              Bad_Speculation  % Slots                       21.5   <==
   BE               Backend_Bound    % Slots                       14.6  <
   RET              Retiring         % Slots                       43.3  <
   MUX                               %                            100.00
   Run toplev --describe Bad_Speculation^ to get more information on bottleneck
   Add --run-sample to find locations
   Add --nodes '!+Bad_Speculation*/2,+MUX' for breakdown.
   ```
   
   and level 2:
   ```bash
   $ ~/pmu-tools/toplev.py --core S0-C0 -l2 --single-thread -v --no-desc taskset -c 0 target/profiling/deps/row_format-7489cf224545c7d6 --bench --warm-up-time 1 --noplot
   Benchmarking row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...: Collecting 100 samples in estimated 10.082
   row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...
                           time:   [5.5274 ms 5.5308 ms 5.5344 ms]
                           thrpt:  [9.0344 Kelem/s 9.0403 Kelem/s 9.0459 Kelem/s]
                    change:
                           time:   [−0.3163% −0.2212% −0.1278%] (p = 0.00 < 0.05)
                           thrpt:  [+0.1280% +0.2217% +0.3173%]
                           Change within noise threshold.
   Found 18 outliers among 100 measurements (18.00%)
     18 (18.00%) high mild

   # 5.01-full-perf on Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz [clx/skylake]
   FE               Frontend_Bound                      % Slots                    20.7    [14.0%]
   BAD              Bad_Speculation                     % Slots                    21.5    [14.0%]
   BE               Backend_Bound                       % Slots                    14.5  < [14.0%]
   RET              Retiring                            % Slots                    43.3  < [14.0%]
   FE               Frontend_Bound.Fetch_Latency        % Slots                    10.6    [14.0%]
   FE               Frontend_Bound.Fetch_Bandwidth      % Slots                    10.1  < [14.0%]
   BAD              Bad_Speculation.Branch_Mispredicts  % Slots                    21.5    [14.0%]<==
   BAD              Bad_Speculation.Machine_Clears      % Slots                     0.0  < [14.0%]
   BE/Mem           Backend_Bound.Memory_Bound          % Slots                     5.6  < [14.0%]
   BE/Core          Backend_Bound.Core_Bound            % Slots                     8.9  < [14.0%]
   RET              Retiring.Light_Operations           % Slots                    39.9  < [14.0%]
   RET              Retiring.Heavy_Operations           % Slots                     3.4  < [14.0%]
   MUX                                                  %                          14.00
   Run toplev --describe Branch_Mispredicts^ to get more information on bottleneck
   Add --run-sample to find locations
   Add --nodes '!+Branch_Mispredicts*/3,+MUX' for breakdown.
   ```
   
   and level 3:
   ```bash
   $ ~/pmu-tools/toplev.py --core S0-C0 -l3 --single-thread -v --no-desc taskset -c 0 target/profiling/deps/row_format-7489cf224545c7d6 --bench --warm-up-time 1 --noplot
   Benchmarking row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...: Collecting 100 samples in estimated 10.003
   row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...
                           time:   [5.5268 ms 5.5302 ms 5.5337 ms]
                           thrpt:  [9.0355 Kelem/s 9.0413 Kelem/s 9.0468 Kelem/s]
                    change:
                           time:   [−0.1050% −0.0117% +0.0828%] (p = 0.80 > 0.05)
                           thrpt:  [−0.0827% +0.0117% +0.1051%]
                           No change in performance detected.
   Found 15 outliers among 100 measurements (15.00%)
     15 (15.00%) high mild
   
   # 5.01-full-perf on Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz [clx/skylake]
   FE               Frontend_Bound                                        % Slots                       20.8    [ 3.0%]
   BAD              Bad_Speculation.Branch_Mispredicts.Other_Mispredicts  % Slots                        0.0  < [ 3.0%]
   BAD              Bad_Speculation.Machine_Clears.Other_Nukes            % Slots                        0.0  <
   BE/Core          Backend_Bound.Core_Bound.Serializing_Operation        % Clocks                       0.0  < [ 3.0%]
   BAD              Bad_Speculation                                       % Slots                       21.6    [ 3.0%]
   BE               Backend_Bound                                         % Slots                       14.7  < [ 3.0%]
   RET              Retiring                                              % Slots                       43.0  < [ 3.0%]
   FE               Frontend_Bound.Fetch_Latency                          % Slots                       10.6    [ 3.0%]
   FE               Frontend_Bound.Fetch_Bandwidth                        % Slots                       10.3    [ 3.0%]
   BAD              Bad_Speculation.Branch_Mispredicts                    % Slots                       21.6    [ 3.0%]<==
   BAD              Bad_Speculation.Machine_Clears                        % Slots                        0.0  < [ 3.0%]
   BE/Mem           Backend_Bound.Memory_Bound                            % Slots                        5.7  < [ 3.0%]
   BE/Core          Backend_Bound.Core_Bound                              % Slots                        9.0  < [ 3.0%]
   RET              Retiring.Light_Operations                             % Slots                       39.6  < [ 3.0%]
   RET              Retiring.Heavy_Operations                             % Slots                        3.4  < [ 3.0%]
   FE               Frontend_Bound.Fetch_Latency.ICache_Misses            % Clocks                       0.1  < [ 3.0%]
   FE               Frontend_Bound.Fetch_Latency.ITLB_Misses              % Clocks                       0.0  < [ 3.0%]
   FE               Frontend_Bound.Fetch_Latency.Branch_Resteers          % Clocks                       5.4    [ 3.0%]
   FE               Frontend_Bound.Fetch_Latency.MS_Switches              % Clocks_est                   0.3  < [ 3.0%]
   FE               Frontend_Bound.Fetch_Latency.LCP                      % Clocks                       0.0    [ 3.0%]
   FE               Frontend_Bound.Fetch_Latency.DSB_Switches             % Clocks                       6.2    [ 3.0%]
   FE               Frontend_Bound.Fetch_Bandwidth.MITE                   % Slots_est                    7.7  < [ 3.0%]
   FE               Frontend_Bound.Fetch_Bandwidth.DSB                    % Slots_est                    4.1  < [ 3.0%]
   BE/Mem           Backend_Bound.Memory_Bound.L1_Bound                   % Stalls                      12.9  < [ 3.0%]
   BE/Mem           Backend_Bound.Memory_Bound.L2_Bound                   % Stalls                       0.5  < [ 3.0%]
   BE/Mem           Backend_Bound.Memory_Bound.L3_Bound                   % Stalls                       1.4  < [ 3.0%]
   BE/Mem           Backend_Bound.Memory_Bound.DRAM_Bound                 % Stalls                       0.5  < [ 3.0%]
   BE/Mem           Backend_Bound.Memory_Bound.Store_Bound                % Stalls                       4.8  < [ 3.0%]
   BE/Core          Backend_Bound.Core_Bound.Divider                      % Clocks                       0.0  < [ 3.0%]
   BE/Core          Backend_Bound.Core_Bound.Ports_Utilization            % Clocks                      22.1  < [ 3.0%]
   RET              Retiring.Light_Operations.FP_Arith                    % Uops                         0.4  < [ 3.0%]
   RET              Retiring.Light_Operations.Memory_Operations           % Slots                       15.6  < [ 3.0%]
   RET              Retiring.Light_Operations.Fused_Instructions          % Slots                        5.1  < [ 3.0%]
   RET              Retiring.Light_Operations.Non_Fused_Branches          % Slots                        3.0  < [ 3.0%]
   RET              Retiring.Light_Operations.Other_Light_Ops             % Slots                       15.4  < [ 3.0%]
   RET              Retiring.Heavy_Operations.Few_Uops_Instructions       % Slots                        3.3  < [ 3.0%]
   RET              Retiring.Heavy_Operations.Microcode_Sequencer         % Slots                        0.1  < [ 3.0%]
   MUX                                                                    %                              3.00
   Run toplev --describe Branch_Mispredicts^ to get more information on bottleneck
   Add --run-sample to find locations
   Add --nodes '!+Branch_Mispredicts*/3,+MUX' for breakdown.
   ```
   
   
   We are still branch-misprediction bound, but much less so.
   
   Let's see what was mispredicted:
   ```bash
   $ perf record -e '{br_misp_retired.all_branches,br_inst_retired.all_branches}:pp' -g target/profiling/deps/row_format-7489cf224545c7d6 --bench --warm-up-time 1 --noplot
   $ perf report -g --group --percent-limit 0.01 --no-children
   ```
   
   We now see this:
   ```bash
   Samples: 43K of events 'anon group { br_misp_retired.all_branches, br_inst_retired.all_branches }', Event count (approx.): 17292789079
             Overhead  Command          Shared Object                Symbol
   +   97.68%  23.41%  row_format-7489  libc.so.6                    [.] __memmove_evex_unaligned_erms
   +    1.80%   0.71%  row_format-7489  row_format-7489cf224545c7d6  [.] core::slice::sort::shared::smallsort::small_sort_network
   +    0.21%  33.24%  row_format-7489  row_format-7489cf224545c7d6  [.] arrow_row::variable::fast_encode_bytes
   +    0.11%   0.41%  row_format-7489  row_format-7489cf224545c7d6  [.] core::slice::sort::unstable::quicksort::quicksort
   +    0.11%   0.26%  row_format-7489  row_format-7489cf224545c7d6  [.] <alloc::string::String as core::iter::traits::collect::Extend<char>>::extend
   +    0.05%  24.94%  row_format-7489  row_format-7489cf224545c7d6  [.] arrow_row::row_lengths
   +    0.05%   0.30%  row_format-7489  row_format-7489cf224545c7d6  [.] criterion::stats::univariate::mixed::bootstrap
   ```
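   Most of the mispredictions retire inside `__memmove_evex_unaligned_erms`, i.e. copies whose length changes from row to row. As a hedged illustration (a hypothetical, simplified sketch, NOT the actual arrow-row code), this is roughly why a mini-block encode with a data-dependent copy length is hard on the branch predictor:
   
   ```rust
   // Hypothetical sketch of a mini-block variable-length encoding:
   // a value shorter than the 32-byte block is copied and zero-padded,
   // and the number of occupied bytes is written after the block.
   const BLOCK_SIZE: usize = 32;
   
   fn encode_mini_block(out: &mut [u8; BLOCK_SIZE + 1], value: &[u8]) {
       debug_assert!(value.len() < BLOCK_SIZE);
       out.fill(0);
       // Data-dependent copy length: the size branches inside memmove
       // flip on every element when lengths vary between 1 and 31.
       out[..value.len()].copy_from_slice(value);
       out[BLOCK_SIZE] = value.len() as u8; // occupied-byte marker
   }
   
   fn main() {
       let mut out = [0u8; BLOCK_SIZE + 1];
       encode_mini_block(&mut out, b"hello");
       assert_eq!(&out[..5], b"hello");
       assert_eq!(out[BLOCK_SIZE], 5);
   }
   ```
   
   A branch-free alternative would unconditionally copy a fixed `BLOCK_SIZE` bytes from a padded source, trading the data-dependent `memcpy` for a constant-size one.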
   
   
   Running stats:
   ```bash
   $ perf stat  -e br_misp_retired.all_branches,int_misc.recovery_cycles,cycles target/profiling/deps/row_format-7489cf224545c7d6 --bench --warm-up-time 1 --noplot
   Benchmarking row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...: Collecting 100 samples in estimated 10.108
   row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...
                           time:   [5.5396 ms 5.5437 ms 5.5480 ms]
                           thrpt:  [9.0123 Kelem/s 9.0192 Kelem/s 9.0259 Kelem/s]
                    change:
                           time:   [−1.2486% −1.1486% −1.0503%] (p = 0.00 < 0.05)
                           thrpt:  [+1.0615% +1.1620% +1.2644%]
                           Performance has improved.
   
   
    Performance counter stats for 'target/profiling/deps/row_format-7489cf224545c7d6 --bench --warm-up-time 1 --noplot':
   
            757542529      br_misp_retired.all_branches
           4241897496      int_misc.recovery_cycles
          46949042964      cycles
   
         12.065393734 seconds time elapsed
   
         12.043439000 seconds user
          0.020005000 seconds sys
   
   ```
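   A quick back-of-envelope on those counters (using the numbers above): the share of total cycles spent recovering from mispredictions, and the average recovery cost per mispredicted branch:
   
   ```rust
   // Derived from the `perf stat` counters above.
   fn main() {
       let mispredicts: f64 = 757_542_529.0;
       let recovery_cycles: f64 = 4_241_897_496.0;
       let cycles: f64 = 46_949_042_964.0;
   
       // ~9% of all cycles go to misprediction recovery
       println!("recovery share: {:.1}%", recovery_cycles / cycles * 100.0);
       // ~5.6 recovery cycles per mispredicted branch
       println!("per mispredict: {:.1} cycles", recovery_cycles / mispredicts);
   }
   ```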
   
   This is slower than before in terms of cycles, but only because we are now running more iterations.
   
   If we run without any iterations (no `--bench`, so it should only run once):
   ```bash
   $ perf stat  -e br_misp_retired.all_branches,int_misc.recovery_cycles,cycles target/profiling/deps/row_format-7489cf224545c7d6 --noplot
   Testing row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...
   Success
   
   
    Performance counter stats for 'target/profiling/deps/row_format-7489cf224545c7d6 --noplot':
   
              1364986      br_misp_retired.all_branches
              8496676      int_misc.recovery_cycles
            229098727      cycles
   
          0.060764710 seconds time elapsed
   
          0.050737000 seconds user
          0.010107000 seconds sys
   ```
   
   And for main:
   ```bash
   $ perf stat  -e br_misp_retired.all_branches,int_misc.recovery_cycles,cycles target/profiling/deps/row_format-7489cf224545c7d6 --noplot
   Testing row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...
   Success
   
   
    Performance counter stats for 'target/profiling/deps/row_format-7489cf224545c7d6 --noplot':
   
              1576148      br_misp_retired.all_branches
              9737113      int_misc.recovery_cycles
            248169052      cycles
   
          0.066125641 seconds time elapsed
   
          0.066172000 seconds user
          0.000000000 seconds sys
   ```
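   Comparing the two single-run measurements (this change vs. main), the per-counter reductions work out roughly as follows:
   
   ```rust
   // Percent reduction of each counter: single-run `perf stat`,
   // this change vs. main (numbers copied from the outputs above).
   fn reduction(new: f64, old: f64) -> f64 {
       (1.0 - new / old) * 100.0
   }
   
   fn main() {
       // ~13.4% fewer mispredicted branches
       println!("mispredicts:     -{:.1}%", reduction(1_364_986.0, 1_576_148.0));
       // ~12.7% fewer recovery cycles
       println!("recovery cycles: -{:.1}%", reduction(8_496_676.0, 9_737_113.0));
       // ~7.7% fewer total cycles
       println!("cycles:          -{:.1}%", reduction(229_098_727.0, 248_169_052.0));
   }
   ```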
   
   We use fewer cycles. You can verify that it actually ran our function by using `perf record`.
   
   If we run the default stats:
   
   For main:
   ```bash
   $ perf stat target/profiling/deps/row_format-7489cf224545c7d6 --noplot
   Testing row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...
   Success
   
   
    Performance counter stats for 'target/profiling/deps/row_format-7489cf224545c7d6 --noplot':
   
                68.63 msec task-clock                       #    0.985 CPUs utilized
                    1      context-switches                 #   14.571 /sec
                    0      cpu-migrations                   #    0.000 /sec
                 4575      page-faults                      #   66.664 K/sec
            246976894      cycles                           #    3.599 GHz
            561019003      instructions                     #    2.27  insn per cycle
             92611997      branches                         #    1.349 G/sec
              1584553      branch-misses                    #    1.71% of all branches
   
          0.069667346 seconds time elapsed
   
          0.059826000 seconds user
          0.009944000 seconds sys
   ```
   
   With this improvement:
   ```bash
   $ perf stat target/profiling/deps/row_format-7489cf224545c7d6 --noplot
   Testing row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...
   Success
   
   
    Performance counter stats for 'target/profiling/deps/row_format-7489cf224545c7d6 --noplot':
   
                63.59 msec task-clock                       #    0.983 CPUs utilized
                    1      context-switches                 #   15.726 /sec
                    0      cpu-migrations                   #    0.000 /sec
                 4069      page-faults                      #   63.987 K/sec
            228843878      cycles                           #    3.599 GHz
            542384326      instructions                     #    2.37  insn per cycle
             88943721      branches                         #    1.399 G/sec
              1366042      branch-misses                    #    1.54% of all branches
   
          0.064687865 seconds time elapsed
   
          0.064722000 seconds user
          0.000000000 seconds sys
   ```
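   The derived rates `perf` prints can be reproduced from the raw counters above, confirming the movement in both branch-miss rate and IPC:
   
   ```rust
   // Recomputing the derived rates from the raw default `perf stat`
   // counters above (main vs. this change).
   fn main() {
       // main: IPC ~2.27, branch-miss rate ~1.71%
       let main_ipc = 561_019_003.0_f64 / 246_976_894.0;
       let main_miss = 1_584_553.0_f64 / 92_611_997.0 * 100.0;
       // this change: IPC ~2.37, branch-miss rate ~1.54%
       let new_ipc = 542_384_326.0_f64 / 228_843_878.0;
       let new_miss = 1_366_042.0_f64 / 88_943_721.0 * 100.0;
   
       println!("IPC:       {main_ipc:.2} -> {new_ipc:.2}");
       println!("miss rate: {main_miss:.2}% -> {new_miss:.2}%");
   }
   ```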
   
   Fewer branch misses and more instructions per cycle, so this is a win. The extra time and cycles in the benchmark run were only due to running more iterations.
   
   More reason to avoid sort guarantees.
   
   
   </p>
   </details>
   
   

