rluvaton commented on issue #9083:
URL: https://github.com/apache/arrow-rs/issues/9083#issuecomment-3703789605
## Bottleneck 2: Branch misprediction for 50 non-nullable string columns with string lengths in 1..=31
> Note: Using non-empty strings to avoid the empty-string special case, giving the branch predictor otherwise favorable conditions.
> Lengths stay below 32 bytes (less than the block size), so encoding always takes the mini-block path,
> and there are no nulls, to isolate a single code path.
<details><summary>Benchmark code:</summary>
<p>
```rust
use std::hint;
use std::sync::Arc;
use std::time::Duration;

use arrow::array::ArrayRef;
use arrow::row::{RowConverter, SortField};
use criterion::{criterion_group, criterion_main, Criterion};

fn do_bench(c: &mut Criterion, name: &str, cols: Vec<ArrayRef>) {
    let fields: Vec<_> = cols
        .iter()
        .map(|x| SortField::new(x.data_type().clone()))
        .collect();

    let mut group = c.benchmark_group("row_format");
    group.throughput(criterion::Throughput::Elements(cols.len() as u64));
    group.bench_function(&format!("convert_columns {name}"), |b| {
        b.iter(|| {
            let converter = RowConverter::new(fields.clone()).unwrap();
            hint::black_box(converter.convert_columns(&cols).unwrap())
        });
    });
}

struct BinaryGenOptions {
    num_cols: usize,
    batch_size: usize,
    min_len: usize,
    max_len: usize,
}

fn run_bench_on_multi_column_with_same_string_data_type_with_no_nulls(
    c: &mut Criterion,
    opts: BinaryGenOptions,
) {
    let mut seed = 0;
    let mut cols: Vec<ArrayRef> = vec![];
    for _ in 0..opts.num_cols {
        seed += 1;
        // Generator helper from the benchmark's test utilities
        cols.push(
            Arc::new(create_string_array_with_len_range_and_prefix_and_seed::<i32>(
                opts.batch_size, 0.0, opts.min_len, opts.max_len, "", seed,
            )) as ArrayRef,
        );
    }

    do_bench(
        c,
        format!(
            "{} columns of StringArray str range: {}..={} batch_size: {} with no nulls",
            opts.num_cols, opts.min_len, opts.max_len, opts.batch_size
        )
        .as_str(),
        cols,
    );
}

fn row_bench(c: &mut Criterion) {
    run_bench_on_multi_column_with_same_string_data_type_with_no_nulls(
        c,
        BinaryGenOptions {
            batch_size: 8192,
            num_cols: 50,
            min_len: 1,
            max_len: 31,
        },
    );
}

criterion_group! {
    name = benches;
    config = Criterion::default().measurement_time(Duration::from_secs(10));
    targets = row_bench
}
criterion_main!(benches);
```
</p>
</details>
Results on main:
```bash
$ cargo bench --features=test_utils --bench row_format --profile=profiling -- --save-baseline=main
   Finished `profiling` profile [optimized + debuginfo] target(s) in 4.74s
     Running benches/row_format.rs (target/profiling/deps/row_format-7489cf224545c7d6)
Benchmarking row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...: Collecting 100 samples in estimated 10.012
row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...
                        time:   [8.3129 ms 8.3163 ms 8.3200 ms]
                        thrpt:  [6.0096 Kelem/s 6.0123 Kelem/s 6.0148 Kelem/s]
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe
```
> [!NOTE]
> TL;DR: the problem here is CPU branch misprediction: the predictor cannot guess the number of mini blocks in each value, because the string lengths are uniformly random.
>
> We can avoid mini blocks entirely. They were introduced to reduce space amplification for small strings;
> instead, we can encode each non-empty string as `<len><slice>`, where the length is stored in the smallest
> integer type that can represent it (explained in more detail below).
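A minimal sketch of that `<len><slice>` layout for the common one-byte-length case (the function name and sentinel value here are hypothetical; the real implementation appears later in this comment):

```rust
// Hypothetical sketch: [sentinel | length as u8 | raw bytes...] for a
// non-empty string whose length fits in one byte.
fn encode_short(val: &[u8]) -> Vec<u8> {
    assert!(!val.is_empty() && val.len() <= u8::MAX as usize);
    let mut out = Vec::with_capacity(2 + val.len());
    out.push(0x02); // non-empty sentinel (value assumed for illustration)
    out.push(val.len() as u8); // length prefix in the smallest type that fits
    out.extend_from_slice(val); // payload copied as-is, no block padding
    out
}
```

A 3-byte string therefore costs 5 bytes instead of a padded mini block, and encoding involves no data-dependent loop.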
<details><summary>Step by step finding the problem
(<code>Branch_Mispredicts</code>):</summary>
<details><summary>Running Top-Down analysis - Level 1</summary>
<p>
```bash
$ ~/pmu-tools/toplev.py --core S0-C0 -l1 --single-thread -v --no-desc taskset -c 0 target/profiling/deps/row_format-7489cf224545c7d6 --bench --warm-up-time 1 --noplot
Benchmarking row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...: Collecting 100 samples in estimated 10.073
row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...
                        time:   [8.3223 ms 8.3255 ms 8.3288 ms]
                        thrpt:  [6.0033 Kelem/s 6.0057 Kelem/s 6.0079 Kelem/s]
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe
# 5.01-full-perf on Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz [clx/skylake]
FE             Frontend_Bound      % Slots   23.4
BAD            Bad_Speculation     % Slots   26.5  <==
BE             Backend_Bound       % Slots   11.8  <
RET            Retiring            % Slots   38.3  <
MUX                                %        100.00
Run toplev --describe Bad_Speculation^ to get more information on bottleneck
Add --run-sample to find locations
Add --nodes '!+Bad_Speculation*/2,+MUX' for breakdown.
```
</p>
</details>
It looks like the problem is `Bad_Speculation`.
<details><summary>Running Top-Down analysis - Level 2</summary>
<p>
```bash
$ ~/pmu-tools/toplev.py --core S0-C0 -l2 --single-thread -v --no-desc taskset -c 0 target/profiling/deps/row_format-7489cf224545c7d6 --bench --warm-up-time 1 --noplot
Benchmarking row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...: Collecting 100 samples in estimated 10.133
row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...
                        time:   [8.3452 ms 8.3481 ms 8.3513 ms]
                        thrpt:  [5.9871 Kelem/s 5.9894 Kelem/s 5.9914 Kelem/s]
                 change:
                        time:   [+0.2160% +0.2718% +0.3277%] (p = 0.00 < 0.05)
                        thrpt:  [−0.3267% −0.2710% −0.2156%]
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe
# 5.01-full-perf on Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz [clx/skylake]
FE             Frontend_Bound                      % Slots   23.0  [14.0%]
BAD            Bad_Speculation                     % Slots   25.9  [14.0%]
BE             Backend_Bound                       % Slots   12.0  < [14.0%]
RET            Retiring                            % Slots   39.1  < [14.0%]
FE             Frontend_Bound.Fetch_Latency        % Slots    9.2  < [14.0%]
FE             Frontend_Bound.Fetch_Bandwidth      % Slots   13.8  < [14.0%]
BAD            Bad_Speculation.Branch_Mispredicts  % Slots   25.9  [14.0%]<==
BAD            Bad_Speculation.Machine_Clears      % Slots    0.0  < [14.0%]
BE/Mem         Backend_Bound.Memory_Bound          % Slots    4.0  < [14.0%]
BE/Core        Backend_Bound.Core_Bound            % Slots    8.1  < [14.0%]
RET            Retiring.Light_Operations           % Slots   36.9  < [14.0%]
RET            Retiring.Heavy_Operations           % Slots    2.2  < [14.0%]
MUX                                                %         14.00
Run toplev --describe Branch_Mispredicts^ to get more information on bottleneck
Add --run-sample to find locations
Add --nodes '!+Branch_Mispredicts*/3,+MUX' for breakdown.
```
</p>
</details>
It is `Branch_Mispredicts`.
<details><summary>Running Top-Down analysis - Level 3</summary>
<p>
```bash
$ ~/pmu-tools/toplev.py --core S0-C0 -l3 --single-thread -v --no-desc taskset -c 0 target/profiling/deps/row_format-7489cf224545c7d6 --bench --warm-up-time 1 --noplot
Benchmarking row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...: Collecting 100 samples in estimated 10.123
row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...
                        time:   [8.3599 ms 8.3622 ms 8.3647 ms]
                        thrpt:  [5.9775 Kelem/s 5.9793 Kelem/s 5.9809 Kelem/s]
                 change:
                        time:   [+0.1217% +0.1692% +0.2131%] (p = 0.00 < 0.05)
                        thrpt:  [−0.2126% −0.1689% −0.1215%]
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild
# 5.01-full-perf on Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz [clx/skylake]
FE             Frontend_Bound                                        % Slots        23.0  [ 3.0%]
BAD            Bad_Speculation.Branch_Mispredicts.Other_Mispredicts  % Slots         0.3  < [ 3.0%]
BAD            Bad_Speculation.Machine_Clears.Other_Nukes            % Slots         0.0  <
BE/Core        Backend_Bound.Core_Bound.Serializing_Operation        % Clocks        0.0  < [ 3.0%]
BAD            Bad_Speculation                                       % Slots        26.0  [ 3.0%]
BE             Backend_Bound                                         % Slots        12.1  < [ 3.0%]
RET            Retiring                                              % Slots        38.8  < [ 3.0%]
FE             Frontend_Bound.Fetch_Latency                          % Slots         9.1  < [ 3.0%]
FE             Frontend_Bound.Fetch_Bandwidth                        % Slots        13.8  < [ 3.0%]
BAD            Bad_Speculation.Branch_Mispredicts                    % Slots        26.0  [ 3.0%]<==
BAD            Bad_Speculation.Machine_Clears                        % Slots         0.0  < [ 3.0%]
BE/Mem         Backend_Bound.Memory_Bound                            % Slots         4.0  < [ 3.0%]
BE/Core        Backend_Bound.Core_Bound                              % Slots         8.1  < [ 3.0%]
RET            Retiring.Light_Operations                             % Slots        36.6  < [ 3.0%]
RET            Retiring.Heavy_Operations                             % Slots         2.3  < [ 3.0%]
FE             Frontend_Bound.Fetch_Latency.ICache_Misses            % Clocks        0.1  < [ 3.0%]
FE             Frontend_Bound.Fetch_Latency.ITLB_Misses              % Clocks        0.0  < [ 3.0%]
FE             Frontend_Bound.Fetch_Latency.Branch_Resteers          % Clocks        6.4  < [ 3.0%]
FE             Frontend_Bound.Fetch_Latency.MS_Switches              % Clocks_est    0.3  < [ 3.0%]
FE             Frontend_Bound.Fetch_Latency.LCP                      % Clocks        0.0  < [ 3.0%]
FE             Frontend_Bound.Fetch_Latency.DSB_Switches             % Clocks        8.3  < [ 3.0%]
FE             Frontend_Bound.Fetch_Bandwidth.MITE                   % Slots_est     9.3  < [ 3.0%]
FE             Frontend_Bound.Fetch_Bandwidth.DSB                    % Slots_est     6.9  < [ 3.0%]
BE/Mem         Backend_Bound.Memory_Bound.L1_Bound                   % Stalls        9.1  < [ 3.0%]
BE/Mem         Backend_Bound.Memory_Bound.L2_Bound                   % Stalls        0.4  < [ 3.0%]
BE/Mem         Backend_Bound.Memory_Bound.L3_Bound                   % Stalls        1.3  < [ 3.0%]
BE/Mem         Backend_Bound.Memory_Bound.DRAM_Bound                 % Stalls        0.7  < [ 3.0%]
BE/Mem         Backend_Bound.Memory_Bound.Store_Bound                % Stalls        4.0  < [ 3.0%]
BE/Core        Backend_Bound.Core_Bound.Divider                      % Clocks        0.0  < [ 3.0%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization            % Clocks       15.6  < [ 3.0%]
RET            Retiring.Light_Operations.FP_Arith                    % Uops          0.5  < [ 3.0%]
RET            Retiring.Light_Operations.Memory_Operations           % Slots        13.9  < [ 3.0%]
RET            Retiring.Light_Operations.Fused_Instructions          % Slots         5.9  < [ 3.0%]
RET            Retiring.Light_Operations.Non_Fused_Branches          % Slots         1.8  < [ 3.0%]
RET            Retiring.Light_Operations.Other_Light_Ops             % Slots        14.5  < [ 3.0%]
RET            Retiring.Heavy_Operations.Few_Uops_Instructions       % Slots         2.1  < [ 3.0%]
RET            Retiring.Heavy_Operations.Microcode_Sequencer         % Slots         0.1  < [ 3.0%]
MUX                                                                  %              3.00
Run toplev --describe Branch_Mispredicts^ to get more information on bottleneck
Add --run-sample to find locations
Add --nodes '!+Branch_Mispredicts*/3,+MUX' for breakdown.
```
</p>
</details>
<details><summary>Recording mispredicted and total branches with perf</summary>
<p>
Running `perf record` to see where most of the mispredictions occur, sampling (mispredictions, `all_branches`) as a group:
```bash
$ perf record -e '{br_misp_retired.all_branches,br_inst_retired.all_branches}:pp' -g target/profiling/deps/row_format-7489cf224545c7d6 --bench --warm-up-time 1 --noplot
```
Looking at the report:
```bash
$ perf report -g --group --percent-limit 0.01 --no-children
```
</p>
</details>
The most expensive symbol in terms of mispredictions is `arrow_row::variable::encode_one`:
```bash
+   67.00%  43.93%  row_format-7489  row_format-7489cf224545c7d6  [.] arrow_row::variable::encode_one
```
<details><summary>annotated info</summary>
<p>
```asm
Percent        │ arrow_row::variable::encode_blocks:
               │   mov    %ecx,%r14d
               │   and    $0x7,%r14d
               │ <core::slice::iter::ChunksExact<T> as core::iter::traits::iterator::Iterator>::size_hint:
               │   mov    %rcx,%rax
               │   shr    $0x3,%rax
               │ core::cmp::Ord::min:
               │   cmp    %rax,%r9
               │   cmovb  %r9,%rax
               │ <core::iter::adapters::zip::Zip<A,B> as core::iter::adapters::zip::ZipImpl<A,B>>::next:
  3.08   3.11  │   test   %rax,%rax
 26.75   0.78  │ ↓ je     aa
               │ arrow_row::variable::encode_blocks:
               │ let remainder = chunks.remainder();
               │ for (input, output) in chunks.clone().zip(to_write.chunks_exact_mut(SIZE + 1)) {
               │     let input: &[u8; SIZE] = input.try_into().unwrap();
               │     let out_block: &mut [u8; SIZE] = (&mut output[..SIZE]).try_into().unwrap();
               │
               │     *out_block = *input;
               │   mov    (%rdx),%r9
               │   mov    %r9,0x1(%rdi)
               │
               │     // Indicate that there are further blocks to follow
               │     output[SIZE] = BLOCK_CONTINUATION;
               │   movb   $0xff,0x9(%rdi)
               │ <core::iter::adapters::zip::Zip<A,B> as core::iter::adapters::zip::ZipImpl<A,B>>::next:
  6.92   5.40  │ ┌──cmp    $0x1,%rax
 26.92   2.92  │ ├──je     aa
               │ │ arrow_row::variable::encode_blocks:
               │ │ *out_block = *input;
               │ │   mov    0x8(%rdx),%r9
               │ │   mov    %r9,0xa(%rdi)
               │ │ output[SIZE] = BLOCK_CONTINUATION;
               │ │   movb   $0xff,0x12(%rdi)
               │ │ <core::iter::adapters::zip::Zip<A,B> as core::iter::adapters::zip::ZipImpl<A,B>>::next:
 13.33   3.66  │ │   cmp    $0x2,%rax
 16.83   3.57  │ │ ↓ je     aa
               │ │ arrow_row::variable::encode_blocks:
               │ │ *out_block = *input;
               │ │   mov    0x10(%rdx),%r9
               │ │   mov    %r9,0x13(%rdi)
               │ │ output[SIZE] = BLOCK_CONTINUATION;
               │ │   movb   $0xff,0x1b(%rdi)
               │ │ <core::iter::adapters::zip::Zip<A,B> as core::iter::adapters::zip::ZipImpl<A,B>>::next:
               │ │   cmp    $0x3,%rax
  0.00   5.24  │ │ ↓ je     aa
               │ │ arrow_row::variable::encode_blocks:
               │ │ *out_block = *input;
               │ │   mov    0x18(%rdx),%rax
               │ │   mov    %rax,0x1c(%rdi)
               │ │ output[SIZE] = BLOCK_CONTINUATION;
               │ │   movb   $0xff,0x24(%rdi)
               │ │ arrow_row::variable::encode_one:
               │ │ 1 + encode_blocks::<MINI_BLOCK_SIZE>(&mut out[1..], val)
```
</p>
</details>
</details>
We see that the mispredictions are on the block iterator in the `variable::encode_blocks` function:
```rust
for (input, output) in chunks.clone().zip(to_write.chunks_exact_mut(SIZE + 1)) {
```
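For intuition on why this loop mispredicts, a sketch under the assumption that `MINI_BLOCK_SIZE` is 8 (the 9-byte stride between the `movb` offsets in the annotation above suggests 8 data bytes plus 1 continuation byte per mini block): the trip count equals the number of mini blocks in the value.

```rust
/// Number of mini blocks a value of `len` bytes occupies,
/// assuming a mini-block size of 8 (hypothetical constant).
fn mini_blocks(len: usize) -> usize {
    len.div_ceil(8)
}
```

For lengths uniform in `1..=31` this yields 1, 2, 3, or 4 mini blocks with near-equal probability, so the `je` branches that exit the unrolled loop are essentially coin flips the predictor cannot learn.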
We can change the code to avoid the blocks entirely. Blocks exist so that byte slices of different lengths can be stored without a length prefix while still sorting correctly; instead, we can encode the length at the start (in the smallest integer type possible, to avoid wasting space on small strings) and then write the slice as-is:
<details><summary>updated code:</summary>
<p>
```rust
/// Faster alternative to `encode_blocks` that writes the length (in the
/// smallest integer type that fits) followed by the raw bytes
#[inline]
fn fast_encode_bytes(out: &mut [u8], val: &[u8], opts: SortOptions) -> usize {
    // Write `2_u8` to demarcate as non-empty, non-null string
    out[0] = NON_EMPTY_SENTINEL;

    // TODO - in desc should do max minus the length so the order will be
    // different (longer strings sort before shorter ones)
    let start_data_offset = {
        let len = val.len();

        match get_number_of_bits_needed_to_encode(len) {
            0 => {
                out[0] = EMPTY_SENTINEL;
                return 1;
            }
            1 => {
                out[0] |= LENGTH_TYPE_U8;

                // encode length
                let start_data_offset = 1 + size_of::<u8>();
                unsafe { out.get_unchecked_mut(1..start_data_offset) }
                    .copy_from_slice(&(len as u8).to_be_bytes());

                start_data_offset
            }
            2 => {
                out[0] |= LENGTH_TYPE_U16;

                // encode length
                let start_data_offset = 1 + size_of::<u16>();
                unsafe { out.get_unchecked_mut(1..start_data_offset) }
                    .copy_from_slice(&(len as u16).to_be_bytes());

                start_data_offset
            }
            4 => {
                out[0] |= LENGTH_TYPE_U32;

                // encode length
                let start_data_offset = 1 + size_of::<u32>();
                unsafe { out.get_unchecked_mut(1..start_data_offset) }
                    .copy_from_slice(&(len as u32).to_be_bytes());

                start_data_offset
            }
            8 => {
                out[0] |= LENGTH_TYPE_U64;

                // encode length
                let start_data_offset = 1 + size_of::<u64>();
                unsafe { out.get_unchecked_mut(1..start_data_offset) }
                    .copy_from_slice(&(len as u64).to_be_bytes());

                start_data_offset
            }
            bits_required => {
                unreachable!("invalid length {len}. number of bits required {bits_required}");
            }
        }
    };

    let len = start_data_offset + val.len();
    out[start_data_offset..len].copy_from_slice(val);
    len
}
```
</p>
</details>
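The code above relies on a helper not shown in the snippet; a minimal sketch of what it presumably does (the name is taken from the snippet, the 0/1/2/4/8 return values are inferred from the match arms, and despite "bits" in its name it appears to return a width in bytes):

```rust
/// Hypothetical sketch of the helper used by `fast_encode_bytes`:
/// returns the width in BYTES (0, 1, 2, 4, or 8) of the smallest
/// unsigned integer type that can represent `len`.
fn get_number_of_bits_needed_to_encode(len: usize) -> usize {
    match len {
        0 => 0,
        _ if len <= u8::MAX as usize => 1,
        _ if len <= u16::MAX as usize => 2,
        _ if len <= u32::MAX as usize => 4,
        _ => 8,
    }
}
```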
This gives a ~33% improvement in time:
```bash
$ cargo bench --features=test_utils --bench row_format --profile=profiling -- --baseline=main
   Finished `profiling` profile [optimized + debuginfo] target(s) in 4.69s
     Running benches/row_format.rs (target/profiling/deps/row_format-7489cf224545c7d6)
Benchmarking row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...: Collecting 100 samples in estimated 10.059
row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...
                        time:   [5.5475 ms 5.5514 ms 5.5556 ms]
                        thrpt:  [9.0000 Kelem/s 9.0067 Kelem/s 9.0131 Kelem/s]
                 change:
                        time:   [−33.301% −33.246% −33.187%] (p = 0.00 < 0.05)
                        thrpt:  [+49.672% +49.804% +49.926%]
                        Performance has improved.
```
The length type could instead be stored in the metadata at the beginning of the row, if that turns out to perform better (I doubt it).
<details><summary>Running Top-Down analysis again on the updated code:</summary>
<p>
Running toplev again at level 1:
```bash
$ ~/pmu-tools/toplev.py --core S0-C0 -l1 --single-thread -v --no-desc taskset -c 0 target/profiling/deps/row_format-7489cf224545c7d6 --bench --warm-up-time 1 --noplot
Benchmarking row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...: Collecting 100 samples in estimated 10.018
row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...
                        time:   [5.5393 ms 5.5431 ms 5.5469 ms]
                        thrpt:  [9.0140 Kelem/s 9.0203 Kelem/s 9.0264 Kelem/s]
                 change:
                        time:   [−34.268% −34.217% −34.168%] (p = 0.00 < 0.05)
                        thrpt:  [+51.901% +52.016% +52.134%]
                        Performance has improved.
# 5.01-full-perf on Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz [clx/skylake]
FE             Frontend_Bound      % Slots   20.7
BAD            Bad_Speculation     % Slots   21.5  <==
BE             Backend_Bound       % Slots   14.6  <
RET            Retiring            % Slots   43.3  <
MUX                                %        100.00
Run toplev --describe Bad_Speculation^ to get more information on bottleneck
Add --run-sample to find locations
Add --nodes '!+Bad_Speculation*/2,+MUX' for breakdown.
```
and level 2:
```bash
$ ~/pmu-tools/toplev.py --core S0-C0 -l2 --single-thread -v --no-desc taskset -c 0 target/profiling/deps/row_format-7489cf224545c7d6 --bench --warm-up-time 1 --noplot
Benchmarking row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...: Collecting 100 samples in estimated 10.082
row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...
                        time:   [5.5274 ms 5.5308 ms 5.5344 ms]
                        thrpt:  [9.0344 Kelem/s 9.0403 Kelem/s 9.0459 Kelem/s]
                 change:
                        time:   [−0.3163% −0.2212% −0.1278%] (p = 0.00 < 0.05)
                        thrpt:  [+0.1280% +0.2217% +0.3173%]
                        Change within noise threshold.
Found 18 outliers among 100 measurements (18.00%)
  18 (18.00%) high mild
# 5.01-full-perf on Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz [clx/skylake]
FE             Frontend_Bound                      % Slots   20.7  [14.0%]
BAD            Bad_Speculation                     % Slots   21.5  [14.0%]
BE             Backend_Bound                       % Slots   14.5  < [14.0%]
RET            Retiring                            % Slots   43.3  < [14.0%]
FE             Frontend_Bound.Fetch_Latency        % Slots   10.6  [14.0%]
FE             Frontend_Bound.Fetch_Bandwidth      % Slots   10.1  < [14.0%]
BAD            Bad_Speculation.Branch_Mispredicts  % Slots   21.5  [14.0%]<==
BAD            Bad_Speculation.Machine_Clears      % Slots    0.0  < [14.0%]
BE/Mem         Backend_Bound.Memory_Bound          % Slots    5.6  < [14.0%]
BE/Core        Backend_Bound.Core_Bound            % Slots    8.9  < [14.0%]
RET            Retiring.Light_Operations           % Slots   39.9  < [14.0%]
RET            Retiring.Heavy_Operations           % Slots    3.4  < [14.0%]
MUX                                                %         14.00
Run toplev --describe Branch_Mispredicts^ to get more information on bottleneck
Add --run-sample to find locations
Add --nodes '!+Branch_Mispredicts*/3,+MUX' for breakdown.
```
and level 3:
```bash
$ ~/pmu-tools/toplev.py --core S0-C0 -l3 --single-thread -v --no-desc taskset -c 0 target/profiling/deps/row_format-7489cf224545c7d6 --bench --warm-up-time 1 --noplot
Benchmarking row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...: Collecting 100 samples in estimated 10.003
row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...
                        time:   [5.5268 ms 5.5302 ms 5.5337 ms]
                        thrpt:  [9.0355 Kelem/s 9.0413 Kelem/s 9.0468 Kelem/s]
                 change:
                        time:   [−0.1050% −0.0117% +0.0828%] (p = 0.80 > 0.05)
                        thrpt:  [−0.0827% +0.0117% +0.1051%]
                        No change in performance detected.
Found 15 outliers among 100 measurements (15.00%)
  15 (15.00%) high mild
# 5.01-full-perf on Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz [clx/skylake]
FE             Frontend_Bound                                        % Slots        20.8  [ 3.0%]
BAD            Bad_Speculation.Branch_Mispredicts.Other_Mispredicts  % Slots         0.0  < [ 3.0%]
BAD            Bad_Speculation.Machine_Clears.Other_Nukes            % Slots         0.0  <
BE/Core        Backend_Bound.Core_Bound.Serializing_Operation        % Clocks        0.0  < [ 3.0%]
BAD            Bad_Speculation                                       % Slots        21.6  [ 3.0%]
BE             Backend_Bound                                         % Slots        14.7  < [ 3.0%]
RET            Retiring                                              % Slots        43.0  < [ 3.0%]
FE             Frontend_Bound.Fetch_Latency                          % Slots        10.6  [ 3.0%]
FE             Frontend_Bound.Fetch_Bandwidth                        % Slots        10.3  [ 3.0%]
BAD            Bad_Speculation.Branch_Mispredicts                    % Slots        21.6  [ 3.0%]<==
BAD            Bad_Speculation.Machine_Clears                        % Slots         0.0  < [ 3.0%]
BE/Mem         Backend_Bound.Memory_Bound                            % Slots         5.7  < [ 3.0%]
BE/Core        Backend_Bound.Core_Bound                              % Slots         9.0  < [ 3.0%]
RET            Retiring.Light_Operations                             % Slots        39.6  < [ 3.0%]
RET            Retiring.Heavy_Operations                             % Slots         3.4  < [ 3.0%]
FE             Frontend_Bound.Fetch_Latency.ICache_Misses            % Clocks        0.1  < [ 3.0%]
FE             Frontend_Bound.Fetch_Latency.ITLB_Misses              % Clocks        0.0  < [ 3.0%]
FE             Frontend_Bound.Fetch_Latency.Branch_Resteers          % Clocks        5.4  [ 3.0%]
FE             Frontend_Bound.Fetch_Latency.MS_Switches              % Clocks_est    0.3  < [ 3.0%]
FE             Frontend_Bound.Fetch_Latency.LCP                      % Clocks        0.0  [ 3.0%]
FE             Frontend_Bound.Fetch_Latency.DSB_Switches             % Clocks        6.2  [ 3.0%]
FE             Frontend_Bound.Fetch_Bandwidth.MITE                   % Slots_est     7.7  < [ 3.0%]
FE             Frontend_Bound.Fetch_Bandwidth.DSB                    % Slots_est     4.1  < [ 3.0%]
BE/Mem         Backend_Bound.Memory_Bound.L1_Bound                   % Stalls       12.9  < [ 3.0%]
BE/Mem         Backend_Bound.Memory_Bound.L2_Bound                   % Stalls        0.5  < [ 3.0%]
BE/Mem         Backend_Bound.Memory_Bound.L3_Bound                   % Stalls        1.4  < [ 3.0%]
BE/Mem         Backend_Bound.Memory_Bound.DRAM_Bound                 % Stalls        0.5  < [ 3.0%]
BE/Mem         Backend_Bound.Memory_Bound.Store_Bound                % Stalls        4.8  < [ 3.0%]
BE/Core        Backend_Bound.Core_Bound.Divider                      % Clocks        0.0  < [ 3.0%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization            % Clocks       22.1  < [ 3.0%]
RET            Retiring.Light_Operations.FP_Arith                    % Uops          0.4  < [ 3.0%]
RET            Retiring.Light_Operations.Memory_Operations           % Slots        15.6  < [ 3.0%]
RET            Retiring.Light_Operations.Fused_Instructions          % Slots         5.1  < [ 3.0%]
RET            Retiring.Light_Operations.Non_Fused_Branches          % Slots         3.0  < [ 3.0%]
RET            Retiring.Light_Operations.Other_Light_Ops             % Slots        15.4  < [ 3.0%]
RET            Retiring.Heavy_Operations.Few_Uops_Instructions       % Slots         3.3  < [ 3.0%]
RET            Retiring.Heavy_Operations.Microcode_Sequencer         % Slots         0.1  < [ 3.0%]
MUX                                                                  %              3.00
Run toplev --describe Branch_Mispredicts^ to get more information on bottleneck
Add --run-sample to find locations
Add --nodes '!+Branch_Mispredicts*/3,+MUX' for breakdown.
```
We are still branch-misprediction bound, but much less so. Let's see what is mispredicted now:
```bash
$ perf record -e '{br_misp_retired.all_branches,br_inst_retired.all_branches}:pp' -g target/profiling/deps/row_format-7489cf224545c7d6 --bench --warm-up-time 1 --noplot
$ perf report -g --group --percent-limit 0.01 --no-children
```
We now see this:
```bash
Samples: 43K of events 'anon group { br_misp_retired.all_branches, br_inst_retired.all_branches }', Event count (approx.): 17292789079
  Overhead         Command          Shared Object                Symbol
+ 97.68%  23.41%   row_format-7489  libc.so.6                    [.] __memmove_evex_unaligned_erms
+  1.80%   0.71%   row_format-7489  row_format-7489cf224545c7d6  [.] core::slice::sort::shared::smallsort::small_sort_network
+  0.21%  33.24%   row_format-7489  row_format-7489cf224545c7d6  [.] arrow_row::variable::fast_encode_bytes
+  0.11%   0.41%   row_format-7489  row_format-7489cf224545c7d6  [.] core::slice::sort::unstable::quicksort::quicksort
+  0.11%   0.26%   row_format-7489  row_format-7489cf224545c7d6  [.] <alloc::string::String as core::iter::traits::collect::Extend<char>>::extend
+  0.05%  24.94%   row_format-7489  row_format-7489cf224545c7d6  [.] arrow_row::row_lengths
+  0.05%   0.30%   row_format-7489  row_format-7489cf224545c7d6  [.] criterion::stats::univariate::mixed::bootstrap
```
Running stats:
```bash
$ perf stat -e br_misp_retired.all_branches,int_misc.recovery_cycles,cycles target/profiling/deps/row_format-7489cf224545c7d6 --bench --warm-up-time 1 --noplot
Benchmarking row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...: Collecting 100 samples in estimated 10.108
row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...
                        time:   [5.5396 ms 5.5437 ms 5.5480 ms]
                        thrpt:  [9.0123 Kelem/s 9.0192 Kelem/s 9.0259 Kelem/s]
                 change:
                        time:   [−1.2486% −1.1486% −1.0503%] (p = 0.00 < 0.05)
                        thrpt:  [+1.0615% +1.1620% +1.2644%]
                        Performance has improved.

 Performance counter stats for 'target/profiling/deps/row_format-7489cf224545c7d6 --bench --warm-up-time 1 --noplot':

         757542529      br_misp_retired.all_branches
        4241897496      int_misc.recovery_cycles
       46949042964      cycles

      12.065393734 seconds time elapsed

      12.043439000 seconds user
       0.020005000 seconds sys
```
This shows more total cycles than before, but only because Criterion now runs more iterations. If we run without iterations (no `--bench`, so it runs only once):
```bash
$ perf stat -e br_misp_retired.all_branches,int_misc.recovery_cycles,cycles target/profiling/deps/row_format-7489cf224545c7d6 --noplot
Testing row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...
Success

 Performance counter stats for 'target/profiling/deps/row_format-7489cf224545c7d6 --noplot':

           1364986      br_misp_retired.all_branches
           8496676      int_misc.recovery_cycles
         229098727      cycles

       0.060764710 seconds time elapsed

       0.050737000 seconds user
       0.010107000 seconds sys
```
and for main:
```bash
$ perf stat -e br_misp_retired.all_branches,int_misc.recovery_cycles,cycles target/profiling/deps/row_format-7489cf224545c7d6 --noplot
Testing row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...
Success

 Performance counter stats for 'target/profiling/deps/row_format-7489cf224545c7d6 --noplot':

           1576148      br_misp_retired.all_branches
           9737113      int_misc.recovery_cycles
         248169052      cycles

       0.066125641 seconds time elapsed

       0.066172000 seconds user
       0.000000000 seconds sys
```
We use fewer cycles (you can verify that this run actually exercises our function by using `perf record`).
Running the default `perf stat` counters, first on main:
```bash
$ perf stat target/profiling/deps/row_format-7489cf224545c7d6 --noplot
Testing row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...
Success

 Performance counter stats for 'target/profiling/deps/row_format-7489cf224545c7d6 --noplot':

             68.63 msec task-clock                #    0.985 CPUs utilized
                 1      context-switches          #   14.571 /sec
                 0      cpu-migrations            #    0.000 /sec
              4575      page-faults               #   66.664 K/sec
         246976894      cycles                    #    3.599 GHz
         561019003      instructions              #    2.27  insn per cycle
          92611997      branches                  #    1.349 G/sec
           1584553      branch-misses             #    1.71% of all branches

       0.069667346 seconds time elapsed

       0.059826000 seconds user
       0.009944000 seconds sys
```
with this improvement:
```bash
$ perf stat target/profiling/deps/row_format-7489cf224545c7d6 --noplot
Testing row_format/convert_columns 50 columns of StringArray str range: 1..=31 batch_size: 8192 with no null...
Success

 Performance counter stats for 'target/profiling/deps/row_format-7489cf224545c7d6 --noplot':

             63.59 msec task-clock                #    0.983 CPUs utilized
                 1      context-switches          #   15.726 /sec
                 0      cpu-migrations            #    0.000 /sec
              4069      page-faults               #   63.987 K/sec
         228843878      cycles                    #    3.599 GHz
         542384326      instructions              #    2.37  insn per cycle
          88943721      branches                  #    1.399 G/sec
           1366042      branch-misses             #    1.54% of all branches

       0.064687865 seconds time elapsed

       0.064722000 seconds user
       0.000000000 seconds sys
```
Fewer branch misses and more instructions per cycle, so this is a win; the extra time and cycles in the `--bench` runs were just more iterations.
It is also one more reason to avoid the sort-order guarantees when they are not needed.
</p>
</details>