On Thu, 25 Jun 2020 15:35:02 -0400 (EDT)
Mathieu Desnoyers <[email protected]> wrote:
> >> So the reservation is not "just" an add instruction, it's actually an
> >> xadd on x86. Is that really faster than a cmpxchg ?
> >
> > I believe the answer is still yes. But I can run some benchmarks to
> > make sure.
>
> This would be interesting to see, because if xadd and cmpxchg have
> similar overhead, then going for a cmpxchg-loop for the space
> reservation could vastly decrease the overall complexity of this
> timestamp+space reservation algorithm.
So I tested this on three different machines to see how it compared. I
applied this patch:
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index c2c0d25ea004..8a7d62375eb3 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -3161,7 +3161,19 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
/*B*/ rb_time_set(&cpu_buffer->before_stamp, info->ts);
+#if 0
/*C*/ write = local_add_return(info->length, &tail_page->write);
+#else
+ {
+ unsigned long expect, result;
+
+ do {
+ expect = local_read(&tail_page->write);
+ write = expect + info->length;
+ result = local_cmpxchg(&tail_page->write, expect,
write);
+ } while (expect != result);
+ }
+#endif
/* set write to only the index of the write */
write &= RB_WRITE_MASK;
And tested before and after the results.
To test, I used my fedora 31 kernel config, and enabled
CONFIG_TRACEPOINT_BENCHMARK.
When that trace event is enabled, a kthread is kicked off that runs in
a loop (see kernel/trace/trace_benchmark.c) that basically does:
local_irq_disable();
start = trace_clock_local();
trace_benchmark_event();
end = trace_clock_local();
local_irq_enable();
It takes the delta of end - start. As the first time through is cold
cache it records it separately, and after that it creates a string that
lists the delta of the last event recorded (as it can't make the delta
before it records the end time stamp). The first time stamp (the cold
cache one), the max, min, running average. It also records a running
standard deviation. All numbers are in nanoseconds.
The test was simply doing:
# trace-cmd record -e benchmark:benchmark_event -o
trace-tp-bench-{inc|cmpxchg}.dat sleep 3
# trace-cmd report trace-tp-bench-{inc|cmpxchg}.dat | tail
The reason for the tail, is to let the machine settle, as x86 has all
sorts of crazy optimizations there's a lot of noise in the beginning.
Anyway, here's the results of my three machines, and feel free to test
this out yourself.
Going from my fastest to slowest machines:
Machine 1:
model name : Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
(4 core / 8 logical)
# trace-cmd report trace-tp-bench-inc.dat |tail
event_benchmark-2072 [003] 2234.664661 benchmark_event: last=92
first=2794 max=11707 min=81 avg=93 std=49 std^2=2469
event_benchmark-2072 [003] 2234.664662 benchmark_event: last=92
first=2794 max=11707 min=81 avg=93 std=49 std^2=2469
event_benchmark-2072 [003] 2234.664662 benchmark_event: last=88
first=2794 max=11707 min=81 avg=93 std=49 std^2=2469
event_benchmark-2072 [003] 2234.664663 benchmark_event: last=85
first=2794 max=11707 min=81 avg=93 std=49 std^2=2469
event_benchmark-2072 [003] 2234.664663 benchmark_event: last=89
first=2794 max=11707 min=81 avg=93 std=49 std^2=2469
event_benchmark-2072 [003] 2234.664664 benchmark_event: last=86
first=2794 max=11707 min=81 avg=93 std=49 std^2=2469
event_benchmark-2072 [003] 2234.664664 benchmark_event: last=88
first=2794 max=11707 min=81 avg=93 std=49 std^2=2469
event_benchmark-2072 [003] 2234.664665 benchmark_event: last=171
first=2794 max=11707 min=81 avg=93 std=49 std^2=2469
event_benchmark-2072 [003] 2234.664665 benchmark_event: last=88
first=2794 max=11707 min=81 avg=93 std=49 std^2=2469
event_benchmark-2072 [003] 2234.664666 benchmark_event: last=87
first=2794 max=11707 min=81 avg=93 std=49 std^2=2469
# trace-cmd report trace-tp-bench-cmpxchg.dat |tail
event_benchmark-1924 [000] 499.515711 benchmark_event: last=309
first=2955 max=10636 min=81 avg=93 std=73 std^2=5367
event_benchmark-1924 [000] 499.515711 benchmark_event: last=96
first=2955 max=10636 min=81 avg=93 std=73 std^2=5367
event_benchmark-1924 [000] 499.515712 benchmark_event: last=189
first=2955 max=10636 min=81 avg=93 std=73 std^2=5367
event_benchmark-1924 [000] 499.515713 benchmark_event: last=134
first=2955 max=10636 min=81 avg=93 std=73 std^2=5367
event_benchmark-1924 [000] 499.515713 benchmark_event: last=92
first=2955 max=10636 min=81 avg=93 std=73 std^2=5367
event_benchmark-1924 [000] 499.515714 benchmark_event: last=93
first=2955 max=10636 min=81 avg=93 std=73 std^2=5367
event_benchmark-1924 [000] 499.515714 benchmark_event: last=92
first=2955 max=10636 min=81 avg=93 std=73 std^2=5367
event_benchmark-1924 [000] 499.515715 benchmark_event: last=89
first=2955 max=10636 min=81 avg=93 std=73 std^2=5367
event_benchmark-1924 [000] 499.515715 benchmark_event: last=86
first=2955 max=10636 min=81 avg=93 std=73 std^2=5367
event_benchmark-1924 [000] 499.515716 benchmark_event: last=107
first=2955 max=10636 min=81 avg=93 std=73 std^2=5367
Machine 2:
model name : Intel(R) Core(TM) i3-4130 CPU @ 3.40GHz
(2 core / 4 logical)
# trace-cmd report trace-tp-bench-inc.dat |tail
event_benchmark-3179 [001] 606.466907: benchmark_event: last=132
first=1463 max=1938 min=115 avg=140 std=36 std^2=1341
event_benchmark-3179 [001] 606.466907: benchmark_event: last=133
first=1463 max=1938 min=115 avg=140 std=36 std^2=1341
event_benchmark-3179 [001] 606.466908: benchmark_event: last=136
first=1463 max=1938 min=115 avg=140 std=36 std^2=1341
event_benchmark-3179 [001] 606.466909: benchmark_event: last=135
first=1463 max=1938 min=115 avg=140 std=36 std^2=1341
event_benchmark-3179 [001] 606.466909: benchmark_event: last=135
first=1463 max=1938 min=115 avg=140 std=36 std^2=1341
event_benchmark-3179 [001] 606.466910: benchmark_event: last=133
first=1463 max=1938 min=115 avg=140 std=36 std^2=1341
event_benchmark-3179 [001] 606.466911: benchmark_event: last=131
first=1463 max=1938 min=115 avg=140 std=36 std^2=1341
event_benchmark-3179 [001] 606.466912: benchmark_event: last=132
first=1463 max=1938 min=115 avg=140 std=36 std^2=1341
event_benchmark-3179 [001] 606.466913: benchmark_event: last=134
first=1463 max=1938 min=115 avg=140 std=36 std^2=1341
event_benchmark-3179 [001] 606.466913: benchmark_event: last=223
first=1463 max=1938 min=115 avg=140 std=36 std^2=1341
# trace-cmd report trace-tp-bench-cmpxchg.dat |tail
event_benchmark-3169 [001] 477.024485: benchmark_event: last=102
first=1820 max=12210 min=95 avg=266 std=75 std^2=5768
event_benchmark-3169 [001] 477.024486: benchmark_event: last=103
first=1820 max=12210 min=95 avg=266 std=75 std^2=5768
event_benchmark-3169 [001] 477.024486: benchmark_event: last=131
first=1820 max=12210 min=95 avg=266 std=75 std^2=5768
event_benchmark-3169 [001] 477.024487: benchmark_event: last=104
first=1820 max=12210 min=95 avg=266 std=75 std^2=5768
event_benchmark-3169 [001] 477.024487: benchmark_event: last=181
first=1820 max=12210 min=95 avg=266 std=75 std^2=5768
event_benchmark-3169 [001] 477.024488: benchmark_event: last=195
first=1820 max=12210 min=95 avg=266 std=75 std^2=5768
event_benchmark-3169 [001] 477.024489: benchmark_event: last=155
first=1820 max=12210 min=95 avg=266 std=75 std^2=5768
event_benchmark-3169 [001] 477.024489: benchmark_event: last=102
first=1820 max=12210 min=95 avg=266 std=75 std^2=5768
event_benchmark-3169 [001] 477.024490: benchmark_event: last=166
first=1820 max=12210 min=95 avg=266 std=75 std^2=5768
event_benchmark-3169 [001] 477.024490: benchmark_event: last=256
first=1820 max=12210 min=95 avg=266 std=75 std^2=5768
Machine 3:
model name : Intel(R) Core(TM)2 Quad CPU Q9450 @ 2.66GHz
( 4 cores / no hyperthreading enabled )
# trace-cmd report trace-tp-bench-inc.dat| tail
event_benchmark-2686 [000] 860.217674: benchmark_event: last=195
first=2527 max=4571 min=186 avg=264 std=144 std^2=20943
event_benchmark-2686 [000] 860.217675: benchmark_event: last=196
first=2527 max=4571 min=186 avg=264 std=144 std^2=20943
event_benchmark-2686 [000] 860.217676: benchmark_event: last=196
first=2527 max=4571 min=186 avg=264 std=144 std^2=20943
event_benchmark-2686 [000] 860.217677: benchmark_event: last=193
first=2527 max=4571 min=186 avg=264 std=144 std^2=20943
event_benchmark-2686 [000] 860.217679: benchmark_event: last=195
first=2527 max=4571 min=186 avg=264 std=144 std^2=20943
event_benchmark-2686 [000] 860.217680: benchmark_event: last=195
first=2527 max=4571 min=186 avg=264 std=144 std^2=20943
event_benchmark-2686 [000] 860.217681: benchmark_event: last=195
first=2527 max=4571 min=186 avg=264 std=144 std^2=20943
event_benchmark-2686 [000] 860.217682: benchmark_event: last=193
first=2527 max=4571 min=186 avg=264 std=144 std^2=20943
event_benchmark-2686 [000] 860.217683: benchmark_event: last=196
first=2527 max=4571 min=186 avg=264 std=144 std^2=20943
event_benchmark-2686 [000] 860.217684: benchmark_event: last=195
first=2527 max=4571 min=186 avg=264 std=144 std^2=20943
# trace-cmd report trace-tp-bench-cmpxchg.dat| tail
event_benchmark-2393 [003] 380.342603: benchmark_event: last=199
first=2707 max=3215 min=189 avg=297 std=163 std^2=26612
event_benchmark-2393 [003] 380.342604: benchmark_event: last=196
first=2707 max=3215 min=189 avg=297 std=163 std^2=26612
event_benchmark-2393 [003] 380.342606: benchmark_event: last=201
first=2707 max=3215 min=189 avg=297 std=163 std^2=26612
event_benchmark-2393 [003] 380.342607: benchmark_event: last=195
first=2707 max=3215 min=189 avg=297 std=163 std^2=26612
event_benchmark-2393 [003] 380.342608: benchmark_event: last=195
first=2707 max=3215 min=189 avg=297 std=163 std^2=26612
event_benchmark-2393 [003] 380.342609: benchmark_event: last=198
first=2707 max=3215 min=189 avg=297 std=163 std^2=26612
event_benchmark-2393 [003] 380.342610: benchmark_event: last=199
first=2707 max=3215 min=189 avg=297 std=163 std^2=26612
event_benchmark-2393 [003] 380.342611: benchmark_event: last=196
first=2707 max=3215 min=189 avg=297 std=163 std^2=26612
event_benchmark-2393 [003] 380.342613: benchmark_event: last=196
first=2707 max=3215 min=189 avg=297 std=163 std^2=26612
event_benchmark-2393 [003] 380.342614: benchmark_event: last=199
first=2707 max=3215 min=189 avg=297 std=163 std^2=26612
Not that big of a difference, but there is a difference. And the
standard deviation appears to be better, which would make sense as an
local_add_return() needs no repeat on collision.
-- Steve