Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

Salvatore Dipietro Thu, 19 Jun 2025 12:11:25 -0700

On Mon, 19 May 2025 at 09:38, Nathan Bossart <[email protected]> wrote:
> Could you retry your tests on v18devel?  It might also be useful to
> repeat the tests on a variety of hardware to ensure it's a win across
> the board.



Hi Nathan,
Thanks for your clarification. As you requested, I have run more
tests on different instance types and sizes.
In particular, I ran the `test_shm_mq_pipelined` benchmark on
Ubuntu 22.04 on m7g.[2,8,16]xlarge and c8g.[2,8,24]xlarge instances
with the PG master branch (commit: 84914e964b4). Each test was repeated
30 times. The tables below show the average run time (in seconds) and,
in parentheses, the ratio of the baseline (Master) time to the No-ISB
time, so values above 1.00x mean No-ISB is faster.

Graviton3 instances (m7g) results:
| Concurrency | Loops    | 2xl Master | 2xl No-ISB    | 8xl Master | 8xl No-ISB    | 16xl Master | 16xl No-ISB    |
|-------------|----------|------------|---------------|------------|---------------|-------------|----------------|
| 1           | 10000000 | 1.9s       | 1.9s (1.00x)  | 1.9s       | 1.8s (1.06x)  | 1.7s        | 1.8s (0.94x)   |
| 2           | 10000000 | 2.4s       | 2.4s (1.00x)  | 2.5s       | 2.3s (1.09x)  | 2.3s        | 2.3s (1.00x)   |
| 4           | 10000000 | 3.8s       | 3.8s (1.00x)  | 3.8s       | 3.6s (1.06x)  | 3.5s        | 3.8s (0.92x)   |
| 8           | 10000000 | 8.9s       | 10.3s (0.86x) | 7.5s       | 8.6s (0.87x)  | 7.8s        | 8.9s (0.88x)   |
| 16          | 10000000 | 21.6s      | 22.5s (0.96x) | 22.5s      | 23.6s (0.95x) | 21.4s       | 24.9s (0.86x)  |
| 32          | 10000000 | 42.8s      | 41.3s (1.04x) | 114.7s     | 52.0s (2.21x) | 88.6s       | 49.9s (1.78x)  |
| 64          | 10000000 | 81.8s      | 73.3s (1.12x) | 395.9s     | 85.2s (4.65x) | 381.3s      | 97.0s (3.93x)  |
| 32          | 100000   | 0.4s       | 0.4s (1.00x)  | 1.1s       | 0.5s (2.20x)  | 1.1s        | 0.6s (1.83x)   |
| 64          | 100000   | 0.8s       | 0.8s (1.00x)  | 3.9s       | 0.9s (4.33x)  | 3.9s        | 1.1s (3.55x)   |
| 128         | 100000   | 1.6s       | 1.5s (1.07x)  | 8.5s       | 1.9s (4.47x)  | 13.3s       | 2.0s (6.65x)   |
| 256         | 100000   | 3.2s       | 3.1s (1.03x)  | 19.8s      | 4.0s (4.95x)  | 35.9s       | 4.1s (8.76x)   |

Graviton4 instances (c8g) results:
| Concurrency | Loops    | 2xl Master | 2xl No-ISB    | 8xl Master | 8xl No-ISB    | 24xl Master | 24xl No-ISB    |
|-------------|----------|------------|---------------|------------|---------------|-------------|----------------|
| 1           | 10000000 | 1.7s       | 1.6s (1.06x)  | 1.6s       | 1.6s (1.00x)  | 1.6s        | 1.5s (1.07x)   |
| 2           | 10000000 | 2.2s       | 2.2s (1.00x)  | 2.2s       | 2.2s (1.00x)  | 2.2s        | 2.1s (1.05x)   |
| 4           | 10000000 | 3.4s       | 3.5s (0.97x)  | 3.5s       | 3.4s (1.03x)  | 3.5s        | 3.4s (1.03x)   |
| 8           | 10000000 | 10.9s      | 13.9s (0.78x) | 8.2s       | 9.4s (0.87x)  | 7.8s        | 8.2s (0.95x)   |
| 16          | 10000000 | 23.6s      | 27.0s (0.87x) | 26.3s      | 26.1s (1.01x) | 27.1s       | 28.1s (0.96x)  |
| 32          | 10000000 | 44.6s      | 46.9s (0.95x) | 60.6s      | 47.7s (1.27x) | 62.1s       | 50.4s (1.23x)  |
| 64          | 10000000 | 81.4s      | 81.5s (1.00x) | 189.4s     | 91.5s (2.07x) | 176.9s      | 101.3s (1.75x) |
| 32          | 100000   | 0.5s       | 0.5s (1.00x)  | 0.6s       | 0.5s (1.20x)  | 0.6s        | 0.5s (1.20x)   |
| 64          | 100000   | 0.8s       | 0.8s (1.00x)  | 1.7s       | 0.9s (1.89x)  | 2.1s        | 1.2s (1.75x)   |
| 128         | 100000   | 1.5s       | 1.6s (0.94x)  | 4.5s       | 1.9s (2.37x)  | 7.8s        | 2.1s (3.71x)   |
| 256         | 100000   | 3.3s       | 3.1s (1.06x)  | 9.7s       | 4.1s (2.37x)  | 22.0s       | 4.5s (4.89x)   |


At low concurrency (1, 2, 4) the results are similar, while at medium
concurrency (8, 16) the No-ISB approach can introduce some regression,
especially on smaller instances (down to 0.78x on c8g.2xl with 8
concurrency). However, there is a significant performance gain at high
concurrency (>=32) on large instances (up to 8.76x on m7g.16xl with
256 concurrency).
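
For anyone who wants to recheck the ratios in parentheses, here is a
minimal sketch of how they can be reproduced from the averaged times in
the tables. It assumes the ratio is Master time divided by No-ISB time,
which matches the published figures. The function name `speedup` and the
`samples` dictionary are illustrative only and are not part of the
benchmark or the patch.

```python
# Sketch only: recompute the "(N.NNx)" ratios from the averaged run times.
# Assumption: ratio = Master seconds / No-ISB seconds, so values above
# 1.00 mean the No-ISB build finished faster.

def speedup(master_s: float, no_isb_s: float) -> float:
    """Return the Master / No-ISB run-time ratio, rounded to two decimals."""
    return round(master_s / no_isb_s, 2)

# (instance, concurrency, loops) -> (Master seconds, No-ISB seconds),
# copied from a few rows of the tables above.
samples = {
    ("m7g.16xl", 256, 100000): (35.9, 4.1),
    ("m7g.8xl", 64, 10000000): (395.9, 85.2),
    ("c8g.2xl", 8, 10000000): (10.9, 13.9),
}

for key, (master_s, no_isb_s) in samples.items():
    ratio = speedup(master_s, no_isb_s)
    verdict = "faster" if ratio > 1 else "slower" if ratio < 1 else "same"
    print(key, f"{ratio:.2f}x", verdict)
```

The three sample rows give 8.76x, 4.65x and 0.78x, the same values
listed in the tables.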
