On Mon, 19 May 2025 at 09:38, Nathan Bossart <nathandboss...@gmail.com> wrote:
> Could you retry your tests on v18devel? It might also be useful to repeat the tests on a variety of hardware to ensure
> it's a win across the board.

Hi Nathan,

Thanks for your clarification. As you requested, I have run more tests on different instance types and sizes. In particular, I ran the `test_shm_mq_pipelined` benchmark using Ubuntu 22.04 on m7g.[2,8,16]xlarge and c8g.[2,8,24]xlarge instances with the PG master branch (commit: 84914e964b4). Each test was repeated 30 times. The tables below show the average time (in seconds) and, in parentheses, the speedup relative to the baseline (Master); values above 1.00x mean No-ISB is faster.

Graviton3 instances (m7g) results:

| Concurrency | Loops    | 2xl Master | 2xl No-ISB    | 8xl Master | 8xl No-ISB    | 16xl Master | 16xl No-ISB   |
|-------------|----------|------------|---------------|------------|---------------|-------------|---------------|
| 1           | 10000000 | 1.9s       | 1.9s (1.00x)  | 1.9s       | 1.8s (1.06x)  | 1.7s        | 1.8s (0.94x)  |
| 2           | 10000000 | 2.4s       | 2.4s (1.00x)  | 2.5s       | 2.3s (1.09x)  | 2.3s        | 2.3s (1.00x)  |
| 4           | 10000000 | 3.8s       | 3.8s (1.00x)  | 3.8s       | 3.6s (1.06x)  | 3.5s        | 3.8s (0.92x)  |
| 8           | 10000000 | 8.9s       | 10.3s (0.86x) | 7.5s       | 8.6s (0.87x)  | 7.8s        | 8.9s (0.88x)  |
| 16          | 10000000 | 21.6s      | 22.5s (0.96x) | 22.5s      | 23.6s (0.95x) | 21.4s       | 24.9s (0.86x) |
| 32          | 10000000 | 42.8s      | 41.3s (1.04x) | 114.7s     | 52.0s (2.21x) | 88.6s       | 49.9s (1.78x) |
| 64          | 10000000 | 81.8s      | 73.3s (1.12x) | 395.9s     | 85.2s (4.65x) | 381.3s      | 97.0s (3.93x) |
| 32          | 100000   | 0.4s       | 0.4s (1.00x)  | 1.1s       | 0.5s (2.20x)  | 1.1s        | 0.6s (1.83x)  |
| 64          | 100000   | 0.8s       | 0.8s (1.00x)  | 3.9s       | 0.9s (4.33x)  | 3.9s        | 1.1s (3.55x)  |
| 128         | 100000   | 1.6s       | 1.5s (1.07x)  | 8.5s       | 1.9s (4.47x)  | 13.3s       | 2.0s (6.65x)  |
| 256         | 100000   | 3.2s       | 3.1s (1.03x)  | 19.8s      | 4.0s (4.95x)  | 35.9s       | 4.1s (8.76x)  |

Graviton4 instances (c8g) results:

| Concurrency | Loops    | 2xl Master | 2xl No-ISB    | 8xl Master | 8xl No-ISB    | 24xl Master | 24xl No-ISB    |
|-------------|----------|------------|---------------|------------|---------------|-------------|----------------|
| 1           | 10000000 | 1.7s       | 1.6s (1.06x)  | 1.6s       | 1.6s (1.00x)  | 1.6s        | 1.5s (1.07x)   |
| 2           | 10000000 | 2.2s       | 2.2s (1.00x)  | 2.2s       | 2.2s (1.00x)  | 2.2s        | 2.1s (1.05x)   |
| 4           | 10000000 | 3.4s       | 3.5s (0.97x)  | 3.5s       | 3.4s (1.03x)  | 3.5s        | 3.4s (1.03x)   |
| 8           | 10000000 | 10.9s      | 13.9s (0.78x) | 8.2s       | 9.4s (0.87x)  | 7.8s        | 8.2s (0.95x)   |
| 16          | 10000000 | 23.6s      | 27.0s (0.87x) | 26.3s      | 26.1s (1.01x) | 27.1s       | 28.1s (0.96x)  |
| 32          | 10000000 | 44.6s      | 46.9s (0.95x) | 60.6s      | 47.7s (1.27x) | 62.1s       | 50.4s (1.23x)  |
| 64          | 10000000 | 81.4s      | 81.5s (1.00x) | 189.4s     | 91.5s (2.07x) | 176.9s      | 101.3s (1.75x) |
| 32          | 100000   | 0.5s       | 0.5s (1.00x)  | 0.6s       | 0.5s (1.20x)  | 0.6s        | 0.5s (1.20x)   |
| 64          | 100000   | 0.8s       | 0.8s (1.00x)  | 1.7s       | 0.9s (1.89x)  | 2.1s        | 1.2s (1.75x)   |
| 128         | 100000   | 1.5s       | 1.6s (0.94x)  | 4.5s       | 1.9s (2.37x)  | 7.8s        | 2.1s (3.71x)   |
| 256         | 100000   | 3.3s       | 3.1s (1.06x)  | 9.7s       | 4.1s (2.37x)  | 22.0s       | 4.5s (4.89x)   |

With low concurrency (1, 2, 4) the results are similar. With medium concurrency (8, 16) the No-ISB approach can introduce some regression, especially on smaller instances (down to 0.78x on c8g.2xl with 8 concurrency). However, with high concurrency (>=32) on large instances there is a significant performance improvement (up to 8.76x on m7g.16xl with 256 concurrency).
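
For reference, the ratios in parentheses can be reproduced as the Master time divided by the No-ISB time. Below is a minimal Python sketch of that arithmetic, assuming this definition. The function names and the 5% tolerance used to label a result are only illustrative and are not part of the benchmark, and the published ratios may have been computed from unrounded timings.

```python
def speedup(master_s: float, no_isb_s: float) -> float:
    """Ratio of baseline time to patched time; above 1.0 means No-ISB is faster."""
    return round(master_s / no_isb_s, 2)


def classify(ratio: float, tolerance: float = 0.05) -> str:
    """Label a ratio; `tolerance` is a hypothetical noise band, not a measured one."""
    if ratio > 1.0 + tolerance:
        return "improvement"
    if ratio < 1.0 - tolerance:
        return "regression"
    return "neutral"


# (instance, concurrency, loops, Master seconds, No-ISB seconds), taken from the tables
samples = [
    ("m7g.16xl", 256, 100000, 35.9, 4.1),
    ("m7g.2xl", 8, 10000000, 8.9, 10.3),
    ("c8g.2xl", 8, 10000000, 10.9, 13.9),
    ("c8g.2xl", 2, 10000000, 2.2, 2.2),
]

for instance, concurrency, loops, master_s, no_isb_s in samples:
    ratio = speedup(master_s, no_isb_s)
    print(f"{instance} c={concurrency} loops={loops}: {ratio:.2f}x ({classify(ratio)})")
```

With the sample rows above, the sketch reports 8.76x for m7g.16xl at 256 concurrency and 0.78x for c8g.2xl at 8 concurrency, matching the table entries.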