Re: Add spin_delay() implementation for Arm in s_lock.h
Thanks for all the help Tom!

On 4/6/22, 6:07 PM, "Tom Lane" wrote:

> "Blake, Geoff" writes:
> > Hi Tom, Andres,
> > Any additional feedback for this patch?
>
> I did some more research and testing:
>
> * Using a Mac with the M1 Pro chip (marginally beefier than the M1 I was
>   testing on before), I think I can see some benefit in the test case I
>   proposed upthread. It's marginal though.
>
> * On a Raspberry Pi 3B+, there's no outside-the-noise difference.
>
> * ISB doesn't exist in pre-V7 ARM, so it seems prudent to restrict the
>   patch to ARM64. I doubt any flavor of ARM32 would be able to benefit
>   anyway. (Googling finds that MariaDB made this same choice not long
>   ago [1].)
>
> So what we've got is that there seems to be benefit at high core counts,
> and it at least doesn't hurt at lower ones. That's good enough for me,
> so pushed.
>
> regards, tom lane
>
> [1] https://jira.mariadb.org/browse/MDEV-25807
Re: Add spin_delay() implementation for Arm in s_lock.h
Hi Tom, Andres,

Any additional feedback for this patch?

Thanks,
Geoff Blake
Re: Add spin_delay() implementation for Arm in s_lock.h
As promised, here is the remaining data (avg +/- standard dev):

 1 worker,  w/o patch:   5236 ms +/-   252 ms
 1 worker,  w/  patch:   5529 ms +/-   168 ms
 2 workers, w/o patch:   4917 ms +/-   180 ms
 2 workers, w/  patch:   4745 ms +/-   169 ms
 4 workers, w/o patch:   6564 ms +/-   336 ms
 4 workers, w/  patch:   6105 ms +/-   177 ms
 8 workers, w/o patch:   9575 ms +/-  2375 ms
 8 workers, w/  patch:   8115 ms +/-   391 ms
16 workers, w/o patch:  19367 ms +/-  3543 ms
16 workers, w/  patch:  18004 ms +/-  3701 ms
32 workers, w/o patch: 101509 ms +/- 22651 ms
32 workers, w/  patch: 104234 ms +/- 26821 ms
48 workers, w/o patch: 243329 ms +/- 70037 ms
48 workers, w/  patch: 189965 ms +/- 79459 ms
64 workers, w/o patch: 552443 ms +/- 22841 ms
64 workers, w/  patch: 502727 ms +/- 45253 ms

From this data, on average the patch is beneficial at the high worker (CPU) counts tested (48 and 64). At 32 workers and below, the two configurations perform close to each other.

Thanks,
Geoff
Re: Add spin_delay() implementation for Arm in s_lock.h
Tom, Andres,

I spun up a 64-core Graviton2 instance (where I reported seeing improvement with this patch) and ran the provided regression test with and without my proposed patch on top of mainline PG. I ran 4 runs each with 63 workers, where we should see the most contention and the most impact from the patch. I am reporting the average and standard deviation; the average with the patch is 10% lower latency, but the standard deviations overlap. I'll gather additional data at lower worker counts and post later to see what the trend is.

Cmd:
postgres=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 1000, workers);

Avg +/- standard dev:
63 workers, w/o patch: 552443 ms +/- 22841 ms
63 workers, w/  patch: 502727 ms +/- 45253 ms

Best results:
w/o patch: 521216 ms
w/  patch: 436442 ms

Thanks,
Geoff
Re: Add spin_delay() implementation for Arm in s_lock.h
Tom,

Hope everything is well going into the new year. I'd like to pick this discussion back up and get your thoughts on the patch, given the data I posted 2 weeks prior. Is there more data that would be helpful? A different setup? Data on older versions of PostgreSQL, to ascertain whether the patch makes more sense on versions before the large rework of the snapshot algorithm, which exhibited quite a bit of synchronization contention?

Thanks,
Geoff
Re: Add spin_delay() implementation for Arm in s_lock.h
Hi Tom,

> What did you test exactly?

I tested 3 benchmark configurations on an m6g.16xlarge (Graviton2, 64 CPUs, 256 GB RAM). I set the scale factor to consume about 1/3 of the 256 GB of RAM, with the other parameters as follows.

pgbench setup: -F 90 -s 5622 -c 256

pgbench select-only:
w/  patch: 662804 tps (-0.5%)
w/o patch: 666354 tps

tpcb-like:
w/  patch: 35844 tps (0%)
w/o patch: 35835 tps

We also test with HammerDB when evaluating patches; it shows the patch gains +3%:

HammerDB (192 warehouses, 256 clients):
w/  patch: 1147463 NOPM (+3%)
w/o patch: 1112908 NOPM

I've run pgbench more than once and the measured TPS values overlap; even though the means on select-only show a small degradation, at the moment I am concluding it is noise. On HammerDB, the results show a measurable difference.

Thanks,
Geoff