Re: Add spin_delay() implementation for Arm in s_lock.h

2022-04-08 Thread Blake, Geoff
Thanks for all the help Tom!

On 4/6/22, 6:07 PM, "Tom Lane"  wrote:

    "Blake, Geoff"  writes:
> Hi Tom, Andres,
> Any additional feedback for this patch?

I did some more research and testing:

* Using a Mac with the M1 Pro chip (marginally beefier than the M1
I was testing on before), I think I can see some benefit in the
test case I proposed upthread.  It's marginal though.

* On a Raspberry Pi 3B+, there's no outside-the-noise difference.

* ISB doesn't exist in pre-V7 ARM, so it seems prudent to restrict
the patch to ARM64.  I doubt any flavor of ARM32 would be able to
benefit anyway.  (Googling finds that MariaDB made this same
choice not long ago [1].)

So what we've got is that there seems to be benefit at high
core counts, and it at least doesn't hurt at lower ones.
That's good enough for me, so pushed.

regards, tom lane

[1] https://jira.mariadb.org/browse/MDEV-25807
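
For reference, here is a minimal sketch of the ARM64-only spin_delay()
described above, in the style of s_lock.h (the committed code may differ in
detail).  ISB briefly stalls the core, playing the same role as x86's PAUSE
("rep; nop") does in a spin loop:

#if defined(__aarch64__)
/*
 * Sketch only: use the ISB instruction as a spin-delay hint.  Restricted
 * to ARM64, since ISB does not exist before ARMv7 and 32-bit ARM is
 * unlikely to benefit anyway.
 */
#define SPIN_DELAY() spin_delay()

static __inline__ void
spin_delay(void)
{
	__asm__ __volatile__(
		" isb;				\n");
}
#endif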



Re: Add spin_delay() implementation for Arm in s_lock.h

2022-01-25 Thread Blake, Geoff
Hi Tom, Andres,

Any additional feedback for this patch?

Thanks,
Geoff Blake



Re: Add spin_delay() implementation for Arm in s_lock.h

2022-01-13 Thread Blake, Geoff
As promised, here is the remaining data:

1 worker, w/o patch: 5236 ms +/- 252ms
1 worker, w/   patch: 5529 ms +/- 168ms

2 worker, w/o patch: 4917 ms +/- 180ms
2 worker, w/   patch: 4745 ms +/- 169ms

4 worker, w/o patch: 6564 ms +/- 336ms
4 worker, w/   patch: 6105 ms +/- 177ms

8 worker, w/o patch: 9575 ms +/- 2375ms
8 worker, w/   patch: 8115 ms +/- 391ms

16 worker, w/o patch: 19367 ms +/- 3543ms
16 worker, w/   patch: 18004 ms +/- 3701ms

32 worker, w/o patch: 101509 ms +/- 22651ms
32 worker, w/   patch: 104234 ms +/- 26821ms

48 worker, w/o patch: 243329 ms +/- 70037ms
48 worker, w/   patch: 189965 ms +/- 79459ms

64 worker, w/o patch: 552443 ms +/- 22841ms
64 worker, w/   patch: 502727 ms +/- 45253ms

From this data, the patch is on average beneficial at the highest worker (CPU) 
counts tested (48 and 63); at 32 workers and below, the two configurations 
perform about the same.

Thanks,
Geoff
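
(To see where such a delay hint acts, here is a hypothetical test-and-set
spin loop, for illustration only and not PostgreSQL's actual locking code;
SPIN_DELAY() is assumed to be the ISB-based hint from the patch.  Each failed
acquisition attempt is followed by the hint, so with 48-64 contending workers
it fires constantly, which is consistent with the benefit appearing mainly at
the highest worker counts.)

#include <stdatomic.h>

typedef atomic_flag slock_t;

/* Hypothetical spin loop, for illustration only. */
static inline void
spin_lock_sketch(slock_t *lock)
{
	/* Retry the atomic test-and-set, pausing briefly between attempts. */
	while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
		SPIN_DELAY();
}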



Re: Add spin_delay() implementation for Arm in s_lock.h

2022-01-13 Thread Blake, Geoff
Tom, Andres,

I spun up a 64-core Graviton2 instance (where I reported seeing improvement 
with this patch) and ran the provided regression test with and without my 
proposed patch on top of mainline PG.  I ran 4 runs each with 63 workers, where 
we should see the most contention and thus the most impact from the patch.  I 
am reporting the average and standard deviation; the average with the patch is 
10% lower latency, but the standard deviations overlap.  I'll gather additional 
data at lower worker counts and post later to see what the trend is.

Cmd: postgres=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 1000, workers);

Avg +/- standard dev
63 workers w/o patch: 552443 ms +/- 22841 ms
63 workers w/  patch: 502727 ms +/- 45253 ms

Best results
w/o patch: 521216ms
w/   patch: 436442ms

Thanks,
Geoff




Re: Add spin_delay() implementation for Arm in s_lock.h

2022-01-03 Thread Blake, Geoff
Tom,

Hope everything is well going into the new year.  I'd like to pick this 
discussion back up and get your thoughts on the patch given the data I posted 2 
weeks prior.  Is there more data that would be helpful?  A different setup?  
Data on older versions of PostgreSQL, to ascertain whether the patch makes more 
sense on versions before the large rework of the snapshot algorithm, which 
exhibited quite a bit of synchronization contention?

Thanks,
Geoff




Re: Add spin_delay() implementation for Arm in s_lock.h

2021-12-13 Thread Blake, Geoff
Hi Tom,

> What did you test exactly?

Tested 3 benchmark configurations on an m6g.16xlarge (Graviton2, 64 CPUs, 
256 GB RAM).  I set the scale factor to consume about 1/3 of the 256 GB of RAM; 
the other parameters are in the next line.

pgbench setup: -F 90 -s 5622 -c 256

pgbench select-only w/   patch: 662804 tps (-0.5%)
                    w/o patch:  666354 tps
pgbench tpcb-like   w/   patch:  35844 tps (0%)
                    w/o patch:   35835 tps

We also test with HammerDB when evaluating patches; it shows the patch gaining 
+3%:

HammerDB (192 warehouses, 256 clients)
w/   patch: 1147463 NOPM (+3%)
w/o patch:  1112908 NOPM

I've run pgbench more than once and the measured TPS values overlap; even 
though the means on select-only show a small degradation, at the moment I am 
concluding it is noise.  On HammerDB, the results show a measurable difference.

Thanks,
Geoff