On 10/12/2016 08:55 PM, Robert Haas wrote:
> On Wed, Oct 12, 2016 at 3:21 AM, Dilip Kumar <dilipbal...@gmail.com> wrote:
>> I think at higher client count from client count 96 onwards contention
>> on CLogControlLock is clearly visible and which is completely solved
>> with group lock patch.
>>
>> And at lower client count 32,64 contention on CLogControlLock is not
>> significant hence we can not see any gain with group lock patch.
>> (though we can see some contention on CLogControlLock is reduced at 64
>> clients.)
>
> I agree with these conclusions. I had a chance to talk with Andres
> this morning at Postgres Vision and based on that conversation I'd
> like to suggest a couple of additional tests:
>
> 1. Repeat this test on x86. In particular, I think you should test on
> the EnterpriseDB server cthulhu, which is an 8-socket x86 server.
>
> 2. Repeat this test with a mixed read-write workload, like -b
> tpcb-like@1 -b select-only@9
>
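(For anyone wanting to reproduce the mix suggested in (2): with 9.6 pgbench this is just a matter of combining weighted built-in scripts. A rough sketch - the scale, client/thread counts, duration and database name below are placeholders, not values from any particular run discussed here:)

    # initialize the test database (scale is a placeholder)
    pgbench -i -s 300 bench

    # 10% tpcb-like / 90% select-only mix, as suggested above;
    # client/thread counts and duration are placeholders too
    pgbench -c 96 -j 96 -T 300 -b tpcb-like@1 -b select-only@9 bench
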
FWIW, I'm already running similar benchmarks on an x86 machine with 72 cores (144 with HT). It's "just" a 4-socket system, but the results I got so far seem quite interesting. The tooling and results (pushed incrementally) are available here:

https://bitbucket.org/tvondra/hp05-results/overview

The tooling is completely automated, and it also collects various stats, like for example the wait event statistics. So perhaps we could simply run it on cthulhu and get comparable results, and also more thorough data sets than just snippets posted to the list?

There's also a bunch of reports for the 5 already completed runs:

- dilip-300-logged-sync
- dilip-300-unlogged-sync
- pgbench-300-logged-sync-skip
- pgbench-300-unlogged-sync-noskip
- pgbench-300-unlogged-sync-skip

The name identifies the workload type, the scale, and whether the tables are WAL-logged (for pgbench, "skip" means "-N" while "noskip" means regular pgbench).

For example, "reports/wait-events-count-patches.txt" compares the wait event stats with the different patches applied (and master):

https://bitbucket.org/tvondra/hp05-results/src/506d0bee9e6557b015a31d72f6c3506e3f198c17/reports/wait-events-count-patches.txt?at=master&fileviewer=file-view-default

and the average tps (from 3 runs, 5 minutes each):

https://bitbucket.org/tvondra/hp05-results/src/506d0bee9e6557b015a31d72f6c3506e3f198c17/reports/tps-avg-patches.txt?at=master&fileviewer=file-view-default

There are certainly interesting bits. For example, while the "logged" case is dominated by WALWriteLock for most client counts, for large client counts that's no longer true. Consider for example the dilip-300-logged-sync results with 216 clients:

     wait_event      | master  | gran_lock | no_cont_lock | group_upd
---------------------+---------+-----------+--------------+-----------
 CLogControlLock     |  624566 |    474261 |       458599 |    225338
 WALWriteLock        |  431106 |    623142 |       619596 |    699224
                     |  331542 |    358220 |       371393 |    537076
 buffer_content      |  261308 |    134764 |       138664 |    102057
 ClientRead          |   59826 |    100883 |       103609 |    118379
 transactionid       |   26966 |     23155 |        23815 |     31700
 ProcArrayLock       |    3967 |      3852 |         4070 |      4576
 wal_insert          |    3948 |     10430 |         9513 |     12079
 clog                |    1710 |      4006 |         2443 |       925
 XidGenLock          |    1689 |      3785 |         4229 |      3539
 tuple               |     965 |       617 |          655 |       840
 lock_manager        |     300 |       571 |          619 |       802
 WALBufMappingLock   |     168 |       140 |          158 |       147
 SubtransControlLock |      60 |       115 |          124 |       105

Clearly, CLOG is an issue here, and it's (slightly) improved by all the patches, with group_update performing the best. And with 288 clients (which is 2x the number of virtual cores in the machine, so not entirely crazy) you get this:

     wait_event      | master  | gran_lock | no_cont_lock | group_upd
---------------------+---------+-----------+--------------+-----------
 CLogControlLock     |  901670 |    736822 |       728823 |    398111
 buffer_content      |  492637 |    318129 |       319251 |    270416
 WALWriteLock        |  414371 |    593804 |       589809 |    656613
                     |  380344 |    452936 |       470178 |    745790
 ClientRead          |   60261 |    111367 |       111391 |    126151
 transactionid       |   43627 |     34585 |        35464 |     48679
 wal_insert          |    5423 |     29323 |        25898 |     30191
 ProcArrayLock       |    4379 |      3918 |         4006 |      4582
 clog                |    2952 |      9135 |         5304 |      2514
 XidGenLock          |    2182 |      9488 |         8894 |      8595
 tuple               |    2176 |      1288 |         1409 |      1821
 lock_manager        |     323 |       797 |          827 |      1006
 WALBufMappingLock   |     124 |       124 |          146 |       206
 SubtransControlLock |      85 |       146 |          170 |       120

So even buffer_content gets ahead of WALWriteLock.
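(In case it's useful for running this elsewhere: a minimal sketch of one way to collect such per-run counts, by periodically sampling pg_stat_activity, is below. The actual scripts in the repository are more elaborate and may do this differently, so treat this purely as an illustration; the 300s duration is a placeholder.)

    # sample wait events once per second during a run, then tally them
    for i in $(seq 1 300); do
        psql -At -F ' ' -c "SELECT wait_event_type, wait_event
                              FROM pg_stat_activity
                             WHERE wait_event IS NOT NULL" >> wait-events.log
        sleep 1
    done

    # per-event counts, most frequent first
    sort wait-events.log | uniq -c | sort -rn
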
I wonder whether the buffer_content contention might be because of only having 128 buffers for clog pages, causing contention on this system (surely, systems with 144 cores were not that common when the 128 limit was introduced).

So the patches have a positive impact even with WAL logging, as illustrated by the tps improvements (for large client counts):

 clients | master | gran_locking | no_content_lock | group_update
---------+--------+--------------+-----------------+--------------
      36 |  39725 |        39627 |           41203 |        39763
      72 |  70533 |        65795 |           65602 |        66195
     108 |  81664 |        87415 |           86896 |        87199
     144 |  68950 |        98054 |           98266 |       102834
     180 | 105741 |       109827 |          109201 |       113911
     216 |  62789 |        92193 |           90586 |        98995
     252 |  94243 |       102368 |          100663 |       107515
     288 |  57895 |        83608 |           82556 |        91738

I find the tps fluctuation intriguing, and I'd like to see that fixed before committing any of the patches.

For pgbench-300-logged-sync-skip (the other WAL-logging test already completed), the CLOG contention is also reduced significantly, but the tps does not improve as significantly.

For the unlogged case (dilip-300-unlogged-sync), the results are fairly similar - CLogControlLock and buffer_content dominate the wait event profiles (WALWriteLock is missing, of course), and the average tps fluctuates in almost exactly the same way.

Interestingly, there is no such fluctuation for the pgbench tests. For example, for pgbench-300-unlogged-sync-skip (i.e. pgbench -N) the result is this:

 clients | master | gran_locking | no_content_lock | group_update
---------+--------+--------------+-----------------+--------------
      36 | 147265 |       148663 |          148985 |       146559
      72 | 162645 |       209070 |          207841 |       204588
     108 | 135785 |       219982 |          218111 |       217588
     144 | 113979 |       228683 |          228953 |       226934
     180 |  96930 |       230161 |          230316 |       227156
     216 |  89068 |       224241 |          226524 |       225805
     252 |  78203 |       222507 |          225636 |       224810
     288 |  63999 |       204524 |          225469 |       220098

That's a fairly significant improvement, and the behavior is very smooth. Sadly, with WAL logging (pgbench-300-logged-sync-skip) the tps drops back to the master level, mostly thanks to WALWriteLock.

Another interesting aspect of the patches is the impact on variability of the results. For example, looking at dilip-300-unlogged-sync on master, the overall average tps (for the three runs combined) and the tps for each of the three runs look like this:

 clients | avg_tps |   tps_1   |   tps_2   |   tps_3
---------+---------+-----------+-----------+-----------
      36 |  117332 |    115042 |    116125 |    120841
      72 |   90917 |     72451 |    119915 |     80319
     108 |   96070 |    106105 |     73606 |    108580
     144 |   81422 |     71094 |    102109 |     71063
     180 |   88537 |     98871 |     67756 |     99021
     216 |   75962 |     65584 |     96365 |     66010
     252 |   59941 |     57771 |     64756 |     57289
     288 |   80851 |     93005 |     56454 |     93313

Notice the variability between the runs - the difference between min and max is often more than 40%. Now compare it to the results with the "group-update" patch applied:

 clients | avg_tps |   tps_1   |   tps_2   |   tps_3
---------+---------+-----------+-----------+-----------
      36 |  116273 |    117031 |    116005 |    115786
      72 |  145273 |    147166 |    144839 |    143821
     108 |   89892 |     89957 |     89585 |     90133
     144 |  130176 |    130310 |    130565 |    129655
     180 |   81944 |     81927 |     81951 |     81953
     216 |  124415 |    124367 |    123228 |    125651
     252 |   76723 |     76467 |     77266 |     76436
     288 |  120072 |    121205 |    119731 |    119283

In this case there's pretty much no cross-run variability - the differences are usually within 2%, so basically random noise. (There's of course the variability depending on client count, but that was already mentioned.)
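(If anyone wants to slice the cross-run variability differently, the per-run numbers are in the repository. Summarizing them boils down to a simple aggregate - the "results" table below, with (clients int, run int, tps numeric) columns, is purely hypothetical and only illustrates the min/max spread computation:)

    psql -c "SELECT clients,
                    round(avg(tps)) AS avg_tps,
                    round(min(tps)) AS min_tps,
                    round(max(tps)) AS max_tps,
                    round(100.0 * (max(tps) - min(tps)) / min(tps), 1) AS spread_pct
               FROM results
              GROUP BY clients
              ORDER BY clients"
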
There's certainly much more interesting stuff in the results, but I don't have time for a more thorough analysis now - I only intended to do some "quick benchmarking" of the patches, I've already spent days on this, and I have other things to do.

I'll take care of collecting data for the remaining cases on this machine (and possibly of running the same tests on the other one, if I manage to get access to it again). But I'll leave further analysis of the collected data up to the patch authors, or some volunteers.

regards

--
Tomas Vondra                      http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services