On 10/12/2016 08:55 PM, Robert Haas wrote:
> On Wed, Oct 12, 2016 at 3:21 AM, Dilip Kumar <dilipbal...@gmail.com> wrote:
>> I think at higher client count from client count 96 onwards contention
>> on CLogControlLock is clearly visible and which is completely solved
>> with group lock patch.
>>
>> And at lower client count 32,64 contention on CLogControlLock is not
>> significant hence we can not see any gain with group lock patch.
>> (though we can see some contention on CLogControlLock is reduced at 64
>> clients.)
>
> I agree with these conclusions. I had a chance to talk with Andres
> this morning at Postgres Vision and based on that conversation I'd
> like to suggest a couple of additional tests:
>
> 1. Repeat this test on x86. In particular, I think you should test on
> the EnterpriseDB server cthulhu, which is an 8-socket x86 server.
>
> 2. Repeat this test with a mixed read-write workload, like -b
> tpcb-like@1 -b select-only@9
>
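(For anyone wanting to reproduce the mix suggested in (2): with 9.6 pgbench this is just a matter of combining weighted built-in scripts. A rough sketch - the scale, client/thread counts, duration and database name below are placeholders, not values from any particular run discussed here:)

    # initialize the test database (scale is a placeholder)
    pgbench -i -s 300 bench

    # 10% tpcb-like / 90% select-only mix, as suggested above;
    # client/thread counts and duration are placeholders too
    pgbench -c 96 -j 96 -T 300 -b tpcb-like@1 -b select-only@9 bench
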
FWIW, I'm already running similar benchmarks on an x86 machine with 72 cores (144 with HT). It's "just" a 4-socket system, but the results I got so far seem quite interesting. The tooling and results (pushed incrementally) are available here:

https://bitbucket.org/tvondra/hp05-results/overview

The tooling is completely automated, and it also collects various stats, like for example the wait event statistics. So perhaps we could simply run it on cthulhu and get comparable results, and also more thorough data sets than just snippets posted to the list?

There's also a bunch of reports for the 5 already completed runs:

- dilip-300-logged-sync
- dilip-300-unlogged-sync
- pgbench-300-logged-sync-skip
- pgbench-300-unlogged-sync-noskip
- pgbench-300-unlogged-sync-skip

The name identifies the workload type, the scale, and whether the tables are WAL-logged (for pgbench, "skip" means "-N" while "noskip" means regular pgbench).

For example, "reports/wait-events-count-patches.txt" compares the wait event stats with the different patches applied (and master):

https://bitbucket.org/tvondra/hp05-results/src/506d0bee9e6557b015a31d72f6c3506e3f198c17/reports/wait-events-count-patches.txt?at=master&fileviewer=file-view-default

and the average tps (from 3 runs, 5 minutes each):

https://bitbucket.org/tvondra/hp05-results/src/506d0bee9e6557b015a31d72f6c3506e3f198c17/reports/tps-avg-patches.txt?at=master&fileviewer=file-view-default

There are certainly interesting bits. For example, while the "logged" case is dominated by WALWriteLock for most client counts, for large client counts that's no longer true. Consider for example the dilip-300-logged-sync results with 216 clients:

     wait_event      | master  | gran_lock | no_cont_lock | group_upd
---------------------+---------+-----------+--------------+-----------
 CLogControlLock     |  624566 |    474261 |       458599 |    225338
 WALWriteLock        |  431106 |    623142 |       619596 |    699224
                     |  331542 |    358220 |       371393 |    537076
 buffer_content      |  261308 |    134764 |       138664 |    102057
 ClientRead          |   59826 |    100883 |       103609 |    118379
 transactionid       |   26966 |     23155 |        23815 |     31700
 ProcArrayLock       |    3967 |      3852 |         4070 |      4576
 wal_insert          |    3948 |     10430 |         9513 |     12079
 clog                |    1710 |      4006 |         2443 |       925
 XidGenLock          |    1689 |      3785 |         4229 |      3539
 tuple               |     965 |       617 |          655 |       840
 lock_manager        |     300 |       571 |          619 |       802
 WALBufMappingLock   |     168 |       140 |          158 |       147
 SubtransControlLock |      60 |       115 |          124 |       105

Clearly, CLOG is an issue here, and it's (slightly) improved by all the patches, with group_update performing the best. And with 288 clients (which is 2x the number of virtual cores in the machine, so not entirely crazy) you get this:

     wait_event      | master  | gran_lock | no_cont_lock | group_upd
---------------------+---------+-----------+--------------+-----------
 CLogControlLock     |  901670 |    736822 |       728823 |    398111
 buffer_content      |  492637 |    318129 |       319251 |    270416
 WALWriteLock        |  414371 |    593804 |       589809 |    656613
                     |  380344 |    452936 |       470178 |    745790
 ClientRead          |   60261 |    111367 |       111391 |    126151
 transactionid       |   43627 |     34585 |        35464 |     48679
 wal_insert          |    5423 |     29323 |        25898 |     30191
 ProcArrayLock       |    4379 |      3918 |         4006 |      4582
 clog                |    2952 |      9135 |         5304 |      2514
 XidGenLock          |    2182 |      9488 |         8894 |      8595
 tuple               |    2176 |      1288 |         1409 |      1821
 lock_manager        |     323 |       797 |          827 |      1006
 WALBufMappingLock   |     124 |       124 |          146 |       206
 SubtransControlLock |      85 |       146 |          170 |       120

So even buffer_content gets ahead of WALWriteLock.
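(In case it's useful for running this elsewhere: a minimal sketch of one way to collect such per-run counts, by periodically sampling pg_stat_activity, is below. The actual scripts in the repository are more elaborate and may do this differently, so treat this purely as an illustration; the 300s duration is a placeholder.)

    # sample wait events once per second during a run, then tally them
    for i in $(seq 1 300); do
        psql -At -F ' ' -c "SELECT wait_event_type, wait_event
                              FROM pg_stat_activity
                             WHERE wait_event IS NOT NULL" >> wait-events.log
        sleep 1
    done

    # per-event counts, most frequent first
    sort wait-events.log | uniq -c | sort -rn
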
I wonder whether the buffer_content contention might be because of only having 128 buffers for clog pages, causing contention on this system (surely, systems with 144 cores were not that common when the 128 limit was introduced).

So the patches have a positive impact even with WAL logging, as illustrated by the tps improvements (for large client counts):

 clients | master | gran_locking | no_content_lock | group_update
---------+--------+--------------+-----------------+--------------
      36 |  39725 |        39627 |           41203 |        39763
      72 |  70533 |        65795 |           65602 |        66195
     108 |  81664 |        87415 |           86896 |        87199
     144 |  68950 |        98054 |           98266 |       102834
     180 | 105741 |       109827 |          109201 |       113911
     216 |  62789 |        92193 |           90586 |        98995
     252 |  94243 |       102368 |          100663 |       107515
     288 |  57895 |        83608 |           82556 |        91738

I find the tps fluctuation intriguing, and I'd like to see that fixed before committing any of the patches.

For pgbench-300-logged-sync-skip (the other WAL-logging test already completed), the CLOG contention is also reduced significantly, but the tps does not improve as significantly.

For the unlogged case (dilip-300-unlogged-sync), the results are fairly similar - CLogControlLock and buffer_content dominate the wait event profiles (WALWriteLock is missing, of course), and the average tps fluctuates in almost exactly the same way.

Interestingly, there is no such fluctuation for the pgbench tests. For example, for pgbench-300-unlogged-sync-skip (i.e. pgbench -N) the result is this:

 clients | master | gran_locking | no_content_lock | group_update
---------+--------+--------------+-----------------+--------------
      36 | 147265 |       148663 |          148985 |       146559
      72 | 162645 |       209070 |          207841 |       204588
     108 | 135785 |       219982 |          218111 |       217588
     144 | 113979 |       228683 |          228953 |       226934
     180 |  96930 |       230161 |          230316 |       227156
     216 |  89068 |       224241 |          226524 |       225805
     252 |  78203 |       222507 |          225636 |       224810
     288 |  63999 |       204524 |          225469 |       220098

That's a fairly significant improvement, and the behavior is very smooth. Sadly, with WAL logging (pgbench-300-logged-sync-skip) the tps drops back to the master level, mostly thanks to WALWriteLock.

Another interesting aspect of the patches is the impact on variability of the results. For example, looking at dilip-300-unlogged-sync on master, the overall average tps (for the three runs combined) and the tps for each of the three runs look like this:

 clients | avg_tps |   tps_1   |   tps_2   |   tps_3
---------+---------+-----------+-----------+-----------
      36 |  117332 |    115042 |    116125 |    120841
      72 |   90917 |     72451 |    119915 |     80319
     108 |   96070 |    106105 |     73606 |    108580
     144 |   81422 |     71094 |    102109 |     71063
     180 |   88537 |     98871 |     67756 |     99021
     216 |   75962 |     65584 |     96365 |     66010
     252 |   59941 |     57771 |     64756 |     57289
     288 |   80851 |     93005 |     56454 |     93313

Notice the variability between the runs - the difference between min and max is often more than 40%. Now compare it to the results with the "group-update" patch applied:

 clients | avg_tps |   tps_1   |   tps_2   |   tps_3
---------+---------+-----------+-----------+-----------
      36 |  116273 |    117031 |    116005 |    115786
      72 |  145273 |    147166 |    144839 |    143821
     108 |   89892 |     89957 |     89585 |     90133
     144 |  130176 |    130310 |    130565 |    129655
     180 |   81944 |     81927 |     81951 |     81953
     216 |  124415 |    124367 |    123228 |    125651
     252 |   76723 |     76467 |     77266 |     76436
     288 |  120072 |    121205 |    119731 |    119283

In this case there's pretty much no cross-run variability - the differences are usually within 2%, so basically random noise. (There's of course the variability depending on client count, but that was already mentioned.)
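(If anyone wants to slice the cross-run variability differently, the per-run numbers are in the repository. Summarizing them boils down to a simple aggregate - the "results" table below, with (clients int, run int, tps numeric) columns, is purely hypothetical and only illustrates the min/max spread computation:)

    psql -c "SELECT clients,
                    round(avg(tps)) AS avg_tps,
                    round(min(tps)) AS min_tps,
                    round(max(tps)) AS max_tps,
                    round(100.0 * (max(tps) - min(tps)) / min(tps), 1) AS spread_pct
               FROM results
              GROUP BY clients
              ORDER BY clients"
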
There's certainly much more interesting stuff in the results, but I don't have time for a more thorough analysis now - I only intended to do some "quick benchmarking" of the patches, I've already spent days on this, and I have other things to do.

I'll take care of collecting data for the remaining cases on this machine (and possibly of running the same tests on the other one, if I manage to get access to it again). But I'll leave further analysis of the collected data up to the patch authors, or some volunteers.

regards

--
Tomas Vondra                      http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services