Hi,

On 10/27/2016 01:44 PM, Amit Kapila wrote:
On Thu, Oct 27, 2016 at 4:15 AM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:

FWIW I plan to run the same test with logged tables - if it shows similar
regression, I'll be much more worried, because that's a fairly typical
scenario (logged tables, data set > shared buffers), and we surely can't
just go and break that.


Sure, please do those tests.


OK, so I do have results for those tests - that is, scale 3000 with shared_buffers=16GB (so continuously writing out dirty buffers). The following reports show the results slightly differently - all three "tps charts" next to each other, then the speedup charts and tables.

Overall, the results are surprisingly positive - see these reports (all ending with "-retest"):

[1] http://tvondra.bitbucket.org/index2.html#dilip-3000-logged-sync-retest

[2] http://tvondra.bitbucket.org/index2.html#pgbench-3000-logged-sync-noskip-retest

[3] http://tvondra.bitbucket.org/index2.html#pgbench-3000-logged-sync-skip-retest

All three show significant improvement, even with fairly low client counts. For example with 72 clients, the tps improves by 20%, without significantly affecting variability of the results (measured as stddev, more on this later).

It's interesting, however, that "no_content_lock" performs almost exactly the same as master, while the other two patches improve significantly.

The other interesting thing is that "pgbench -N" [3] shows no such improvement, unlike regular pgbench and Dilip's workload. Not sure why, though - I'd expect to see significant improvement in this case.

I have also repeated those tests with clog buffers increased to 512 (so 4x the current maximum of 128). I only have results for Dilip's workload and "pgbench -N":

[4] http://tvondra.bitbucket.org/index2.html#dilip-3000-logged-sync-retest-512

[5] http://tvondra.bitbucket.org/index2.html#pgbench-3000-logged-sync-skip-retest-512

The results are somewhat surprising, I guess, because the effect is wildly different for each workload.

For Dilip's workload, increasing clog buffers to 512 pretty much eliminates all benefits of the patches. For example with 288 clients, the group_update patch gives ~60k tps on 128 buffers [1] but only 42k tps on 512 buffers [4].

With "pgbench -N", the effect is exactly the opposite - while with 128 buffers there was pretty much no benefit from any of the patches [3], with 512 buffers we suddenly get almost 2x the throughput, but only for group_update and master (while the other two patches show no improvement at all).

I don't have results for the regular pgbench ("noskip") with 512 buffers yet, but I'm curious what that will show.

In general, however, I think the patches don't show any regression on any of those workloads (at least not with 128 buffers). Based solely on the results, I like group_update best, because it performs as well as master or significantly better.

2. We do see in some cases that the granular_locking and
no_content_lock patches have shown a significant increase in
contention on CLOGControlLock. I have already shared my analysis
for the same upthread [8].


I've read that analysis, but I'm not sure I see how it explains the "zig zag" behavior. I do understand that shifting the contention to some other (already busy) lock may negatively impact throughput, or that the group_update may result in updating multiple clog pages, but I don't understand two things:

(1) Why this should result in the fluctuations we observe in some of the cases. For example, why should we see 150k tps with 72 clients, then drop to 92k with 108 clients, then climb back to 130k with 144 clients, then drop to 84k with 180 clients, etc. That seems fairly strange.

(2) Why this should affect all three patches, when only group_update has to modify multiple clog pages.

For example consider this:

    http://tvondra.bitbucket.org/index2.html#dilip-300-logged-async

Looking at the % of time spent on different locks with the group_update patch, I see this (one column per client count, ignoring locks with only ~1%):

 event_type     wait_event       36   72  108  144  180  216  252  288
 ---------------------------------------------------------------------
 -              -                60   63   45   53   38   50   33   48
 Client         ClientRead       33   23    9   14    6   10    4    8
 LWLockNamed    CLogControlLock   2    7   33   14   34   14   33   14
 LWLockTranche  buffer_content    0    2    9   13   19   18   26   22

I don't see any sign of contention shifting to other locks, just CLogControlLock fluctuating between 14% and 33% for some reason.
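
For the record, percentages like these can be obtained by sampling pg_stat_activity (say, once per second) during the benchmark and aggregating the samples afterwards. A simplified sketch of one such sample - not necessarily exactly what my collection scripts do - might be:

    -- one wait event sample across all backends; repeat regularly during
    -- the run and aggregate the counts to get the percentages
    SELECT coalesce(wait_event_type, '-') AS event_type,
           coalesce(wait_event, '-')      AS wait_event,
           count(*)                       AS backends
      FROM pg_stat_activity
     WHERE pid <> pg_backend_pid()        -- skip the monitoring session
     GROUP BY 1, 2
     ORDER BY 3 DESC;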

Now, maybe this has nothing to do with PostgreSQL itself, but maybe it's some sort of CPU / OS scheduling artifact. For example, the system has 36 physical cores, 72 virtual ones (thanks to HT). I find it strange that the "good" client counts are always multiples of 72, while the "bad" ones fall in between.

  72 = 72 * 1   (good)
 108 = 72 * 1.5 (bad)
 144 = 72 * 2   (good)
 180 = 72 * 2.5 (bad)
 216 = 72 * 3   (good)
 252 = 72 * 3.5 (bad)
 288 = 72 * 4   (good)

So maybe this has something to do with how the OS schedules the tasks, or with some internal heuristics in the CPU, or something like that.


On logged tables it usually looks like this (i.e. modest increase for high
client counts at the expense of significantly higher variability):

  http://tvondra.bitbucket.org/#pgbench-3000-logged-sync-skip-64


What variability are you referring to in those results?

Good question. What I mean by "variability" is how stable the tps is during the benchmark (when measured at per-second granularity). For example, let's run a 10-second benchmark, measuring the number of transactions committed each second.

Now, all three of these runs do 1000 tps on average:

  run 1: 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000
  run 2: 500, 1500, 500, 1500, 500, 1500, 500, 1500, 500, 1500
  run 3: 0, 2000, 0, 2000, 0, 2000, 0, 2000, 0, 2000

I guess we agree those runs behave very differently, despite having the same average throughput. This is what STDDEV(tps), i.e. the third chart in the reports, measures.
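
To make that concrete, here is a trivial query computing the average and stddev for those three example runs (using stddev_pop so the numbers come out exact):

    SELECT run, avg(tps)::int AS avg_tps, stddev_pop(tps)::int AS stddev_tps
      FROM (
        SELECT 1 AS run,
               unnest(ARRAY[1000,1000,1000,1000,1000,1000,1000,1000,1000,1000]) AS tps
        UNION ALL
        SELECT 2, unnest(ARRAY[500,1500,500,1500,500,1500,500,1500,500,1500])
        UNION ALL
        SELECT 3, unnest(ARRAY[0,2000,0,2000,0,2000,0,2000,0,2000])
      ) t
     GROUP BY run
     ORDER BY run;

which reports stddev 0, 500 and 1000 respectively, even though the average is 1000 tps in all three cases.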

So for example this [6] shows that the patches give us higher throughput with >= 180 clients, but we also pay for that with increased variability of the results (i.e. the tps chart will have jitter):

[6] http://tvondra.bitbucket.org/index2.html#pgbench-3000-logged-sync-skip-64

Of course, balancing throughput, latency and variability is one of the crucial trade-offs in transaction systems - at some point the resources get saturated and higher throughput can only be achieved at the cost of higher latency (e.g. by grouping requests). But still, we'd like to get stable tps from the system, not something that gives us 2000 tps one second and 0 tps the next.

Of course, this is not perfect - it does not show whether there are transactions with significantly higher latency, and so on. It'd be good to also measure latency, but I haven't collected that info during the runs so far.
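
If I do collect per-transaction logs (pgbench -l) in future runs, computing per-second latency percentiles should be straightforward. A rough sketch, assuming the log gets loaded into a hypothetical table txn_log with one row per transaction:

    -- txn_log(epoch_sec bigint, latency_us bigint) is a hypothetical table
    -- holding the pgbench transaction log, one row per transaction
    SELECT epoch_sec,
           count(*)         AS tps,
           percentile_cont(0.95) WITHIN GROUP (ORDER BY latency_us::float8)
                            AS p95_latency_us,
           max(latency_us)  AS max_latency_us
      FROM txn_log
     GROUP BY epoch_sec
     ORDER BY epoch_sec;

That would show both the per-second throughput and whether some transactions see much higher latencies than the rest.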

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

