Hi,

On 10/27/2016 01:44 PM, Amit Kapila wrote:
On Thu, Oct 27, 2016 at 4:15 AM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:

FWIW I plan to run the same test with logged tables - if it shows similar
regression, I'll be much more worried, because that's a fairly typical
scenario (logged tables, data set > shared buffers), and we surely can't
just go and break that.


Sure, please do those tests.


OK, so I do have results for those tests - that is, scale 3000 with shared_buffers=16GB (so continuously writing out dirty buffers). The following reports show the results slightly differently - all three "tps charts" next to each other, then the speedup charts and tables.

Overall, the results are surprisingly positive - see these reports (all ending with "-retest"):

[1] http://tvondra.bitbucket.org/index2.html#dilip-3000-logged-sync-retest

[2] http://tvondra.bitbucket.org/index2.html#pgbench-3000-logged-sync-noskip-retest

[3] http://tvondra.bitbucket.org/index2.html#pgbench-3000-logged-sync-skip-retest

All three show significant improvement, even with fairly low client counts. For example with 72 clients, the tps improves by 20%, without significantly affecting variability of the results (measured as stddev, more on this later).

It's interesting, however, that "no_content_lock" performs almost exactly the same as master, while the other two patches improve significantly.

The other interesting thing is that "pgbench -N" [3] shows no such improvement, unlike regular pgbench and Dilip's workload. Not sure why, though - I'd expect to see significant improvement in this case.

I have also repeated those tests with clog buffers increased to 512 (so 4x the current maximum of 128). I only have results for Dilip's workload and "pgbench -N":

[4] http://tvondra.bitbucket.org/index2.html#dilip-3000-logged-sync-retest-512

[5] http://tvondra.bitbucket.org/index2.html#pgbench-3000-logged-sync-skip-retest-512

The results are somewhat surprising, I guess, because the effect is wildly different for each workload.

For Dilip's workload, increasing clog buffers to 512 pretty much eliminates all benefits of the patches. For example with 288 clients, the group_update patch gives ~60k tps on 128 buffers [1] but only 42k tps on 512 buffers [4].

With "pgbench -N", the effect is exactly the opposite - while with 128 buffers there was pretty much no benefit from any of the patches [3], with 512 buffers we suddenly get almost 2x the throughput, but only for group_update and master (while the other two patches show no improvement at all).

I don't have results for the regular pgbench ("noskip") with 512 buffers yet, but I'm curious what that will show.

In general, however, I think the patches don't show any regression on any of those workloads (at least not with 128 buffers). Based solely on the results, I like group_update best, because it performs as well as master or significantly better.

2. We do see in some cases that the granular_locking and
no_content_lock patches have shown a significant increase in
contention on CLOGControlLock. I have already shared my analysis
for the same upthread [8].


I've read that analysis, but I'm not sure I see how it explains the "zig zag" behavior. I do understand that shifting the contention to some other (already busy) lock may negatively impact throughput, or that the group_update may result in updating multiple clog pages, but I don't understand two things:

(1) Why this should result in the fluctuations we observe in some of the cases. For example, why should we see 150k tps with 72 clients, then drop to 92k with 108 clients, then climb back to 130k with 144 clients, then drop to 84k with 180 clients, etc. That seems fairly strange.

(2) Why this should affect all three patches, when only group_update has to modify multiple clog pages.

For example consider this:

    http://tvondra.bitbucket.org/index2.html#dilip-300-logged-async

Looking at the % of time spent on different locks with the group_update patch, I see this (one column per client count, ignoring locks with only ~1%):

 event_type     wait_event       36   72  108  144  180  216  252  288
 ---------------------------------------------------------------------
 -              -                60   63   45   53   38   50   33   48
 Client         ClientRead       33   23    9   14    6   10    4    8
 LWLockNamed    CLogControlLock   2    7   33   14   34   14   33   14
 LWLockTranche  buffer_content    0    2    9   13   19   18   26   22

I don't see any sign of contention shifting to other locks, just CLogControlLock fluctuating between 14% and 33% for some reason.
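
For the record, percentages like these can be obtained by sampling pg_stat_activity (say, once per second) during the benchmark and aggregating the samples afterwards. A simplified sketch of one such sample - not necessarily exactly what my collection scripts do - might be:

    -- one wait event sample across all backends; repeat regularly during
    -- the run and aggregate the counts to get the percentages
    SELECT coalesce(wait_event_type, '-') AS event_type,
           coalesce(wait_event, '-')      AS wait_event,
           count(*)                       AS backends
      FROM pg_stat_activity
     WHERE pid <> pg_backend_pid()        -- skip the monitoring session
     GROUP BY 1, 2
     ORDER BY 3 DESC;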

Now, maybe this has nothing to do with PostgreSQL itself, but maybe it's some sort of CPU / OS scheduling artifact. For example, the system has 36 physical cores, 72 virtual ones (thanks to HT). I find it strange that the "good" client counts are always multiples of 72, while the "bad" ones fall in between.

  72 = 72 * 1   (good)
 108 = 72 * 1.5 (bad)
 144 = 72 * 2   (good)
 180 = 72 * 2.5 (bad)
 216 = 72 * 3   (good)
 252 = 72 * 3.5 (bad)
 288 = 72 * 4   (good)

So maybe this has something to do with how the OS schedules the tasks, or with some internal heuristics in the CPU, or something like that.


On logged tables it usually looks like this (i.e. modest increase for high
client counts at the expense of significantly higher variability):

  http://tvondra.bitbucket.org/#pgbench-3000-logged-sync-skip-64


What variability are you referring to in those results?

Good question. What I mean by "variability" is how stable the tps is during the benchmark (when measured at per-second granularity). For example, let's run a 10-second benchmark, measuring the number of transactions committed each second.

Now, all three of these runs do 1000 tps on average:

  run 1: 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000
  run 2: 500, 1500, 500, 1500, 500, 1500, 500, 1500, 500, 1500
  run 3: 0, 2000, 0, 2000, 0, 2000, 0, 2000, 0, 2000

I guess we agree those runs behave very differently, despite having the same average throughput. This is what STDDEV(tps), i.e. the third chart in the reports, measures.
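
To make that concrete, here is a trivial query computing the average and stddev for those three example runs (using stddev_pop so the numbers come out exact):

    SELECT run, avg(tps)::int AS avg_tps, stddev_pop(tps)::int AS stddev_tps
      FROM (
        SELECT 1 AS run,
               unnest(ARRAY[1000,1000,1000,1000,1000,1000,1000,1000,1000,1000]) AS tps
        UNION ALL
        SELECT 2, unnest(ARRAY[500,1500,500,1500,500,1500,500,1500,500,1500])
        UNION ALL
        SELECT 3, unnest(ARRAY[0,2000,0,2000,0,2000,0,2000,0,2000])
      ) t
     GROUP BY run
     ORDER BY run;

which reports stddev 0, 500 and 1000 respectively, even though the average is 1000 tps in all three cases.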

So for example this [6] shows that the patches give us higher throughput with >= 180 clients, but we also pay for that with increased variability of the results (i.e. the tps chart will have jitter):

[6] http://tvondra.bitbucket.org/index2.html#pgbench-3000-logged-sync-skip-64

Of course, balancing throughput, latency and variability is one of the crucial trade-offs in transaction systems - at some point the resources get saturated and higher throughput can only be achieved at the cost of higher latency (e.g. by grouping requests). But still, we'd like to get stable tps from the system, not something that gives us 2000 tps one second and 0 tps the next.

Of course, this is not perfect - it does not show whether there are transactions with significantly higher latency, and so on. It'd be good to also measure latency, but I haven't collected that info during the runs so far.
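
If I do collect per-transaction logs (pgbench -l) in future runs, computing per-second latency percentiles should be straightforward. A rough sketch, assuming the log gets loaded into a hypothetical table txn_log with one row per transaction:

    -- txn_log(epoch_sec bigint, latency_us bigint) is a hypothetical table
    -- holding the pgbench transaction log, one row per transaction
    SELECT epoch_sec,
           count(*)         AS tps,
           percentile_cont(0.95) WITHIN GROUP (ORDER BY latency_us::float8)
                            AS p95_latency_us,
           max(latency_us)  AS max_latency_us
      FROM txn_log
     GROUP BY epoch_sec
     ORDER BY epoch_sec;

That would show both the per-second throughput and whether some transactions see much higher latencies than the rest.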

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

