On 3/12/09 11:28 AM, "Tom Lane" <t...@sss.pgh.pa.us> wrote:

Scott Carey <sc...@richrelevance.com> writes:
> They are not meaningless.  It is certainly more to understand, but the test 
> is entirely valid without that.  In a CPU bound / RAM bound case, as 
> concurrency increases you look for the throughput trend, the %CPU use trend 
> and the context switch rate trend.  More information would be useful but the 
> test is validated by the evidence that it is held up by lock contention.

Er ... *what* evidence?  There might be evidence somewhere that proves
that, but Jignesh hasn't shown it.  The available data suggests that the
first-order performance limiter in this test is something else.
Otherwise it should be possible to max out the performance with a lot
less than 1000 active backends.

                        regards, tom lane

Evidence:

Ramp up the concurrency and measure throughput.  Throughput ramps up linearly, then peaks at some value X with low CPU utilization.  Change the lock code, and throughput scales past that point to a much higher CPU load.
That's evidence.  Please describe a scenario that explains it otherwise.  Your last statement above is true but not applicable here: the test is not 1000 backends, it lists 1000 users.

There is a key difference between users and backends, and in fact the evidence shows that the 1000 figure cannot be active backends (the column is labeled "users").  If the test is not I/O bound, throughput must cap out when the number of active backends is roughly at or below the number of CPUs, and as noted it does not.  That isn't proof that something is wrong with the test; it's proof that the 1000 number cannot be active backends.

I spent a decade solving and tuning CPU scalability problems in CPU/memory 
bound systems.  Sophisticated tests peak at a user count much greater than the 
CPU count, because real users don't execute requests as fast as possible.  In a 
chain of servers several tiers deep, each tier can have a different level of 
concurrent activity.  It's useful to measure concurrency at each tier, but that 
is almost impossible in Postgres (easy in Oracle / MSSQL).  Most systems have a 
limited thread pool but can queue far more requests than that; Postgres and many 
other databases don't queue, so clients must, via connection pools.  The 
resulting behavior of too much concurrency is thrashing and inefficiency.  In a 
test that ramps up concurrency, this shows up as a throughput peak followed by a 
steep drop-off as concurrency enters the thrashing state.  At that point, heavy 
context switching and sometimes RAM pressure are typical symptoms.
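The users-versus-backends distinction above follows directly from Little's law: average active backends = arrival rate x service time, while the user count also includes think time.  A minimal sketch, using purely illustrative numbers (1 second think time, 10 ms service time are my assumptions, not figures from the test under discussion):

```python
# Little's law sketch: how 1000 "users" with think time map to far
# fewer concurrently active backends.  All numbers are illustrative
# assumptions, not measurements from the test under discussion.

def active_backends(users, think_time_s, service_time_s):
    """Average number of requests in service at once.

    Each user cycles: think for think_time_s, then hold a backend for
    service_time_s.  By Little's law, concurrency inside the server is
    arrival_rate * service_time.
    """
    cycle = think_time_s + service_time_s
    arrival_rate = users / cycle          # requests/sec offered
    return arrival_rate * service_time_s  # average busy backends

# 1000 users, 1 s think time, 10 ms per query -> about 9.9 busy
# backends, comfortably under a typical CPU count.
print(active_backends(1000, 1.0, 0.010))

# With zero think time every user is always in service: 1000 busy
# backends, which is where "users == backends" intuition comes from.
print(active_backends(1000, 0.0, 0.010))
```

This is why a test listing 1000 users can be nowhere near 1000 active backends once any think time is present.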

The only way to construct a test that shows the described behavior 
(linear ramp-up, then plateau) is to have lock contention, I/O bottlenecks, or 
CPU saturation.  The number of users is irrelevant; the trend is the same 
regardless of the relationship between user count and active backend count 
(0-second delay or 1-second delay: same result, different X axis).  If it were 
an I/O or client bottleneck, changing the lock code wouldn't have made it faster.
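The "linear ramp-up, then plateau at low CPU" shape is exactly what a serialization term predicts.  A minimal Amdahl-style model, assuming a single contended lock whose critical section takes a fraction sigma of each transaction (sigma and the 10 ms transaction time are illustrative assumptions, not figures from the test):

```python
# Amdahl-style contention model: throughput vs. concurrency when a
# fraction `sigma` of each transaction runs under one contended lock.
# Illustrative numbers only; not measurements from the discussed test.

def throughput(n, sigma, per_txn_s=0.010):
    """Transactions/sec for n active backends.

    The serialized portion (sigma * per_txn_s) admits only one backend
    at a time, so throughput is capped at 1 / (sigma * per_txn_s) no
    matter how large n gets -- a plateau with most CPUs idle.
    """
    base = 1.0 / per_txn_s                 # one backend's throughput
    return n * base / (1.0 + sigma * (n - 1))

for n in (1, 8, 64, 512):
    before = throughput(n, sigma=0.05)     # contended lock: plateaus early
    after = throughput(n, sigma=0.005)     # shorter critical section
    print(n, round(before), round(after))
```

Shrinking the serialized fraction (i.e., changing the lock code) moves the plateau far to the right, which matches the observed jump in both throughput and CPU load.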

The evidence makes it 100% certain that the first test result is limited by 
locks, and that changing them increased throughput.
