On 3/12/09 1:35 PM, "Greg Smith" <gsm...@gregsmith.com> wrote:

On Thu, 12 Mar 2009, Jignesh K. Shah wrote:

> As soon as I get more "cycles" I will try variations of it but it would
> help if others can try it out in their own environments to see if it
> helps their instances.

What you should do next is see whether you can remove the bottleneck your
test is running into by using a connection pooler.

I doubt it is running into a bottleneck due to that; the symptoms aren't right. 
He can change his test to use a near-zero delay to simulate such a connection 
pool.

If it were an issue with concurrency at that level, the results would not have 
scaled linearly with user count up to a plateau the way they did.  There would be 
a steep drop-off from thrashing as concurrency kept going up.  Context switch 
data would help, since thrashing shows up there as something measurable.  I see 
no evidence of concurrency thrashing yet, but more tests and data would 
help.

The disconnect is that the Users column in his data does not represent 
back-ends; it represents concurrent users on the front end.  Whether these 
pool while idle or not is not clear.  It would be useful to rule that 
possibility out, but it looks like an improbable diagnosis to me given the 
lack of a performance decrease as concurrency goes up.

Furthermore, if the problem were due to too much concurrency in the database 
with active connections, it's hard to see how changing the lock code would 
change the result the way it did - increasing CPU use and throughput accordingly.  
Again, context switch rate info would help rule out many possibilities.
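
Since context switch data comes up twice here, a minimal sketch of how to capture 
it alongside a test run, assuming a Linux host (the cumulative counter is the 
"ctxt" line in /proc/stat; on Solaris, vmstat or mpstat reports the equivalent 
figure):

# Minimal sketch: print the system-wide context switch rate once per second
# by diffing the cumulative "ctxt" counter from /proc/stat (Linux only).
import time

def read_ctxt():
    with open("/proc/stat") as f:
        for line in f:
            if line.startswith("ctxt "):
                return int(line.split()[1])
    raise RuntimeError("no ctxt line in /proc/stat")

prev = read_ctxt()
while True:
    time.sleep(1)
    cur = read_ctxt()
    print("context switches/sec:", cur - prev)
    prev = cur

A throughput plateau with a steady per-transaction switch rate points away from 
thrashing; a sharp jump in switches per transaction as users are added points 
toward it.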

That's what I think
most informed people would do were you to ask how to set up an optimal
environment using PostgreSQL that aimed to serve thousands of clients.
If that makes your bottleneck go away, that's what you should be
recommending to customers who want to scale in this fashion too.

First, just run a test with a tiny delay (5 ms? 0?) and fewer users to compare.  
If your theory that a connection pooler would help is right, that test should 
provide higher throughput with a low user count and not be lock limited.  This may 
be easier to run than setting up a pooler, though he should investigate one 
regardless.
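
For reference, a rough sketch of that comparison test, assuming psycopg2 and 
placeholder connection details; a small thread count with THINK_TIME_SEC near 
zero approximates the all-active-connection pattern a pooler would present to 
the server:

# Sketch of the suggested comparison test: a few client threads running a
# placeholder query with a configurable think-time delay.
import threading
import time

import psycopg2

DSN = "dbname=test user=test"   # placeholder connection string
QUERY = "SELECT 1"              # stand-in for the real workload
THINK_TIME_SEC = 0.005          # try 0 and 5 ms, per the suggestion above
THREADS = 32                    # few users, all of them active
DURATION_SEC = 60

counts = [0] * THREADS

def worker(i):
    conn = psycopg2.connect(DSN)
    conn.autocommit = True
    cur = conn.cursor()
    deadline = time.time() + DURATION_SEC
    while time.time() < deadline:
        cur.execute(QUERY)
        cur.fetchall()
        counts[i] += 1
        if THINK_TIME_SEC:
            time.sleep(THINK_TIME_SEC)
    conn.close()

threads = [threading.Thread(target=worker, args=(i,)) for i in range(THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("transactions/sec:", sum(counts) / DURATION_SEC)

If a few dozen busy threads already beat the plateau seen with thousands of 
mostly idle users, that supports the pooler recommendation; if they hit the same 
ceiling, the lock change is doing something a pooler cannot.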

If the
bottleneck moves to somewhere else, that new hot spot might be one people
care more about.  Given that there are multiple good pooling solutions
floating around already, it's hard to justify dumping coding and testing
resources here if that makes the problem move somewhere else.

It's worth ruling out: even if the likelihood is small, the fix is easy.  However, 
I don't see the drop in throughput from peak as more concurrency is added that is 
the hallmark of this problem - usually accompanied by a lot of context switching 
and a sudden increase in CPU use per transaction.

The biggest disconnect in load testing almost always occurs over the definition 
of "concurrent users".
Think of an HTTP app backed by a db - about as simple as it gets these days 
(it gets really fun with five- or six-tier fanned-out setups).

"Users" could mean:
Number of application user logins used
Number of test harness threads or processes that are active
Number of open HTTP connections
Number of HTTP requests being processed
Number of connections from the app to the db
Number of active connections from the app to the db

Knowing which of these is the topic, and what that means in relation to all the 
others, is often messy.  Even without knowing which one a result refers to, you 
can still learn a lot.  The data in the results here prove it's not the last one 
on the list above, nor the first one.  It could still be any of the middle four, 
but it is most likely #2 or the second to last one (which might be equivalent).
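
As a way to pin down the last two items when the database is in hand, a small 
sketch that groups server backends by state, assuming psycopg2 and a PostgreSQL 
version (9.2 or later) where pg_stat_activity exposes a "state" column; on older 
servers, idle backends show up as current_query = '<IDLE>' instead:

# Sketch: count total vs. active connections as seen by the server itself.
# Assumes PostgreSQL 9.2+ and a placeholder DSN; run while the load test
# is at its plateau.
import psycopg2

conn = psycopg2.connect("dbname=test user=test")   # placeholder DSN
cur = conn.cursor()
cur.execute("SELECT state, count(*) FROM pg_stat_activity GROUP BY state")
for state, count in cur.fetchall():
    print(state, count)
conn.close()

If the count of active backends stays far below the reported user count, the 
test is really measuring one of the front-end definitions above.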
