Re: [GENERAL] High SYS CPU - need advise

2012-11-21 Thread Vlad Marchenko
update on my problem: despite pgbouncer, the problem still occurs on my 
end.


Also, an interesting observation - I ran several tests with pgbench, using 
queries that I think are prone to trigger the high-sys-cpu stall. What I 
noticed is that when pgbench runs in prepared mode, the system 
behaves fine during the stress test (high user cpu of 85-90%, low sys cpu of 
5-7%) and TPS stays high. But when I used non-prepared modes, the sys cpu 
portion jumps to 40% (and tps drops dramatically as well, though that part is 
understandable).  The test queries are pretty long (10kb+), with a couple 
of outer joins across 1000-record tables with indexes.
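For reference, the two runs can be reproduced with pgbench's -M switch; everything below (script name, client counts, database name) is a placeholder sketch, not the actual test setup:

```shell
# hypothetical invocation; bigquery.sql stands in for the real 10kb+ query
pgbench -n -M prepared -f bigquery.sql -c 32 -j 4 -T 60 testdb
pgbench -n -M simple   -f bigquery.sql -c 32 -j 4 -T 60 testdb
# watch user vs. system CPU with e.g. "vmstat 1" during each run
```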


Maybe we are looking in the wrong place and the issue is somewhere within 
the planner/parser? Is there some extensive locking used in there?


Another observation - it's harder to trigger the high-sys-cpu stall on a 
freshly restarted postgresql. Once it has been running for a while, though, 
it's much easier to do.


-- vlad

On 11/19/12 8:33 AM, Merlin Moncure wrote:

On Sun, Nov 18, 2012 at 4:24 PM, Jeff Janes jeff.ja...@gmail.com wrote:

On Fri, Nov 16, 2012 at 12:13 PM, Vlad marche...@gmail.com wrote:

ok, I've applied that patch and ran. The stall started around 13:50:45...50
and lasted until the end

https://dl.dropbox.com/u/109778/postgresql-2012-11-16_134904-stripped.log

That isn't as much log as I expected.  But I guess only the tip of the
iceberg gets logged, it could be that most spinlocks are contended but
then get acquired just before the sleep (and log) would have happened.

I'm not sure how to figure out just how much spinning is going on
below the surface, but my gut feeling is there isn't enough of it to
explain the amount of slowdown you are seeing (and it would probably
be accounted for as user time rather than system time).


yes. classic spinlock contention symptoms are high user cpu and almost
no work getting done.   something else is going on here?


Maybe we could whip up a dummy standalone program that uses PG's
spinlock code in an intentionally contended way and see how your
system reports resource usage for that.  Maybe your system reports
fighting for cache lines, or whatever is going on behind the scenes, as
system time rather than user time, or something.

I don't think this is the case.  It's looking more and more like
scheduler issues, although it's unclear if that is due to some kind of
regression or just symptoms of backends piling up.  I'm also starting
to wonder if the spinlock contention we are seeing (which, according to
your patch, isn't all that much) is just symptomatic of
scheduler issues.


It would be a little easier to figure out what was going on there if
the log included the pid (%p) so we can see if it is the same process
doing a bunch of consecutive sleeps, or a bunch of different
processes.  Also, logging the millisecond (%m rather than %t) would be
helpful.
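Concretely, such a prefix would look like this in postgresql.conf (%m is the timestamp with milliseconds, %p the backend pid):

```
log_line_prefix = '%m [%p] '
```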

But the way that the point of contention jumps around a lot points to
a scheduler issue rather than a postgres-specific issue.  A backend
gets pre-empted while holding a spinlock, and then the kernel is
resistant to rescheduling it even though all other processes are piled
up behind it.  But again, that theory would lead to more log lines
than we see, I would think.

yeah.  I asked for some longer strace logs but then everyone went home
for the weekend.


Anyway, pgbouncer might help by ensuring that a connection gets
pre-empted by pgbouncer at a point where it has no locks, rather than
by the kernel at a point where it does hold locks.

yes -- I think pgbouncer is the key here because it's going to tell us
if the scheduler is the culprit.  If you look carefully at the
strace logs, you'll see that system call times become unpredictable as
processes are getting scheduled out...particularly yielding calls like
select().  So we need to determine if this is normal high-load
behavior or if there is an underlying mechanism.
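For what it's worth, a per-syscall time summary makes those erratic select() times easier to quantify than a raw trace; a sketch of collecting one during a stall (the pid is a placeholder for a stalled backend):

```shell
# hypothetical sampling session; 12345 = pid of a backend showing high %sys
strace -c -p 12345 &     # -c tallies time and call counts per syscall
sleep 30
kill %1                  # the summary table is printed on detach
```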


Any chance you could try an older kernel and see if that reduces the problem?

This is one thing to try.  Personally I think pgbouncer in transaction
mode is low-hanging fruit if it 'drops in' to the OP's app. If the problem
goes away, this pretty much points the finger at Linux AFAICT (in
which case we can start messing around with the kernel).

merlin




--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] High SYS CPU - need advise

2012-11-19 Thread Vlad Marchenko
OK, so far I settled on taking connection caching on the app side 
(Apache::DBI and prepare_cached) out of the equation and adding pgbouncer as a 
counter-measure. This seems to stabilize the situation - at least I'm 
not able to push the server into a high-sys-cpu stall the way I used to.
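For anyone reproducing this, a minimal transaction-mode pgbouncer setup looks roughly like the following; all names and sizes are illustrative, not the values actually used here:

```
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = trust
pool_mode = transaction    ; server connection released after each transaction
default_pool_size = 20     ; cap concurrent backends near the core count
```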


I'm still interested in getting to the bottom of the issue rather than just 
alleviating it with a connection pooler, so I might set up a server on 
Centos 6.3 (kernel ver 2.6.32) and give it a try.


Alternatively, if anyone has ideas on playing with scheduler-related sysctl 
parameters - please let me know.
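As a starting point, these are the scheduler knobs most commonly experimented with on kernels of that era (illustrative values, not a recommendation):

```
# /etc/sysctl.conf - illustrative only
kernel.sched_migration_cost = 5000000   # discourage migrating tasks between cpus
kernel.sched_autogroup_enabled = 0      # disable per-session task autogrouping
```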


-- vlad







Re: [GENERAL] High SYS CPU - need advise

2012-11-15 Thread Vlad Marchenko

Tom,

I just checked the version I'm running (9.1.6), and the code is quite 
similar (src/backend/storage/lmgr/s_lock.c)


pg_usleep(cur_delay * 1000L);

#if defined(S_LOCK_TEST)
fprintf(stdout, "*");
fflush(stdout);
#endif

/* increase delay by a random fraction between 1X and 2X */
cur_delay += (int) (cur_delay *
  ((double) random() / (double) MAX_RANDOM_VALUE) + 0.5);

/* wrap back to minimum delay when max is exceeded */
if (cur_delay > MAX_DELAY_MSEC)
    cur_delay = MIN_DELAY_MSEC;

-- vlad






Re: [GENERAL] 8.3RC2 vs 8.2.6 testing results

2008-01-29 Thread Vlad Marchenko

Tom:

Yes, they are ints. To (somewhat) check your guess about the role of hash 
aggregation speed, I just ran the oltp test from sysbench 
(http://sysbench.sourceforge.net/docs/#database_mode) on a table with 1 million 
records. That test uses pretty simple queries (which do not use 
aggregation), and 8.3 showed about the same performance as 8.2 (strictly 
speaking, 8.3 was about 1-2% slower, but not 10-15% as with my query).


I'm curious whether the new hash aggregation algorithm was put into 8.3 with a 
performance increase as the goal, or whether it was simply a required change to 
support some other new feature of 8.3? Right now the switch from 8.2 to 8.3 
doesn't seem a favorable step for the type of application that we have...


-- vlad

- Original Message - 
From: Tom Lane [EMAIL PROTECTED]

To: Vlad [EMAIL PROTECTED]
Cc: PG-General pgsql-general@postgresql.org
Sent: Monday, January 28, 2008 9:13 PM
Subject: Re: [GENERAL] 8.3RC2 vs 8.2.6 testing results



Vlad [EMAIL PROTECTED] writes:

2. We ran several tests and found 8.3 generally 10% slower than 8.2.6.


The particular case you are showing here seems to be all about the speed
of hash aggregation --- at least the time differential is mostly in the
HashAggregate step.  What is the data type of a_id?  I speculate that
you're noticing the slightly slower/more complicated hash function that
8.3 uses for integers.  On a case where the data was well distributed
you'd not see any countervailing efficiency gain from those extra
cycles.

regards, tom lane




---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings