Re: [GENERAL] Puzzling full database lock

2012-02-06 Thread Merlin Moncure
On Fri, Feb 3, 2012 at 2:55 PM, Christopher Opena counterv...@gmail.com wrote:
 Merlin, thanks for the response.

no problem.  if you're open to architecture suggestions you might also
want to consider going with HS/SR and getting those large olap queries
off your main database.  you'll have to configure it to be very
forgiving of long running queries (which can pause replication) but
this should be a huge win in terms of total load on the main server.
also, you can skimp on hardware to a certain degree to cut costs.  we
still want to get to the bottom of this obviously but it can't hurt to
get a plan 'B' going...

merlin

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Puzzling full database lock

2012-02-03 Thread Christopher Opena
Merlin, thanks for the response.  My comments below, but firstly, does
anyone know if autovacuum is affected by setting a statement_timeout?
 There was a long thread here from 2007'ish:

http://thread.gmane.org/gmane.comp.db.postgresql.devel.general/80044/focus=93847

But it's unclear to me which way that ended up going.  We're thinking of
setting statement_timeout to something fairy high (that will catch these
queries we've been seeing) so that we can further troubleshoot over time.
 We're worried, however, that autovacuum might be affected.

Random thoughts/suggestions:
 *) Look for some correlation between non-idle process count and
 locking situation.  You are running a lot of processes and if I was in
 your shoes I would be strongly looking at pgbouncer to handle
 connection pooling.  You could be binding in the database or (worse)
 the kernel


We're definitely looking at our options with pgbouncer right now; issue
being that we'd have to have a bouncer per database, and our architecture
right now calls for many databases, many connections so we're trying to
limit that by having our application pooler limit the amount of active
connections one can have to the application itself (and thereby to the
database, by proxy).  This is still an option, however, so we're doing some
research here.


 *) Try logging checkpoints to see if there is any correlation with your
 locks.


We've been logging checkpoints for several days now with no hard
correlation that we can find.  Thanks for the suggestion though!


 *) An strace of both the 'high cpu' process and one of the blocked
 process might give some clues -- in particular if you are being
 blocked on a system call


We have yet to try this; definitely next in line.


 *) Given enough time, do your high cpu queries ever complete? Are they
 writing or reading?


The queries are reading in this case; we haven't allowed them to run their
course because of the effect it has on our entire user base.  Right now
we've patched our application to catch these cases and handle them outright
by notifying the end user that there is a potentially damaging query that
is being cancelled.  Short term solution, but for now it's something we
need to do until we can replicate and solve the problem on a non-production
system.


 *) What happens to overall system load if you lower shared_buffers to,
 say, 4gb?


We're going to be trying this as well once we have an appropriate
maintenance window.  It seems to be a general consensus that this is
something we should at least try.

Thanks,
-Chris.


Re: [GENERAL] Puzzling full database lock

2012-02-03 Thread Tom Lane
Christopher Opena counterv...@gmail.com writes:
 Merlin, thanks for the response.  My comments below, but firstly, does
 anyone know if autovacuum is affected by setting a statement_timeout?

It is not; in all recent PG releases, the autovacuum processes are
careful to force a session setting of zero.

regards, tom lane

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Puzzling full database lock

2012-02-02 Thread Merlin Moncure
On Wed, Feb 1, 2012 at 6:38 PM, Christopher Opena counterv...@gmail.com wrote:
 Hello folks,

 We've been running into some very strange issues of late with our PostgreSQL
 database(s).  We have an issue where a couple of queries push high CPU on a
 few of our processors and the entire database locks (reads, writes, console
 cannot be achieved unless the high CPU query procs are killed).  Further
 investigation shows ~59% total cpu usage (we have 16 total cores), low io,
 and mid-to-low memory usage (we have 74GB of memory, shared_buffers=16GB).
  We had previously seen some high io problems but those turned out to be
 unconnected and ultimately solved, yet we are still seeing a complete lock
 of the DB occasionally as previously described.

 The queries themselves are not any different than normal usage on other
 databases; they are pulling back a little more data but there's nothing that
 stands out about them as far as query construction.

 One thing that we aren't sure of is whether or not we are running into a
 general connection pooling issue.  Our typical number of postgresql
 processes fluctuates between 1,400 and 1,600 - most of which are idle - as
 we have a number of application servers all connecting to a central
 read/write master (the master replicates out to a secondary via streaming
 replication).  We have max_processes set to 3,000 after tweaking some kernel
 memory parameters so at least we know we aren't exceeding that, but is there
 a practical real world limit or issue with setting this too high?

 Ultimately, the problem we're seeing is a full read/write lock on a system
 that is apparently at medium usage levels once we got rid of our high io red
 herring.  Honestly I'm a little stumped as to where to look next; is there
 some specific metric I might be missing here?

 Any help is greatly appreciated,

Random thoughts/suggestions:
*) Look for some correlation between non-idle process count and
locking situation.  You are running a lot of processes and if I was in
your shoes I would be strongly looking at pgbouncer to handle
connection pooling.  You could be binding in the database or (worse)
the kernel

*) Try logging checkpoints to see if there is any correlation with your locks.

*) An strace of both the 'high cpu' process and one of the blocked
process might give some clues -- in particular if you are being
blocked on a system call

*) Given enough time, do your high cpu queries ever complete? Are they
writing or reading?

*) What happens to overall system load if you lower shared_buffers to, say, 4gb?

merlin

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Puzzling full database lock

2012-02-02 Thread Florian Weimer
* Christopher Opena:

 We've been running into some very strange issues of late with our
 PostgreSQL database(s).  We have an issue where a couple of queries push
 high CPU on a few of our processors and the entire database locks (reads,
 writes, console cannot be achieved unless the high CPU query procs are
 killed).

Does the kernel log something?  Does dmesg display anything
illuminating?

-- 
Florian Weimerfwei...@bfk.de
BFK edv-consulting GmbH   http://www.bfk.de/
Kriegsstraße 100  tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Puzzling full database lock

2012-02-01 Thread Carlos Mennens
On Wed, Feb 1, 2012 at 7:38 PM, Christopher Opena counterv...@gmail.com wrote:
 Hello folks,

 We've been running into some very strange issues of late with our PostgreSQL
 database(s).  We have an issue where a couple of queries push high CPU on a
 few of our processors and the entire database locks (reads, writes, console
 cannot be achieved unless the high CPU query procs are killed).  Further
 investigation shows ~59% total cpu usage (we have 16 total cores), low io,
 and mid-to-low memory usage (we have 74GB of memory, shared_buffers=16GB).

Just out of curiosity, what OS are you running?

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Puzzling full database lock

2012-02-01 Thread Christopher Opena
It's CentOS 5.5, PostgreSQL version 9.0.4 (x86_64).

On Wed, Feb 1, 2012 at 4:44 PM, Carlos Mennens carlos.menn...@gmail.comwrote:

 On Wed, Feb 1, 2012 at 7:38 PM, Christopher Opena counterv...@gmail.com
 wrote:
  Hello folks,
 
  We've been running into some very strange issues of late with our
 PostgreSQL
  database(s).  We have an issue where a couple of queries push high CPU
 on a
  few of our processors and the entire database locks (reads, writes,
 console
  cannot be achieved unless the high CPU query procs are killed).  Further
  investigation shows ~59% total cpu usage (we have 16 total cores), low
 io,
  and mid-to-low memory usage (we have 74GB of memory,
 shared_buffers=16GB).

 Just out of curiosity, what OS are you running?



Re: [GENERAL] Puzzling full database lock

2012-02-01 Thread Carlos Mennens
On Wed, Feb 1, 2012 at 7:51 PM, Christopher Opena counterv...@gmail.com wrote:
 It's CentOS 5.5, PostgreSQL version 9.0.4 (x86_64).

That seems extremely bleeding edge for CentOS. Did you compile this
package from source RPM or some 3rd party package maintainer for
PostgreSQL?

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Puzzling full database lock

2012-02-01 Thread Alan Hodgson
 On Wed, Feb 1, 2012 at 7:38 PM, Christopher Opena counterv...@gmail.com 
wrote:
  Hello folks,
  
  We've been running into some very strange issues of late with our
  PostgreSQL database(s).  We have an issue where a couple of queries
  push high CPU on a few of our processors and the entire database locks
  (reads, writes, console cannot be achieved unless the high CPU query
  procs are killed).  Further investigation shows ~59% total cpu usage
  (we have 16 total cores), low io, and mid-to-low memory usage (we have
  74GB of memory, shared_buffers=16GB).

Define low I/O. The only things I've ever seen interfere with console access  
are running out of memory or excessive disk i/o. If you're seeing even 6-12% 
iowait on a 16 core machine you might very well have swamped your available 
disk I/O.


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Puzzling full database lock

2012-02-01 Thread Christopher Opena
It was installed from pgrpms.org's repository.

On Wed, Feb 1, 2012 at 4:55 PM, Carlos Mennens carlos.menn...@gmail.comwrote:

 On Wed, Feb 1, 2012 at 7:51 PM, Christopher Opena counterv...@gmail.com
 wrote:
  It's CentOS 5.5, PostgreSQL version 9.0.4 (x86_64).

 That seems extremely bleeding edge for CentOS. Did you compile this
 package from source RPM or some 3rd party package maintainer for
 PostgreSQL?



Re: [GENERAL] Puzzling full database lock

2012-02-01 Thread Ondrej Ivanič
Hi,

On 2 February 2012 11:38, Christopher Opena counterv...@gmail.com wrote:
 We've been running into some very strange issues of late with our PostgreSQL
 database(s).  We have an issue where a couple of queries push high CPU on a
 few of our processors and the entire database locks (reads, writes, console
 cannot be achieved unless the high CPU query procs are killed).  Further
 investigation shows ~59% total cpu usage (we have 16 total cores), low io,
 and mid-to-low memory usage (we have 74GB of memory, shared_buffers=16GB).

I think 16GB is too much. We started with 9GB (16 cores, 80GB RAM,
SAN) and then experimented with lower value (6GB) but never used in
the production  because we switched to different database / storage
technology. Anyway, Overal CPU utilisation was lower using 6GB.

If CPU util is high because of io waits then it might be worth to play
with dirty_background_ratio and dirty_ratio. The problem is that the
value is percentage and you have 74GB. CentOS has 10% and 40% as
default value for dirty_background_ratio and dirty_ratio respectively.
10% of 74GB is 7.4GB and there is no storage controller with 7.4GB of
cache so you get IO waits (and high load). So writes will backup until
you hit 40% hard limit (vm.dirty_ratio) which is even worse (~30GB to
flush). I think you should try lower both. For example, try 1 and 10
for vm.dirty_background_ratio and vm.dirty_ratio respectively.

 One thing that we aren't sure of is whether or not we are running into a
 general connection pooling issue.  Our typical number of postgresql
 processes fluctuates between 1,400 and 1,600 - most of which are idle - as
 we have a number of application servers all connecting to a central
 read/write master (the master replicates out to a secondary via streaming
 replication).  We have max_processes set to 3,000 after tweaking some kernel
 memory parameters so at least we know we aren't exceeding that, but is there
 a practical real world limit or issue with setting this too high?

I would use connection pooler like PG-Pool II. It can add transparent
failover and you don't need max_processes set so high (plus parallel
query feature could be useful).

-- 
Ondrej Ivanic
(ondrej.iva...@gmail.com)

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Puzzling full database lock

2012-02-01 Thread Christopher Opena
Do you mean 6-12% of total iowait, or per cpu?  Our average iowait in the
last week is 34.31% of a total 1600% with an average idle of 1451.76%.  Our
iowait *does* spike occasionally (today it went up to 148.01%) but it
doesn't coincide with the lock happening.  At the time of the lock we were
at 10.58% iowait, which is quite a bit below our average.

Thanks,
-Chris.

On Wed, Feb 1, 2012 at 4:55 PM, Alan Hodgson ahodg...@simkin.ca wrote:


 Define low I/O. The only things I've ever seen interfere with console
 access
 are running out of memory or excessive disk i/o. If you're seeing even
 6-12%
 iowait on a 16 core machine you might very well have swamped your available
 disk I/O.


 --
 Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-general



Re: [GENERAL] Puzzling full database lock

2012-02-01 Thread Alan Hodgson
On Wednesday, February 01, 2012 05:13:15 PM Christopher Opena wrote:
 Do you mean 6-12% of total iowait, or per cpu?  Our average iowait in the
 last week is 34.31% of a total 1600% with an average idle of 1451.76%.  Our
 iowait *does* spike occasionally (today it went up to 148.01%) but it
 doesn't coincide with the lock happening.  At the time of the lock we were
 at 10.58% iowait, which is quite a bit below our average.
 

Total, but it doesn't sound like that's the problem.


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Puzzling full database lock

2012-02-01 Thread Christopher Opena
Yeah, it's strange because we definitely have periods of high iowait but
this is not when the locks are happening.  If I could correlate it directly
to that it would be so much easier.  Thanks again for the response!

On Wed, Feb 1, 2012 at 8:42 PM, Alan Hodgson ahodg...@simkin.ca wrote:

 On Wednesday, February 01, 2012 05:13:15 PM Christopher Opena wrote:
  Do you mean 6-12% of total iowait, or per cpu?  Our average iowait in the
  last week is 34.31% of a total 1600% with an average idle of 1451.76%.
  Our
  iowait *does* spike occasionally (today it went up to 148.01%) but it
  doesn't coincide with the lock happening.  At the time of the lock we
 were
  at 10.58% iowait, which is quite a bit below our average.
 

 Total, but it doesn't sound like that's the problem.