Re: [PERFORM] Intermittent hangs with 9.2

2013-09-20 Thread David Whittaker
We haven't seen any issues since we decreased shared_buffers.  We also
tuned some of the longer running / more frequently executed queries, so
that may have had an effect as well, but my money would be on the
shared_buffers change.  If the issue reappears I'll try to get a perf
profile again and post back, but if you don't hear from me again you can assume the
problem is solved.

Thank you all again for the help.

-Dave

On Fri, Sep 13, 2013 at 11:05 AM, David Whittaker d...@iradix.com wrote:




 On Fri, Sep 13, 2013 at 10:52 AM, Merlin Moncure mmonc...@gmail.com wrote:

 On Thu, Sep 12, 2013 at 3:06 PM, David Whittaker d...@iradix.com wrote:
  Hi All,
 
  We lowered shared_buffers to 8G and increased effective_cache_size
  accordingly.  So far, we haven't seen any issues since the adjustment.
  The
  issues have come and gone in the past, so I'm not convinced it won't
  crop up again, but I think the best course is to wait a week or so and
  see how
  things work out before we make any other changes.
 
  Thank you all for your help, and if the problem does reoccur, we'll look
  into the other options suggested, like using a patched postmaster and
  compiling for perf -g.
 
  Thanks again, I really appreciate the feedback from everyone.

 Interesting -- please respond with a follow up if/when you feel
 satisfied the problem has gone away.  Andres was right; I initially
 mis-diagnosed the problem (there is another issue I'm chasing that has
 a similar performance presentation but originates from a different
 area of the code).

 That said, if reducing shared_buffers made *your* problem go away as
 well, then this is more evidence that we have an underlying contention
 mechanism that is somehow influenced by the setting.  Speaking frankly,
 under certain workloads we seem to have contention issues in the
 general area of the buffer system.  I'm thinking (guessing) that the
 problem is that usage_count is getting incremented faster than the
 buffers are getting cleared out, which is then causing the sweeper to
 spend more and more time examining hotly contended buffers.  This may make
 no sense in the context of your issue; I haven't looked at the code
 yet.  Also, I've been unable to cause this to happen in simulated
 testing.  But I'm suspicious (and dollars to doughnuts '0x347ba9' is
 spinlock related).

 Anyways, thanks for the report and (hopefully) the follow up.

 merlin


 You guys have taken the time to help me through this; following up is the
 least I can do.  So far we're still looking good.



Re: [PERFORM] Intermittent hangs with 9.2

2013-09-13 Thread Merlin Moncure
On Thu, Sep 12, 2013 at 3:06 PM, David Whittaker d...@iradix.com wrote:
 Hi All,

 We lowered shared_buffers to 8G and increased effective_cache_size
 accordingly.  So far, we haven't seen any issues since the adjustment.  The
 issues have come and gone in the past, so I'm not convinced it won't crop up
 again, but I think the best course is to wait a week or so and see how
 things work out before we make any other changes.

 Thank you all for your help, and if the problem does reoccur, we'll look
 into the other options suggested, like using a patched postmaster and
 compiling for perf -g.

 Thanks again, I really appreciate the feedback from everyone.

Interesting -- please respond with a follow up if/when you feel
satisfied the problem has gone away.  Andres was right; I initially
mis-diagnosed the problem (there is another issue I'm chasing that has
a similar performance presentation but originates from a different
area of the code).

That said, if reducing shared_buffers made *your* problem go away as
well, then this is more evidence that we have an underlying contention
mechanism that is somehow influenced by the setting.  Speaking frankly,
under certain workloads we seem to have contention issues in the
general area of the buffer system.  I'm thinking (guessing) that the
problem is that usage_count is getting incremented faster than the
buffers are getting cleared out, which is then causing the sweeper to
spend more and more time examining hotly contended buffers.  This may make
no sense in the context of your issue; I haven't looked at the code
yet.  Also, I've been unable to cause this to happen in simulated
testing.  But I'm suspicious (and dollars to doughnuts '0x347ba9' is
spinlock related).
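
One way to test that guess directly is to look at the usage_count
distribution of the buffer pool; a minimal sketch, assuming the
pg_buffercache contrib extension is available on the server (that
availability is an assumption, not something established in this thread):

    -- Buffers per usage_count; a pool saturated at the maximum (5) means
    -- the clock sweep needs many passes before it can evict anything.
    -- Rows with a NULL usagecount are currently unused buffers.
    CREATE EXTENSION IF NOT EXISTS pg_buffercache;
    SELECT usagecount, count(*) AS buffers
    FROM pg_buffercache
    GROUP BY usagecount
    ORDER BY usagecount;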

Anyways, thanks for the report and (hopefully) the follow up.

merlin




Re: [PERFORM] Intermittent hangs with 9.2

2013-09-13 Thread David Whittaker
On Fri, Sep 13, 2013 at 10:52 AM, Merlin Moncure mmonc...@gmail.com wrote:

 On Thu, Sep 12, 2013 at 3:06 PM, David Whittaker d...@iradix.com wrote:
  Hi All,
 
  We lowered shared_buffers to 8G and increased effective_cache_size
  accordingly.  So far, we haven't seen any issues since the adjustment.
  The
   issues have come and gone in the past, so I'm not convinced it won't
   crop up again, but I think the best course is to wait a week or so and
   see how
  things work out before we make any other changes.
 
  Thank you all for your help, and if the problem does reoccur, we'll look
  into the other options suggested, like using a patched postmaster and
  compiling for perf -g.
 
  Thanks again, I really appreciate the feedback from everyone.

 Interesting -- please respond with a follow up if/when you feel
 satisfied the problem has gone away.  Andres was right; I initially
 mis-diagnosed the problem (there is another issue I'm chasing that has
 a similar performance presentation but originates from a different
 area of the code).

 That said, if reducing shared_buffers made *your* problem go away as
 well, then this is more evidence that we have an underlying contention
 mechanism that is somehow influenced by the setting.  Speaking frankly,
 under certain workloads we seem to have contention issues in the
 general area of the buffer system.  I'm thinking (guessing) that the
 problem is that usage_count is getting incremented faster than the
 buffers are getting cleared out, which is then causing the sweeper to
 spend more and more time examining hotly contended buffers.  This may make
 no sense in the context of your issue; I haven't looked at the code
 yet.  Also, I've been unable to cause this to happen in simulated
 testing.  But I'm suspicious (and dollars to doughnuts '0x347ba9' is
 spinlock related).

 Anyways, thanks for the report and (hopefully) the follow up.

 merlin


You guys have taken the time to help me through this; following up is the
least I can do.  So far we're still looking good.


Re: [PERFORM] Intermittent hangs with 9.2

2013-09-12 Thread David Whittaker
Hi All,

We lowered shared_buffers to 8G and increased effective_cache_size
accordingly.  So far, we haven't seen any issues since the adjustment.  The
issues have come and gone in the past, so I'm not convinced it won't crop
up again, but I think the best course is to wait a week or so and see how
things work out before we make any other changes.

Thank you all for your help, and if the problem does reoccur, we'll look
into the other options suggested, like using a patched postmaster and
compiling for perf -g.

Thanks again, I really appreciate the feedback from everyone.

-Dave


 On Wed, Sep 11, 2013 at 1:17 PM, Andres Freund and...@2ndquadrant.com wrote:

 On 2013-09-11 07:43:35 -0500, Merlin Moncure wrote:
   I've been seeing a strange issue with our Postgres install for about a
   year now, and I was hoping someone might be able to help point me at
   the cause. At what seem like fairly random intervals Postgres will
   become unresponsive to the 3 application nodes it services. These
   periods tend to last for 10 - 15 minutes before everything rights
   itself and the system goes back to normal.
  
   During these periods the server will report a spike in the outbound
   bandwidth (from about 1mbs to about 5mbs most recently), a huge spike
   in context switches / interrupts (normal peaks are around 2k/8k
   respectively, and during these periods they’ve gone to 15k/22k), and a
   load average of 100+. CPU usage stays relatively low, but it’s all
   system time reported, user time goes to zero. It doesn’t seem to be
   disk related since we’re running with a shared_buffers setting of 24G,
   which will fit just about our entire database into memory, and the IO
   transactions reported by the server, as well as the disk reads
   reported by Postgres stay consistently low.
  
   We’ve recently started tracking how long statements take to execute,
   and we’re seeing some really odd numbers. A simple delete by primary
   key, for example, from a table that contains about 280,000 rows,
   reportedly took 18h59m46.900s. An update by primary key in that same
   table was reported as 7d 17h 58m 30.415s. That table is frequently
   accessed, but obviously those numbers don't seem reasonable at all.
  
   Some other changes we've made to postgresql.conf:
  
   synchronous_commit = off
  
   maintenance_work_mem = 1GB
   wal_level = hot_standby
   wal_buffers = 16MB
  
   max_wal_senders = 10
  
   wal_keep_segments = 5000
  
   checkpoint_segments = 128
  
   checkpoint_timeout = 30min
  
   checkpoint_completion_target = 0.9
  
   max_connections = 500
  
   The server is a Dell Poweredge R900 with 4 Xeon E7430 processors,
   48GB of RAM, running Cent OS 6.3.
  
   So far we’ve tried disabling Transparent Huge Pages after I found a
   number of resources online that indicated similar interrupt/context
   switch issues, but it hasn’t resolved the problem. I managed to catch
   it happening once and run a perf which showed:
  
  +  41.40%  48154  postmaster  0x347ba9  f  0x347ba9
  +   9.55%  10956  postmaster  0x2dc820  f  set_config_option
  +   8.64%   9946  postmaster  0x5a3d4   f  writeListPage
  +   5.75%   6609  postmaster  0x5a2b0   f  ginHeapTupleFastCollect
  +   2.68%   3084  postmaster  0x192483  f  build_implied_join_equality
  +   2.61%   2990  postmaster  0x187a55  f  build_paths_for_OR
  +   1.86%   2131  postmaster  0x794aa   f  get_collation_oid
  +   1.56%   1822  postmaster  0x5a67e   f  ginHeapTupleFastInsert
  +   1.53%   1766  postmaster  0x1929bc  f  distribute_qual_to_rels
  +   1.33%   1558  postmaster  0x249671  f  cmp_numerics
  
   I’m not sure what 0x347ba9 represents, or why it’s an address rather
   than a method name.

 Try converting it to something more meaningful with addr2line, that
 often has more success.

   That's about the sum of it. Any help would be greatly appreciated and
   if you want any more information about our setup, please feel free to
   ask.

  Reducing shared buffers to around 2gb will probably make the problem go
 away

 That profile doesn't really look like what one of the problems you are
 referring to would look like.

 Based on the profile I'd guess it's possible that you're seeing problems
 with GIN's fastupdate mechanism.
 Try ALTER INDEX whatever SET (FASTUPDATE = OFF); VACUUM
 whatever's_table for all gin indexes.

 It's curious that set_config_option is so high in the profile... Any
 chance you could recompile postgres with -fno-omit-frame-pointer in
 CFLAGS? That would allow you to use perf -g. The performance price of
 that usually is below 1% for postgres.

 Greetings,

 Andres Freund

 --
  Andres Freund http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



Re: [PERFORM] Intermittent hangs with 9.2

2013-09-11 Thread Torsten Förtsch
On 10/09/13 20:04, David Whittaker wrote:
 On Tue, Sep 10, 2013 at 11:26 AM, Andrew Dunstan and...@dunslane.net wrote:
 
 
 On 09/10/2013 11:04 AM, David Whittaker wrote:
 
 
 Hi All,
 
 I've been seeing a strange issue with our Postgres install for
 about a year now, and I was hoping someone might be able to help
 point me at the cause. At what seem like fairly random intervals
 Postgres will become unresponsive to the 3 application nodes it
 services. These periods tend to last for 10 - 15 minutes before
 everything rights itself and the system goes back to normal.
 
 During these periods the server will report a spike in the
 outbound bandwidth (from about 1mbs to about 5mbs most
 recently), a huge spike in context switches / interrupts (normal
 peaks are around 2k/8k respectively, and during these periods
  they’ve gone to 15k/22k), and a load average of 100+. CPU usage
  stays relatively low, but it’s all system time reported, user
  time goes to zero. It doesn’t seem to be disk related since
 we’re running with a shared_buffers setting of 24G, which will
 fit just about our entire database into memory, and the IO
 transactions reported by the server, as well as the disk reads
 reported by Postgres stay consistently low.
 
  We’ve recently started tracking how long statements take to
 execute, and we’re seeing some really odd numbers. A simple
 delete by primary key, for example, from a table that contains
 about 280,000 rows, reportedly took 18h59m46.900s. An update by
 primary key in that same table was reported as 7d 17h 58m
 30.415s. That table is frequently accessed, but obviously those
 numbers don't seem reasonable at all.
 
 Some other changes we've made to postgresql.conf:
 
 synchronous_commit = off
 
 maintenance_work_mem = 1GB
 wal_level = hot_standby
 wal_buffers = 16MB
 
 max_wal_senders = 10
 
 wal_keep_segments = 5000
 
 checkpoint_segments = 128
 
 checkpoint_timeout = 30min
 
 checkpoint_completion_target = 0.9
 
 max_connections = 500
 
 The server is a Dell Poweredge R900 with 4 Xeon E7430
 processors, 48GB of RAM, running Cent OS 6.3.
 
  So far we’ve tried disabling Transparent Huge Pages after I
  found a number of resources online that indicated similar
  interrupt/context switch issues, but it hasn’t resolved the
  problem. I managed to catch it happening once and run a perf
 which showed:
 
  +  41.40%  48154  postmaster  0x347ba9  f  0x347ba9
  +   9.55%  10956  postmaster  0x2dc820  f  set_config_option
  +   8.64%   9946  postmaster  0x5a3d4   f  writeListPage
  +   5.75%   6609  postmaster  0x5a2b0   f  ginHeapTupleFastCollect
  +   2.68%   3084  postmaster  0x192483  f  build_implied_join_equality
  +   2.61%   2990  postmaster  0x187a55  f  build_paths_for_OR
  +   1.86%   2131  postmaster  0x794aa   f  get_collation_oid
  +   1.56%   1822  postmaster  0x5a67e   f  ginHeapTupleFastInsert
  +   1.53%   1766  postmaster  0x1929bc  f  distribute_qual_to_rels
  +   1.33%   1558  postmaster  0x249671  f  cmp_numerics
 
  I’m not sure what 0x347ba9 represents, or why it’s an address
 rather than a method name.
 
 That's about the sum of it. Any help would be greatly
 appreciated and if you want any more information about our
 setup, please feel free to ask.
 
 
 
 I have seen cases like this with very high shared_buffers settings.
 
 24Gb for shared_buffers is quite high, especially on a 48Gb box.
 What happens if you dial that back to, say, 12Gb?
 
 
 I'd be willing to give it a try.  I'd really like to understand what's
 going on here though.  Can you elaborate on that?  Why would 24G of
 shared buffers be too high in this case?  The machine is devoted
 entirely to PG, so having PG use half of the available RAM to cache data
 doesn't feel unreasonable.

Here is what I have recently learned.

The root cause is crash safety and checkpoints. This is certainly
something you want. When you write to the database these operations
first occur in the buffer cache and the particular buffer you write to
is marked dirty. The cache is organized in chunks of 8kb. Additionally,
write operations are also committed to the WAL.

A checkpoint iterates over all dirty buffers, writing them to the
database. After that all buffers are clean again.
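
If you want to watch this cycle, the statistics collector exposes
checkpoint and background-writer counters; a small illustrative query
(the view and its columns are standard in 9.2, though what the numbers
look like on this particular box is of course unknown):

    -- A high buffers_checkpoint relative to buffers_clean suggests most
    -- dirty pages are written in checkpoint bursts rather than trickled
    -- out by the background writer.
    SELECT checkpoints_timed, checkpoints_req,
           buffers_checkpoint, buffers_clean, buffers_backend
    FROM pg_stat_bgwriter;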

Now, if you write to a clean buffer it gets entirely written to the WAL.
That means after 

Re: [PERFORM] Intermittent hangs with 9.2

2013-09-11 Thread Julien Cigar
On Tue, Sep 10, 2013 at 02:04:57PM -0400, David Whittaker wrote:
 Hi Andrew,
 
 
  On Tue, Sep 10, 2013 at 11:26 AM, Andrew Dunstan and...@dunslane.net wrote:
 
 
  On 09/10/2013 11:04 AM, David Whittaker wrote:
 
 
  Hi All,
 
  I've been seeing a strange issue with our Postgres install for about a
  year now, and I was hoping someone might be able to help point me at the
  cause. At what seem like fairly random intervals Postgres will become
  unresponsive to the 3 application nodes it services. These periods tend to
  last for 10 - 15 minutes before everything rights itself and the system
  goes back to normal.
 
  During these periods the server will report a spike in the outbound
  bandwidth (from about 1mbs to about 5mbs most recently), a huge spike in
  context switches / interrupts (normal peaks are around 2k/8k respectively,
   and during these periods they’ve gone to 15k/22k), and a load average of
   100+. CPU usage stays relatively low, but it’s all system time reported,
   user time goes to zero. It doesn’t seem to be disk related since we’re
  running with a shared_buffers setting of 24G, which will fit just about our
  entire database into memory, and the IO transactions reported by the
  server, as well as the disk reads reported by Postgres stay consistently
  low.
 
   We’ve recently started tracking how long statements take to execute, and
  we’re seeing some really odd numbers. A simple delete by primary key, for
  example, from a table that contains about 280,000 rows, reportedly took
  18h59m46.900s. An update by primary key in that same table was reported as
  7d 17h 58m 30.415s. That table is frequently accessed, but obviously those
  numbers don't seem reasonable at all.
 
  Some other changes we've made to postgresql.conf:
 
  synchronous_commit = off
 
  maintenance_work_mem = 1GB
  wal_level = hot_standby
  wal_buffers = 16MB
 
  max_wal_senders = 10
 
  wal_keep_segments = 5000
 
  checkpoint_segments = 128
 
  checkpoint_timeout = 30min
 
  checkpoint_completion_target = 0.9
 
  max_connections = 500
 
  The server is a Dell Poweredge R900 with 4 Xeon E7430 processors, 48GB of
  RAM, running Cent OS 6.3.
 
   So far we’ve tried disabling Transparent Huge Pages after I found a
   number of resources online that indicated similar interrupt/context
   switch issues, but it hasn’t resolved the problem. I managed to catch
   it happening once and
 
   +  41.40%  48154  postmaster  0x347ba9  f  0x347ba9
   +   9.55%  10956  postmaster  0x2dc820  f  set_config_option
   +   8.64%   9946  postmaster  0x5a3d4   f  writeListPage
   +   5.75%   6609  postmaster  0x5a2b0   f  ginHeapTupleFastCollect
   +   2.68%   3084  postmaster  0x192483  f  build_implied_join_equality
   +   2.61%   2990  postmaster  0x187a55  f  build_paths_for_OR
   +   1.86%   2131  postmaster  0x794aa   f  get_collation_oid
   +   1.56%   1822  postmaster  0x5a67e   f  ginHeapTupleFastInsert
   +   1.53%   1766  postmaster  0x1929bc  f  distribute_qual_to_rels
   +   1.33%   1558  postmaster  0x249671  f  cmp_numerics
 
   I’m not sure what 0x347ba9 represents, or why it’s an address rather than
  a method name.
 
  That's about the sum of it. Any help would be greatly appreciated and if
  you want any more information about our setup, please feel free to ask.
 
 
 
  I have seen cases like this with very high shared_buffers settings.
 
  24Gb for shared_buffers is quite high, especially on a 48Gb box. What
  happens if you dial that back to, say, 12Gb?
 
 
 I'd be willing to give it a try.  I'd really like to understand what's
 going on here though.  Can you elaborate on that?  Why would 24G of shared
 buffers be too high in this case?  The machine is devoted entirely to PG,
 so having PG use half of the available RAM to cache data doesn't feel
 unreasonable.

Some of the overhead of bgwriter and checkpoints is more or less linear
in the size of shared_buffers. If your shared_buffers is large a lot of
data could be dirty when a checkpoint starts, resulting in an I/O spike
... (although we've spread checkpoints in recent pg versions, so this
should be less of a problem nowadays).
Another reason is that the OS cache is also being used for reads and
writes and with a large shared_buffers there is a risk of doubly cached
data (in the OS cache + in shared_buffers).
In an ideal world the most frequently used blocks should be in
shared_buffers and less frequently used blocks in the OS cache.
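
To put a number on how much of a large shared_buffers is dirty at any
given moment (and so how big a checkpoint burst could be), pg_buffercache
can help; a hedged example, assuming the contrib extension is installed:

    -- Clean vs. dirty buffers currently held in shared_buffers
    -- (8192 bytes per buffer on a default build).
    SELECT isdirty, count(*) AS buffers,
           pg_size_pretty(count(*) * 8192) AS approx_size
    FROM pg_buffercache
    GROUP BY isdirty;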

 
 
 
  cheers
 
  andrew
 
 

-- 
No trees were killed in the creation of this message.
However, many electrons were terribly inconvenienced.




Re: [PERFORM] Intermittent hangs with 9.2

2013-09-11 Thread Merlin Moncure
On Tue, Sep 10, 2013 at 10:04 AM, David Whittaker d...@iradix.com wrote:
 Hi All,

 I've been seeing a strange issue with our Postgres install for about a year
 now, and I was hoping someone might be able to help point me at the cause.
 At what seem like fairly random intervals Postgres will become unresponsive
 to the 3 application nodes it services. These periods tend to last for 10 -
 15 minutes before everything rights itself and the system goes back to
 normal.

 During these periods the server will report a spike in the outbound
 bandwidth (from about 1mbs to about 5mbs most recently), a huge spike in
 context switches / interrupts (normal peaks are around 2k/8k respectively,
 and during these periods they’ve gone to 15k/22k), and a load average of
 100+. CPU usage stays relatively low, but it’s all system time reported,
 user time goes to zero. It doesn’t seem to be disk related since we’re
 running with a shared_buffers setting of 24G, which will fit just about our
 entire database into memory, and the IO transactions reported by the server,
 as well as the disk reads reported by Postgres stay consistently low.

 We’ve recently started tracking how long statements take to execute, and
 we’re seeing some really odd numbers. A simple delete by primary key, for
 example, from a table that contains about 280,000 rows, reportedly took
 18h59m46.900s. An update by primary key in that same table was reported as
 7d 17h 58m 30.415s. That table is frequently accessed, but obviously those
 numbers don't seem reasonable at all.

 Some other changes we've made to postgresql.conf:

 synchronous_commit = off

 maintenance_work_mem = 1GB
 wal_level = hot_standby
 wal_buffers = 16MB

 max_wal_senders = 10

 wal_keep_segments = 5000

 checkpoint_segments = 128

 checkpoint_timeout = 30min

 checkpoint_completion_target = 0.9

 max_connections = 500

 The server is a Dell Poweredge R900 with 4 Xeon E7430 processors, 48GB of
 RAM, running Cent OS 6.3.

 So far we’ve tried disabling Transparent Huge Pages after I found a number
 of resources online that indicated similar interrupt/context switch issues,
 but it hasn’t resolved the problem. I managed to catch it happening once and
 run a perf which showed:

 +  41.40%  48154  postmaster  0x347ba9  f  0x347ba9
 +   9.55%  10956  postmaster  0x2dc820  f  set_config_option
 +   8.64%   9946  postmaster  0x5a3d4   f  writeListPage
 +   5.75%   6609  postmaster  0x5a2b0   f  ginHeapTupleFastCollect
 +   2.68%   3084  postmaster  0x192483  f  build_implied_join_equality
 +   2.61%   2990  postmaster  0x187a55  f  build_paths_for_OR
 +   1.86%   2131  postmaster  0x794aa   f  get_collation_oid
 +   1.56%   1822  postmaster  0x5a67e   f  ginHeapTupleFastInsert
 +   1.53%   1766  postmaster  0x1929bc  f  distribute_qual_to_rels
 +   1.33%   1558  postmaster  0x249671  f  cmp_numerics

 I’m not sure what 0x347ba9 represents, or why it’s an address rather than a
 method name.

 That's about the sum of it. Any help would be greatly appreciated and if you
 want any more information about our setup, please feel free to ask.


Reducing shared buffers to around 2gb will probably make the problem go away

*) What's your ratio of reads to writes (approximately)?

*) How many connections are open when it happens?  Do connections pile on after that?

*) Are you willing to run custom patched postmaster to help
troubleshoot the problem?

merlin




Re: [PERFORM] Intermittent hangs with 9.2

2013-09-11 Thread Andres Freund
On 2013-09-11 07:43:35 -0500, Merlin Moncure wrote:
  I've been seeing a strange issue with our Postgres install for about a year
  now, and I was hoping someone might be able to help point me at the cause.
  At what seem like fairly random intervals Postgres will become unresponsive
  to the 3 application nodes it services. These periods tend to last for 10 -
  15 minutes before everything rights itself and the system goes back to
  normal.
 
  During these periods the server will report a spike in the outbound
  bandwidth (from about 1mbs to about 5mbs most recently), a huge spike in
  context switches / interrupts (normal peaks are around 2k/8k respectively,
  and during these periods they’ve gone to 15k/22k), and a load average of
  100+. CPU usage stays relatively low, but it’s all system time reported,
  user time goes to zero. It doesn’t seem to be disk related since we’re
  running with a shared_buffers setting of 24G, which will fit just about our
  entire database into memory, and the IO transactions reported by the server,
  as well as the disk reads reported by Postgres stay consistently low.
 
  We’ve recently started tracking how long statements take to execute, and
  we’re seeing some really odd numbers. A simple delete by primary key, for
  example, from a table that contains about 280,000 rows, reportedly took
  18h59m46.900s. An update by primary key in that same table was reported as
  7d 17h 58m 30.415s. That table is frequently accessed, but obviously those
  numbers don't seem reasonable at all.
 
  Some other changes we've made to postgresql.conf:
 
  synchronous_commit = off
 
  maintenance_work_mem = 1GB
  wal_level = hot_standby
  wal_buffers = 16MB
 
  max_wal_senders = 10
 
  wal_keep_segments = 5000
 
  checkpoint_segments = 128
 
  checkpoint_timeout = 30min
 
  checkpoint_completion_target = 0.9
 
  max_connections = 500
 
  The server is a Dell Poweredge R900 with 4 Xeon E7430 processors, 48GB of
  RAM, running Cent OS 6.3.
 
  So far we’ve tried disabling Transparent Huge Pages after I found a number
  of resources online that indicated similar interrupt/context switch issues,
  but it hasn’t resolved the problem. I managed to catch it happening once and
  run a perf which showed:
 
  +  41.40%  48154  postmaster  0x347ba9  f  0x347ba9
  +   9.55%  10956  postmaster  0x2dc820  f  set_config_option
  +   8.64%   9946  postmaster  0x5a3d4   f  writeListPage
  +   5.75%   6609  postmaster  0x5a2b0   f  ginHeapTupleFastCollect
  +   2.68%   3084  postmaster  0x192483  f  build_implied_join_equality
  +   2.61%   2990  postmaster  0x187a55  f  build_paths_for_OR
  +   1.86%   2131  postmaster  0x794aa   f  get_collation_oid
  +   1.56%   1822  postmaster  0x5a67e   f  ginHeapTupleFastInsert
  +   1.53%   1766  postmaster  0x1929bc  f  distribute_qual_to_rels
  +   1.33%   1558  postmaster  0x249671  f  cmp_numerics
 
  I’m not sure what 0x347ba9 represents, or why it’s an address rather than a
  method name.

Try converting it to something more meaningful with addr2line, that
often has more success.

  That's about the sum of it. Any help would be greatly appreciated and if you
  want any more information about our setup, please feel free to ask.

 Reducing shared buffers to around 2gb will probably make the problem go away

That profile doesn't really look like what one of the problems you are
referring to would look like.

Based on the profile I'd guess it's possible that you're seeing problems
with GIN's fastupdate mechanism.
Try ALTER INDEX whatever SET (FASTUPDATE = OFF); VACUUM
whatever's_table for all gin indexes.
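
If there are more than a couple of GIN indexes, the catalogs can
enumerate them so none are missed; a sketch (the index and table names
in the ALTER/VACUUM lines are placeholders for your own schema):

    -- List every GIN index together with its table.
    SELECT c.relname AS index_name, t.relname AS table_name
    FROM pg_class c
    JOIN pg_index i ON i.indexrelid = c.oid
    JOIN pg_class t ON t.oid = i.indrelid
    JOIN pg_am a ON a.oid = c.relam
    WHERE a.amname = 'gin';

    -- Then, for each one found:
    ALTER INDEX some_gin_index SET (FASTUPDATE = OFF);
    VACUUM some_table;  -- flushes the existing pending list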

It's curious that set_config_option is so high in the profile... Any
chance you could recompile postgres with -fno-omit-frame-pointer in
CFLAGS? That would allow you to use perf -g. The performance price of
that usually is below 1% for postgres.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [PERFORM] Intermittent hangs with 9.2

2013-09-11 Thread Merlin Moncure
On Wed, Sep 11, 2013 at 12:17 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-09-11 07:43:35 -0500, Merlin Moncure wrote:
  I've been seeing a strange issue with our Postgres install for about a year
  now, and I was hoping someone might be able to help point me at the cause.
  At what seem like fairly random intervals Postgres will become unresponsive
  to the 3 application nodes it services. These periods tend to last for 10 -
  15 minutes before everything rights itself and the system goes back to
  normal.
 
  During these periods the server will report a spike in the outbound
  bandwidth (from about 1mbs to about 5mbs most recently), a huge spike in
  context switches / interrupts (normal peaks are around 2k/8k respectively,
  and during these periods they’ve gone to 15k/22k), and a load average of
  100+. CPU usage stays relatively low, but it’s all system time reported,
  user time goes to zero. It doesn’t seem to be disk related since we’re
  running with a shared_buffers setting of 24G, which will fit just about our
  entire database into memory, and the IO transactions reported by the
  server, as well as the disk reads reported by Postgres stay consistently low.
 
  We’ve recently started tracking how long statements take to execute, and
  we’re seeing some really odd numbers. A simple delete by primary key, for
  example, from a table that contains about 280,000 rows, reportedly took
  18h59m46.900s. An update by primary key in that same table was reported as
  7d 17h 58m 30.415s. That table is frequently accessed, but obviously those
  numbers don't seem reasonable at all.
 
  Some other changes we've made to postgresql.conf:
 
  synchronous_commit = off
 
  maintenance_work_mem = 1GB
  wal_level = hot_standby
  wal_buffers = 16MB
 
  max_wal_senders = 10
 
  wal_keep_segments = 5000
 
  checkpoint_segments = 128
 
  checkpoint_timeout = 30min
 
  checkpoint_completion_target = 0.9
 
  max_connections = 500
 
  The server is a Dell Poweredge R900 with 4 Xeon E7430 processors, 48GB of
  RAM, running Cent OS 6.3.
 
  So far we’ve tried disabling Transparent Huge Pages after I found a number
  of resources online that indicated similar interrupt/context switch issues,
  but it hasn’t resolved the problem. I managed to catch it happening once and
  run a perf which showed:
 
  +  41.40%  48154  postmaster  0x347ba9  f  0x347ba9
  +   9.55%  10956  postmaster  0x2dc820  f  set_config_option
  +   8.64%   9946  postmaster  0x5a3d4   f  writeListPage
  +   5.75%   6609  postmaster  0x5a2b0   f  ginHeapTupleFastCollect
  +   2.68%   3084  postmaster  0x192483  f  build_implied_join_equality
  +   2.61%   2990  postmaster  0x187a55  f  build_paths_for_OR
  +   1.86%   2131  postmaster  0x794aa   f  get_collation_oid
  +   1.56%   1822  postmaster  0x5a67e   f  ginHeapTupleFastInsert
  +   1.53%   1766  postmaster  0x1929bc  f  distribute_qual_to_rels
  +   1.33%   1558  postmaster  0x249671  f  cmp_numerics
 
  I’m not sure what 0x347ba9 represents, or why it’s an address rather than a
  method name.

 Try converting it to something more meaningful with addr2line, that
 often has more success.

  That's about the sum of it. Any help would be greatly appreciated and if
  you want any more information about our setup, please feel free to ask.

 Reducing shared buffers to around 2gb will probably make the problem go away

 That profile doesn't really look like what one of the problems you are
 referring to would look like.

yup -- I think you're right.

merlin




[PERFORM] Intermittent hangs with 9.2

2013-09-10 Thread David Whittaker
Hi All,

I've been seeing a strange issue with our Postgres install for about a year
now, and I was hoping someone might be able to help point me at the cause.
At what seem like fairly random intervals Postgres will become unresponsive
to the 3 application nodes it services. These periods tend to last for 10 -
15 minutes before everything rights itself and the system goes back to
normal.

During these periods the server will report a spike in the outbound
bandwidth (from about 1mbs to about 5mbs most recently), a huge spike in
context switches / interrupts (normal peaks are around 2k/8k respectively,
and during these periods they’ve gone to 15k/22k), and a load average of
100+. CPU usage stays relatively low, but it’s all system time reported,
user time goes to zero. It doesn’t seem to be disk related since we’re
running with a shared_buffers setting of 24G, which will fit just about our
entire database into memory, and the IO transactions reported by the
server, as well as the disk reads reported by Postgres stay consistently
low.

We’ve recently started tracking how long statements take to execute, and
we’re seeing some really odd numbers. A simple delete by primary key, for
example, from a table that contains about 280,000 rows, reportedly took
18h59m46.900s. An update by primary key in that same table was reported as
7d 17h 58m 30.415s. That table is frequently accessed, but obviously those
numbers don't seem reasonable at all.

Some other changes we've made to postgresql.conf:

synchronous_commit = off

maintenance_work_mem = 1GB
wal_level = hot_standby
wal_buffers = 16MB

max_wal_senders = 10

wal_keep_segments = 5000

checkpoint_segments = 128

checkpoint_timeout = 30min

checkpoint_completion_target = 0.9

max_connections = 500

The server is a Dell Poweredge R900 with 4 Xeon E7430 processors, 48GB of
RAM, running Cent OS 6.3.

So far we’ve tried disabling Transparent Huge Pages after I found a number
of resources online that indicated similar interrupt/context switch issues,
but it hasn’t resolved the problem. I managed to catch it happening once and
run a perf which showed:


+  41.40%  48154  postmaster  0x347ba9  f  0x347ba9
+   9.55%  10956  postmaster  0x2dc820  f  set_config_option
+   8.64%   9946  postmaster  0x5a3d4   f  writeListPage
+   5.75%   6609  postmaster  0x5a2b0   f  ginHeapTupleFastCollect
+   2.68%   3084  postmaster  0x192483  f  build_implied_join_equality
+   2.61%   2990  postmaster  0x187a55  f  build_paths_for_OR
+   1.86%   2131  postmaster  0x794aa   f  get_collation_oid
+   1.56%   1822  postmaster  0x5a67e   f  ginHeapTupleFastInsert
+   1.53%   1766  postmaster  0x1929bc  f  distribute_qual_to_rels
+   1.33%   1558  postmaster  0x249671  f  cmp_numerics

I’m not sure what 0x347ba9 represents, or why it’s an address rather than a
method name.

That's about the sum of it. Any help would be greatly appreciated and if
you want any more information about our setup, please feel free to ask.

Thanks,
Dave


Re: [PERFORM] Intermittent hangs with 9.2

2013-09-10 Thread Andrew Dunstan


On 09/10/2013 11:04 AM, David Whittaker wrote:


Hi All,

I've been seeing a strange issue with our Postgres install for about a 
year now, and I was hoping someone might be able to help point me at 
the cause. At what seem like fairly random intervals Postgres will 
become unresponsive to the 3 application nodes it services. These 
periods tend to last for 10 - 15 minutes before everything rights 
itself and the system goes back to normal.


During these periods the server will report a spike in the outbound 
bandwidth (from about 1mbs to about 5mbs most recently), a huge spike 
in context switches / interrupts (normal peaks are around 2k/8k 
respectively, and during these periods they’ve gone to 15k/22k), and a
load average of 100+. CPU usage stays relatively low, but it’s all
system time reported, user time goes to zero. It doesn’t seem to be
disk related since we’re running with a shared_buffers setting of 24G, 
which will fit just about our entire database into memory, and the IO 
transactions reported by the server, as well as the disk reads 
reported by Postgres stay consistently low.


We’ve recently started tracking how long statements take to execute,
and we’re seeing some really odd numbers. A simple delete by primary 
key, for example, from a table that contains about 280,000 rows, 
reportedly took 18h59m46.900s. An update by primary key in that same 
table was reported as 7d 17h 58m 30.415s. That table is frequently 
accessed, but obviously those numbers don't seem reasonable at all.


Some other changes we've made to postgresql.conf:

synchronous_commit = off

maintenance_work_mem = 1GB
wal_level = hot_standby
wal_buffers = 16MB

max_wal_senders = 10

wal_keep_segments = 5000

checkpoint_segments = 128

checkpoint_timeout = 30min

checkpoint_completion_target = 0.9

max_connections = 500

The server is a Dell Poweredge R900 with 4 Xeon E7430 processors, 48GB 
of RAM, running Cent OS 6.3.


So far we’ve tried disabling Transparent Huge Pages after I found a
number of resources online that indicated similar interrupt/context
switch issues, but it hasn’t resolved the problem. I managed to catch
it happening once and run a perf which showed:


+  41.40%  48154  postmaster  0x347ba9  f  0x347ba9
+   9.55%  10956  postmaster  0x2dc820  f  set_config_option
+   8.64%   9946  postmaster  0x5a3d4   f  writeListPage
+   5.75%   6609  postmaster  0x5a2b0   f  ginHeapTupleFastCollect
+   2.68%   3084  postmaster  0x192483  f  build_implied_join_equality
+   2.61%   2990  postmaster  0x187a55  f  build_paths_for_OR
+   1.86%   2131  postmaster  0x794aa   f  get_collation_oid
+   1.56%   1822  postmaster  0x5a67e   f  ginHeapTupleFastInsert
+   1.53%   1766  postmaster  0x1929bc  f  distribute_qual_to_rels
+   1.33%   1558  postmaster  0x249671  f  cmp_numerics

I’m not sure what 0x347ba9 represents, or why it’s an address rather
than a method name.


That's about the sum of it. Any help would be greatly appreciated and 
if you want any more information about our setup, please feel free to ask.





I have seen cases like this with very high shared_buffers settings.

24Gb for shared_buffers is quite high, especially on a 48Gb box. What 
happens if you dial that back to, say, 12Gb?


cheers

andrew





Re: [PERFORM] Intermittent hangs with 9.2

2013-09-10 Thread k...@rice.edu
On Tue, Sep 10, 2013 at 11:04:21AM -0400, David Whittaker wrote:
 Hi All,
 
 I've been seeing a strange issue with our Postgres install for about a year
 now, and I was hoping someone might be able to help point me at the cause.
 At what seem like fairly random intervals Postgres will become unresponsive
 to the 3 application nodes it services. These periods tend to last for 10 -
 15 minutes before everything rights itself and the system goes back to
 normal.
 
 During these periods the server will report a spike in the outbound
 bandwidth (from about 1mbs to about 5mbs most recently), a huge spike in
 context switches / interrupts (normal peaks are around 2k/8k respectively,
 and during these periods they’ve gone to 15k/22k), and a load average of
 100+. CPU usage stays relatively low, but it’s all system time reported,
 user time goes to zero. It doesn’t seem to be disk related since we’re
 running with a shared_buffers setting of 24G, which will fit just about our
 entire database into memory, and the IO transactions reported by the
 server, as well as the disk reads reported by Postgres stay consistently
 low.
 
 We’ve recently started tracking how long statements take to execute, and
 we’re seeing some really odd numbers. A simple delete by primary key, for
 example, from a table that contains about 280,000 rows, reportedly took
 18h59m46.900s. An update by primary key in that same table was reported as
 7d 17h 58m 30.415s. That table is frequently accessed, but obviously those
 numbers don't seem reasonable at all.
 
 Some other changes we've made to postgresql.conf:
 
 synchronous_commit = off
 
 maintenance_work_mem = 1GB
 wal_level = hot_standby
 wal_buffers = 16MB
 
 max_wal_senders = 10
 
 wal_keep_segments = 5000
 
 checkpoint_segments = 128
 
 checkpoint_timeout = 30min
 
 checkpoint_completion_target = 0.9
 
 max_connections = 500
 
 The server is a Dell Poweredge R900 with 4 Xeon E7430 processors, 48GB of
 RAM, running Cent OS 6.3.
 
 So far we’ve tried disabling Transparent Huge Pages after I found a number
 of resources online that indicated similar interrupt/context switch issues,
 but it hasn’t resolved the problem. I managed to catch it happening once and
 run a perf which showed:
 
 
 +  41.40%  48154  postmaster  0x347ba9  f  0x347ba9
 +   9.55%  10956  postmaster  0x2dc820  f  set_config_option
 +   8.64%   9946  postmaster  0x5a3d4   f  writeListPage
 +   5.75%   6609  postmaster  0x5a2b0   f  ginHeapTupleFastCollect
 +   2.68%   3084  postmaster  0x192483  f  build_implied_join_equality
 +   2.61%   2990  postmaster  0x187a55  f  build_paths_for_OR
 +   1.86%   2131  postmaster  0x794aa   f  get_collation_oid
 +   1.56%   1822  postmaster  0x5a67e   f  ginHeapTupleFastInsert
 +   1.53%   1766  postmaster  0x1929bc  f  distribute_qual_to_rels
 +   1.33%   1558  postmaster  0x249671  f  cmp_numerics
 
 I’m not sure what 0x347ba9 represents, or why it’s an address rather than a
 method name.
 
 That's about the sum of it. Any help would be greatly appreciated and if
 you want any more information about our setup, please feel free to ask.
 
 Thanks,
 Dave

Hi Dave,

A load average of 100+ means that you have that many processes waiting to
run, yet you only have 16 CPUs. You really need to consider using a connection
pooler like pgbouncer to keep your connection count in the 16-32 range.
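
When it next happens, it may also be worth counting backends by state;
a quick sketch (pg_stat_activity's state column is new in 9.2, so this
assumes the server really is on 9.2):

    -- Hundreds of rows here, mostly 'active', with CPU all in system
    -- time would fit the contention picture described above.
    SELECT state, count(*)
    FROM pg_stat_activity
    GROUP BY state
    ORDER BY count(*) DESC;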

Regards,
Ken




Re: [PERFORM] Intermittent hangs with 9.2

2013-09-10 Thread David Whittaker
Hi Andrew,


On Tue, Sep 10, 2013 at 11:26 AM, Andrew Dunstan and...@dunslane.net wrote:


 On 09/10/2013 11:04 AM, David Whittaker wrote:


 Hi All,

 I've been seeing a strange issue with our Postgres install for about a
 year now, and I was hoping someone might be able to help point me at the
 cause. At what seem like fairly random intervals Postgres will become
 unresponsive to the 3 application nodes it services. These periods tend to
 last for 10 - 15 minutes before everything rights itself and the system
 goes back to normal.

 During these periods the server will report a spike in the outbound
 bandwidth (from about 1mbs to about 5mbs most recently), a huge spike in
 context switches / interrupts (normal peaks are around 2k/8k respectively,
  and during these periods they’ve gone to 15k/22k), and a load average of
  100+. CPU usage stays relatively low, but it’s all system time reported,
  user time goes to zero. It doesn’t seem to be disk related since we’re
 running with a shared_buffers setting of 24G, which will fit just about our
 entire database into memory, and the IO transactions reported by the
 server, as well as the disk reads reported by Postgres stay consistently
 low.

  We’ve recently started tracking how long statements take to execute, and
 we’re seeing some really odd numbers. A simple delete by primary key, for
 example, from a table that contains about 280,000 rows, reportedly took
 18h59m46.900s. An update by primary key in that same table was reported as
 7d 17h 58m 30.415s. That table is frequently accessed, but obviously those
 numbers don't seem reasonable at all.

 Some other changes we've made to postgresql.conf:

 synchronous_commit = off

 maintenance_work_mem = 1GB
 wal_level = hot_standby
 wal_buffers = 16MB

 max_wal_senders = 10

 wal_keep_segments = 5000

 checkpoint_segments = 128

 checkpoint_timeout = 30min

 checkpoint_completion_target = 0.9

 max_connections = 500

 The server is a Dell Poweredge R900 with 4 Xeon E7430 processors, 48GB of
 RAM, running Cent OS 6.3.

  So far we’ve tried disabling Transparent Huge Pages after I found a
  number of resources online that indicated similar interrupt/context switch
  issues, but it hasn’t resolved the problem. I managed to catch it happening
 once and run a perf which showed:

  +  41.40%  48154  postmaster  0x347ba9  f  0x347ba9
  +   9.55%  10956  postmaster  0x2dc820  f  set_config_option
  +   8.64%   9946  postmaster  0x5a3d4   f  writeListPage
  +   5.75%   6609  postmaster  0x5a2b0   f  ginHeapTupleFastCollect
  +   2.68%   3084  postmaster  0x192483  f  build_implied_join_equality
  +   2.61%   2990  postmaster  0x187a55  f  build_paths_for_OR
  +   1.86%   2131  postmaster  0x794aa   f  get_collation_oid
  +   1.56%   1822  postmaster  0x5a67e   f  ginHeapTupleFastInsert
  +   1.53%   1766  postmaster  0x1929bc  f  distribute_qual_to_rels
  +   1.33%   1558  postmaster  0x249671  f  cmp_numerics

  I’m not sure what 0x347ba9 represents, or why it’s an address rather than
 a method name.

 That's about the sum of it. Any help would be greatly appreciated and if
 you want any more information about our setup, please feel free to ask.



 I have seen cases like this with very high shared_buffers settings.

 24Gb for shared_buffers is quite high, especially on a 48Gb box. What
 happens if you dial that back to, say, 12Gb?


I'd be willing to give it a try.  I'd really like to understand what's
going on here though.  Can you elaborate on that?  Why would 24G of shared
buffers be too high in this case?  The machine is devoted entirely to PG,
so having PG use half of the available RAM to cache data doesn't feel
unreasonable.



 cheers

 andrew




Re: [PERFORM] Intermittent hangs with 9.2

2013-09-10 Thread Jeff Janes
On Tue, Sep 10, 2013 at 8:04 AM, David Whittaker d...@iradix.com wrote:

 Hi All,

 I've been seeing a strange issue with our Postgres install for about a
 year now, and I was hoping someone might be able to help point me at the
 cause. At what seem like fairly random intervals Postgres will become
 unresponsive to the 3 application nodes it services. These periods tend to
 last for 10 - 15 minutes before everything rights itself and the system
 goes back to normal.

 During these periods the server will report a spike in the outbound
 bandwidth (from about 1mbs to about 5mbs most recently), a huge spike in
 context switches / interrupts (normal peaks are around 2k/8k respectively,
  and during these periods they’ve gone to 15k/22k), and a load average of
 100+.


I'm curious about the spike in outbound network usage.  If the database is
hung and no longer responding to queries, what is getting sent over the
network?  Can you snoop on that traffic?


 CPU usage stays relatively low, but it’s all system time reported, user
  time goes to zero. It doesn’t seem to be disk related since we’re running
 with a shared_buffers setting of 24G, which will fit just about our entire
 database into memory, and the IO transactions reported by the server, as
 well as the disk reads reported by Postgres stay consistently low.

There have been reports that using very large shared_buffers can cause a
lot of contention issues in the kernel, for some kernels. The usual advice
is not to set shared_buffers above 8GB.  The operating system can use the
rest of the memory to cache for you.

Also, using a connection pooler and lowering the number of connections to
the real database has solved problems like this before.


  We’ve recently started tracking how long statements take to execute, and
 we’re seeing some really odd numbers. A simple delete by primary key, for
 example, from a table that contains about 280,000 rows, reportedly took
 18h59m46.900s. An update by primary key in that same table was reported as
 7d 17h 58m 30.415s. That table is frequently accessed, but obviously those
 numbers don't seem reasonable at all.

How are you tracking those?  Is it log_min_duration_statement or something
else?

Cheers,

Jeff


Re: [PERFORM] Intermittent hangs with 9.2

2013-09-10 Thread David Whittaker
Hi Ken,


On Tue, Sep 10, 2013 at 11:33 AM, k...@rice.edu wrote:

 On Tue, Sep 10, 2013 at 11:04:21AM -0400, David Whittaker wrote:
  Hi All,
 
   I've been seeing a strange issue with our Postgres install for about a
   year now, and I was hoping someone might be able to help point me at
   the cause. At what seem like fairly random intervals Postgres will
   become unresponsive to the 3 application nodes it services. These
   periods tend to last for 10 - 15 minutes before everything rights
   itself and the system goes back to normal.
 
   During these periods the server will report a spike in the outbound
   bandwidth (from about 1mbs to about 5mbs most recently), a huge spike in
   context switches / interrupts (normal peaks are around 2k/8k
   respectively, and during these periods they’ve gone to 15k/22k), and a
   load average of 100+. CPU usage stays relatively low, but it’s all
   system time reported, user time goes to zero. It doesn’t seem to be disk
   related since we’re running with a shared_buffers setting of 24G, which
   will fit just about our entire database into memory, and the IO
   transactions reported by the server, as well as the disk reads reported
   by Postgres stay consistently low.
 
   We’ve recently started tracking how long statements take to execute, and
   we’re seeing some really odd numbers. A simple delete by primary key, for
   example, from a table that contains about 280,000 rows, reportedly took
   18h59m46.900s. An update by primary key in that same table was reported
   as 7d 17h 58m 30.415s. That table is frequently accessed, but obviously
   those numbers don't seem reasonable at all.
 
  Some other changes we've made to postgresql.conf:
 
  synchronous_commit = off
 
  maintenance_work_mem = 1GB
  wal_level = hot_standby
  wal_buffers = 16MB
 
  max_wal_senders = 10
 
  wal_keep_segments = 5000
 
  checkpoint_segments = 128
 
  checkpoint_timeout = 30min
 
  checkpoint_completion_target = 0.9
 
  max_connections = 500
 
  The server is a Dell Poweredge R900 with 4 Xeon E7430 processors, 48GB of
  RAM, running Cent OS 6.3.
 
   So far we’ve tried disabling Transparent Huge Pages after I found a
   number of resources online that indicated similar interrupt/context
   switch issues, but it hasn’t resolved the problem. I managed to catch it
   happening once and run a perf which showed:
 
 
   +  41.40%  48154  postmaster  0x347ba9  f  0x347ba9
   +   9.55%  10956  postmaster  0x2dc820  f  set_config_option
   +   8.64%   9946  postmaster  0x5a3d4   f  writeListPage
   +   5.75%   6609  postmaster  0x5a2b0   f  ginHeapTupleFastCollect
   +   2.68%   3084  postmaster  0x192483  f  build_implied_join_equality
   +   2.61%   2990  postmaster  0x187a55  f  build_paths_for_OR
   +   1.86%   2131  postmaster  0x794aa   f  get_collation_oid
   +   1.56%   1822  postmaster  0x5a67e   f  ginHeapTupleFastInsert
   +   1.53%   1766  postmaster  0x1929bc  f  distribute_qual_to_rels
   +   1.33%   1558  postmaster  0x249671  f  cmp_numerics
 
   I’m not sure what 0x347ba9 represents, or why it’s an address rather
   than a method name.
 
  That's about the sum of it. Any help would be greatly appreciated and if
  you want any more information about our setup, please feel free to ask.
 
  Thanks,
  Dave

 Hi Dave,

  A load average of 100+ means that you have that many processes waiting to
  run, yet you only have 16 CPUs. You really need to consider using a
  connection pooler like pgbouncer to keep your connection count in the
  16-32 range.


That would make sense if the issues corresponded to increased load, but
they don't. I understand that the load spike is caused by waiting
processes, but it doesn't seem to correspond to a transaction spike.  The
number of transactions per second appears to stay in line with normal usage
when these issues occur.  I do see an increase in postmaster processes when
it happens, but they don't seem to have entered a transaction yet.  Coupled
with the fact that CPU usage is all system time, and the context switch /
interrupt spikes, I feel like something must be going on behind the scenes
leading to these problems.  I'm just not sure what that something is.


 Regards,
 Ken



Re: [PERFORM] Intermittent hangs with 9.2

2013-09-10 Thread David Whittaker
Hi Jeff,

On Tue, Sep 10, 2013 at 1:44 PM, Jeff Janes jeff.ja...@gmail.com wrote:

 On Tue, Sep 10, 2013 at 8:04 AM, David Whittaker d...@iradix.com wrote:

 Hi All,

 I've been seeing a strange issue with our Postgres install for about a
 year now, and I was hoping someone might be able to help point me at the
 cause. At what seem like fairly random intervals Postgres will become
 unresponsive to the 3 application nodes it services. These periods tend to
 last for 10 - 15 minutes before everything rights itself and the system
 goes back to normal.

 During these periods the server will report a spike in the outbound
 bandwidth (from about 1mbs to about 5mbs most recently), a huge spike in
 context switches / interrupts (normal peaks are around 2k/8k respectively,
  and during these periods they’ve gone to 15k/22k), and a load average of
 100+.


  I'm curious about the spike in outbound network usage.  If the database is
  hung and no longer responding to queries, what is getting sent over the
  network?  Can you snoop on that traffic?



It seems curious to me as well.  I don't know if one or a few of the pg
connections are streaming out data and somehow blocking the others in the
process, or if the data could be unrelated to pg.  If you can suggest a tool
I could use to monitor the data transfer continuously and get some type of
summary of what happened after the issue reoccurs, I'd appreciate it.
Otherwise, I'll try to get in and catch some specifics the next time it
happens.




 CPU usage stays relatively low, but it’s all system time reported, user
  time goes to zero. It doesn’t seem to be disk related since we’re running
 with a shared_buffers setting of 24G, which will fit just about our entire
 database into memory, and the IO transactions reported by the server, as
 well as the disk reads reported by Postgres stay consistently low.

 There have been reports that using very large shared_buffers can cause a
 lot of contention issues in the kernel, for some kernels. The usual advice
 is not to set shared_buffers above 8GB.  The operating system can use the
 rest of the memory to cache for you.

 Also, using a connection pooler and lowering the number of connections to
 the real database has solved problems like this before.


We're going to implement both of these changes tonight.  I was going to go
with 12G for shared_buffers based on Andrew's suggestion, but maybe I'll go
down to 8 if that seems to be the magic number.  We're also going to
decrease the max connections from 500 to 100 and decrease the pooled
connections per server.
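
Since shared_buffers only changes with a full restart, it's worth
verifying the new values afterwards; for example (the setting names are
standard GUCs, and the expected values are just the ones discussed above):

    SELECT name, setting, unit
    FROM pg_settings
    WHERE name IN ('shared_buffers', 'max_connections',
                   'effective_cache_size');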




   We’ve recently started tracking how long statements take to execute,
 and we’re seeing some really odd numbers. A simple delete by primary key,
 for example, from a table that contains about 280,000 rows, reportedly took
 18h59m46.900s. An update by primary key in that same table was reported as
 7d 17h 58m 30.415s. That table is frequently accessed, but obviously those
 numbers don't seem reasonable at all.

  How are you tracking those?  Is it log_min_duration_statement or
 something else?


We're using log_min_duration_statement = 1000, sending the log messages to
syslog, then analyzing with pg_badger.
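
log_min_duration_statement only captures statements that cross the
threshold; for cumulative timings the pg_stat_statements module is
another option; a sketch, assuming it has been added to
shared_preload_libraries (a setup step not shown in this thread):

    CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

    -- Top statements by total time; total_time is in milliseconds.
    SELECT calls,
           round(total_time::numeric, 1) AS total_ms,
           round((total_time / calls)::numeric, 1) AS mean_ms,
           query
    FROM pg_stat_statements
    ORDER BY total_time DESC
    LIMIT 10;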



 Cheers,

 Jeff