Re: [PERFORM] Performance bug in prepared statement binding in 9.2?

2013-09-10 Thread Bruce Momjian
On Tue, Sep 10, 2013 at 05:43:16AM +, Mel Llaguno wrote:
 Bruce,

 First of all, I'd like to thank you for taking some interest in this
 issue. We'd love to migrate to the latest PG version, but this issue
 is currently preventing us from doing so.

 Regardless of the DB used (base application schema _or_ restored DB
 with additional app data + base application schema), seed information
 is present in all tests. I guess my question is this - why would
 having existing data change the bind behavior at all? Is it possible
 that the way indexes are created has changed between 8.4 - 9.x?

I don't know, as you have not shown us exactly what is slower between
versions --- you have only said that the bug appears or not in certain
circumstances.

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB     http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [PERFORM] Performance bug in prepared statement binding in 9.2?

2013-09-10 Thread Andres Freund
On 2013-09-09 20:38:09 -0400, Andrew Dunstan wrote:
 
 On 08/01/2013 03:20 PM, Jeff Janes wrote:
 On Thu, Aug 1, 2013 at 10:58 AM, Josh Berkus j...@agliodbs.com wrote:
 Amit, All:
 
 So we just retested this on 9.3b2.  The performance is the same as 9.1
 and 9.2; that is, progressively worse as the test cycles go on, and
 unacceptably slow compared to 8.4.
 
 Some issue introduced in 9.1 is causing BINDs to get progressively
 slower as the PARSEs BINDs get run repeatedly.  Per earlier on this
 thread, that can bloat to 200X time required for a BIND, and it's
 definitely PostgreSQL-side.
 
 I'm trying to produce a test case which doesn't involve the user's
 application.  However, hints on other things to analyze would be keen.
 Does it seem to be all CPU time (it is hard to imagine what else it
 would be, but...)
 
 Could you use oprofile or perf or gprof to get a profile of the
 backend during a run?  That should quickly narrow it down to which C
 function has the problem.
 
 Did you test 9.0 as well?
 
 
 This has been tested back to 9.0. What we have found is that the problem
 disappears if the database has come in via dump/restore, but is present if
 it is the result of pg_upgrade. There are some long-running transactions
 also running alongside this - we are currently planning a test where those
 are not present. We're also looking at constructing a self-contained test
 case.
 
 Here is some perf output from the bad case:
 
+  14.67%  postgres   [.] heap_hot_search_buffer
+  11.45%  postgres   [.] LWLockAcquire
+   8.39%  postgres   [.] LWLockRelease
+   6.60%  postgres   [.] _bt_checkkeys
+   6.39%  postgres   [.] PinBuffer
+   5.96%  postgres   [.] hash_search_with_hash_value
+   5.43%  postgres   [.] hash_any
+   5.14%  postgres   [.] UnpinBuffer
+   3.43%  postgres   [.] ReadBuffer_common
+   2.34%  postgres   [.] index_fetch_heap
+   2.04%  postgres   [.] heap_page_prune_opt
+   2.00%  libc-2.15.so   [.] 0x8041b
+   1.94%  postgres   [.] _bt_next
+   1.83%  postgres   [.] btgettuple
+   1.76%  postgres   [.] index_getnext_tid
+   1.70%  postgres   [.] LockBuffer
+   1.54%  postgres   [.] ReadBufferExtended
+   1.25%  postgres   [.] FunctionCall2Coll
+   1.14%  postgres   [.] HeapTupleSatisfiesNow
+   1.09%  postgres   [.] ReleaseAndReadBuffer
+   0.94%  postgres   [.] ResourceOwnerForgetBuffer
+   0.81%  postgres   [.] _bt_saveitem
+   0.80%  postgres   [.] _bt_readpage
+   0.79%  [kernel.kallsyms]  [k] 0x81170861
+   0.64%  postgres   [.] CheckForSerializableConflictOut
+   0.60%  postgres   [.] ResourceOwnerEnlargeBuffers
+   0.59%  postgres   [.] BufTableLookup

After a second look at this, I very tentatively guess that you'll see
get_actual_variable_range() as the entry point here, which would explain
why you're seeing this during PARSE.

But there is still the question of why we never actually seem to prune...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [PERFORM] Performance bug in prepared statement binding in 9.2?

2013-09-10 Thread Andres Freund
Hi,

On 2013-09-09 20:38:09 -0400, Andrew Dunstan wrote:
 
 On 08/01/2013 03:20 PM, Jeff Janes wrote:
 On Thu, Aug 1, 2013 at 10:58 AM, Josh Berkus j...@agliodbs.com wrote:
 Amit, All:
 
 So we just retested this on 9.3b2.  The performance is the same as 9.1
 and 9.2; that is, progressively worse as the test cycles go on, and
 unacceptably slow compared to 8.4.
 
 Some issue introduced in 9.1 is causing BINDs to get progressively
 slower as the PARSEs BINDs get run repeatedly.  Per earlier on this
 thread, that can bloat to 200X time required for a BIND, and it's
 definitely PostgreSQL-side.
 
 I'm trying to produce a test case which doesn't involve the user's
 application.  However, hints on other things to analyze would be keen.
 Does it seem to be all CPU time (it is hard to imagine what else it
 would be, but...)
 
 Could you use oprofile or perf or gprof to get a profile of the
 backend during a run?  That should quickly narrow it down to which C
 function has the problem.
 
 Did you test 9.0 as well?
 
 
 This has been tested back to 9.0. What we have found is that the problem
 disappears if the database has come in via dump/restore, but is present if
 it is the result of pg_upgrade. There are some long-running transactions
 also running alongside this - we are currently planning a test where those
 are not present. We're also looking at constructing a self-contained test
 case.
 
 Here is some perf output from the bad case:
 
+  14.67%  postgres   [.] heap_hot_search_buffer
+  11.45%  postgres   [.] LWLockAcquire
+   8.39%  postgres   [.] LWLockRelease
+   6.60%  postgres   [.] _bt_checkkeys
+   6.39%  postgres   [.] PinBuffer
+   5.96%  postgres   [.] hash_search_with_hash_value
+   5.43%  postgres   [.] hash_any
+   5.14%  postgres   [.] UnpinBuffer
+   3.43%  postgres   [.] ReadBuffer_common
+   2.34%  postgres   [.] index_fetch_heap
+   2.04%  postgres   [.] heap_page_prune_opt

A backtrace for this would be useful. Alternatively you could recompile
postgres using -fno-omit-frame-pointer in CFLAGS and use perf record -g.
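
A rough sketch of that, for reference (the install prefix and the backend
PID below are placeholders, and it assumes building from source):

# rebuild with frame pointers so perf can walk the stack
./configure --prefix=/usr/local/pgsql-debug CFLAGS="-O2 -fno-omit-frame-pointer"
make && make install

# profile one backend while the slow BINDs are running
perf record -g -p 12345 -- sleep 60
perf report -g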

Any chance you have older prepared xacts, older sessions or something
like that around? I'd expect heap_prune* to be present in workloads that
spend significant time in heap_hot_search_buffer...
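
That sort of thing can be checked with queries along these lines (a
sketch; catalog and column names as of 9.2):

-- leftover prepared transactions
SELECT gid, prepared, owner, database
FROM pg_prepared_xacts
ORDER BY prepared;

-- sessions whose transactions have been open for a while
SELECT pid, usename, state, xact_start, query
FROM pg_stat_activity
WHERE xact_start < now() - interval '5 minutes'
ORDER BY xact_start;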

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [PERFORM] Performance bug in prepared statement binding in 9.2?

2013-09-10 Thread Andrew Dunstan


On 09/10/2013 08:20 AM, Andres Freund wrote:


A backtrace for this would be useful. Alternatively you could recompile
postgres using -fno-omit-frame-pointer in CFLAGS and use perf record -g.


It's using a custom build, so this should be doable.



Any chance you have older prepared xacts, older sessions or something
like that around? I'd expect heap_prune* to be present in workloads that
spend significant time in heap_hot_search_buffer...



Not sure about prepared transactions. There are almost certainly old 
prepared statements around, and long-running transactions alongside this 
one.


cheers

andrew




Re: [PERFORM] Performance bug in prepared statement binding in 9.2?

2013-09-10 Thread Andres Freund
On 2013-09-10 08:45:33 -0400, Andrew Dunstan wrote:
 
 On 09/10/2013 08:20 AM, Andres Freund wrote:
 
 A backtrace for this would be useful. Alternatively you could recompile
 postgres using -fno-omit-frame-pointer in CFLAGS and use perf record -g.
 
 It's using a custom build, so this should be doable.

Great.

 Any chance you have older prepared xacts, older sessions or something
 like that around? I'd expect heap_prune* to be present in workloads that
 spend significant time in heap_hot_search_buffer...
 
 
 Not sure about prepared transactions. There are certainly probably old
 prepared statements around, and long running transactions alongside this
 one.

Ok, long running transactions will do the trick. I quickly checked, and
doing an index lookup for min/max histogram lookups was added *after*
8.4, which would explain why you're not seeing the issue there
(cf. 40608e7f949fb7e4025c0ddd5be01939adc79eec).

Its getting slower and slower during a test run would be explained by
additional tuple versions amassing that cannot be marked dead because of
the older transactions around. I guess those are also part of the test?
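
That effect can be watched during a run with something like the
following (a sketch; the table name is a placeholder):

-- dead tuple versions piling up on the hot table
SELECT relname, n_live_tup, n_dead_tup, n_tup_hot_upd, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'your_hot_table';

-- oldest open transactions holding back cleanup
SELECT pid, xact_start, now() - xact_start AS xact_age, state
FROM pg_stat_activity
ORDER BY xact_start
LIMIT 5;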

If I interpret things correctly you're using serializable? I guess there
is no chance to use repeatable read instead?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [PERFORM] Performance bug in prepared statement binding in 9.2?

2013-09-10 Thread Andres Freund
On 2013-09-10 15:21:33 +0200, Andres Freund wrote:
 If I interpret things correctly you're using serializable? I guess there
 is no chance to use repeatable read instead?

Err, that wouldn't help much. Read committed, however, would: that lets
PGXACT->xmin advance these days and thus might help to reduce the impact
of the long-running transactions.
Otherwise you will have to shorten those...
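
For reference, switching a client over looks roughly like this (a sketch;
it can also be set cluster-wide via default_transaction_isolation in
postgresql.conf):

SET SESSION CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL READ COMMITTED;

-- or per transaction:
BEGIN ISOLATION LEVEL READ COMMITTED;
-- ... work ...
COMMIT;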

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [PERFORM] Performance bug in prepared statement binding in 9.2?

2013-09-10 Thread Andrew Dunstan


On 09/10/2013 09:23 AM, Andres Freund wrote:

On 2013-09-10 15:21:33 +0200, Andres Freund wrote:

If I interpret things correctly you're using serializable? I guess there
is no chance to use repeatable read instead?

Err, that wouldn't help much. Read committed, however, would: that lets
PGXACT->xmin advance these days and thus might help to reduce the impact
of the long-running transactions.
Otherwise you will have to shorten those...




Yeah, we're looking at eliminating them.

cheers

andrew





[PERFORM] Intermittent hangs with 9.2

2013-09-10 Thread David Whittaker
Hi All,

I've been seeing a strange issue with our Postgres install for about a year
now, and I was hoping someone might be able to help point me at the cause.
At what seem like fairly random intervals Postgres will become unresponsive
to the 3 application nodes it services. These periods tend to last for 10 -
15 minutes before everything rights itself and the system goes back to
normal.

During these periods the server will report a spike in the outbound
bandwidth (from about 1mbs to about 5mbs most recently), a huge spike in
context switches / interrupts (normal peaks are around 2k/8k respectively,
and during these periods they've gone to 15k/22k), and a load average of
100+. CPU usage stays relatively low, but it's all system time reported,
user time goes to zero. It doesn't seem to be disk related since we're
running with a shared_buffers setting of 24G, which will fit just about our
entire database into memory, and the IO transactions reported by the
server, as well as the disk reads reported by Postgres stay consistently
low.

We've recently started tracking how long statements take to execute, and
we're seeing some really odd numbers. A simple delete by primary key, for
example, from a table that contains about 280,000 rows, reportedly took
18h59m46.900s. An update by primary key in that same table was reported as
7d 17h 58m 30.415s. That table is frequently accessed, but obviously those
numbers don't seem reasonable at all.

Some other changes we've made to postgresql.conf:

synchronous_commit = off

maintenance_work_mem = 1GB
wal_level = hot_standby
wal_buffers = 16MB

max_wal_senders = 10

wal_keep_segments = 5000

checkpoint_segments = 128

checkpoint_timeout = 30min

checkpoint_completion_target = 0.9

max_connections = 500

The server is a Dell Poweredge R900 with 4 Xeon E7430 processors, 48GB of
RAM, running Cent OS 6.3.

So far we've tried disabling Transparent Huge Pages after I found a number
of resources online that indicated similar interrupt/context switch issues,
but it hasn't resolved the problem. I managed to catch it happening once and
run a perf which showed:


+  41.40%   48154  postmaster  0x347ba9 f 0x347ba9
+   9.55%   10956  postmaster  0x2dc820 f set_config_option
+   8.64%    9946  postmaster  0x5a3d4  f writeListPage
+   5.75%    6609  postmaster  0x5a2b0  f ginHeapTupleFastCollect
+   2.68%    3084  postmaster  0x192483 f build_implied_join_equality
+   2.61%    2990  postmaster  0x187a55 f build_paths_for_OR
+   1.86%    2131  postmaster  0x794aa  f get_collation_oid
+   1.56%    1822  postmaster  0x5a67e  f ginHeapTupleFastInsert
+   1.53%    1766  postmaster  0x1929bc f distribute_qual_to_rels
+   1.33%    1558  postmaster  0x249671 f cmp_numerics

I'm not sure what 0x347ba9 represents, or why it's an address rather than a
method name.

That's about the sum of it. Any help would be greatly appreciated and if
you want any more information about our setup, please feel free to ask.

Thanks,
Dave


Re: [PERFORM] Intermittent hangs with 9.2

2013-09-10 Thread Andrew Dunstan


On 09/10/2013 11:04 AM, David Whittaker wrote:


Hi All,

I've been seeing a strange issue with our Postgres install for about a 
year now, and I was hoping someone might be able to help point me at 
the cause. At what seem like fairly random intervals Postgres will 
become unresponsive to the 3 application nodes it services. These 
periods tend to last for 10 - 15 minutes before everything rights 
itself and the system goes back to normal.


During these periods the server will report a spike in the outbound 
bandwidth (from about 1mbs to about 5mbs most recently), a huge spike 
in context switches / interrupts (normal peaks are around 2k/8k 
respectively, and during these periods they've gone to 15k/22k), and a 
load average of 100+. CPU usage stays relatively low, but it's all 
system time reported, user time goes to zero. It doesn't seem to be 
disk related since we're running with a shared_buffers setting of 24G, 
which will fit just about our entire database into memory, and the IO 
transactions reported by the server, as well as the disk reads 
reported by Postgres stay consistently low.


We've recently started tracking how long statements take to execute, 
and we're seeing some really odd numbers. A simple delete by primary 
key, for example, from a table that contains about 280,000 rows, 
reportedly took 18h59m46.900s. An update by primary key in that same 
table was reported as 7d 17h 58m 30.415s. That table is frequently 
accessed, but obviously those numbers don't seem reasonable at all.


Some other changes we've made to postgresql.conf:

synchronous_commit = off

maintenance_work_mem = 1GB
wal_level = hot_standby
wal_buffers = 16MB

max_wal_senders = 10

wal_keep_segments = 5000

checkpoint_segments = 128

checkpoint_timeout = 30min

checkpoint_completion_target = 0.9

max_connections = 500

The server is a Dell Poweredge R900 with 4 Xeon E7430 processors, 48GB 
of RAM, running Cent OS 6.3.


So far we've tried disabling Transparent Huge Pages after I found a 
number of resources online that indicated similar interrupt/context 
switch issues, but it hasn't resolved the problem. I managed to catch 
it happening once and run a perf which showed:


+  41.40%   48154  postmaster  0x347ba9 f 0x347ba9
+   9.55%   10956  postmaster  0x2dc820 f set_config_option
+   8.64%9946  postmaster  0x5a3d4  f writeListPage
+   5.75%6609  postmaster  0x5a2b0  f ginHeapTupleFastCollect
+   2.68%3084  postmaster  0x192483 f 
build_implied_join_equality
+   2.61%2990  postmaster  0x187a55 f build_paths_for_OR
+   1.86%2131  postmaster  0x794aa  f get_collation_oid
+   1.56%1822  postmaster  0x5a67e  f ginHeapTupleFastInsert
+   1.53%1766  postmaster  0x1929bc f distribute_qual_to_rels
+   1.33%    1558  postmaster  0x249671 f cmp_numerics

I'm not sure what 0x347ba9 represents, or why it's an address rather 
than a method name.


That's about the sum of it. Any help would be greatly appreciated and 
if you want any more information about our setup, please feel free to ask.





I have seen cases like this with very high shared_buffers settings.

24Gb for shared_buffers is quite high, especially on a 48Gb box. What 
happens if you dial that back to, say, 12Gb?


cheers

andrew





Re: [PERFORM] Intermittent hangs with 9.2

2013-09-10 Thread k...@rice.edu
On Tue, Sep 10, 2013 at 11:04:21AM -0400, David Whittaker wrote:
 Hi All,
 
 I've been seeing a strange issue with our Postgres install for about a year
 now, and I was hoping someone might be able to help point me at the cause.
 At what seem like fairly random intervals Postgres will become unresponsive
 to the 3 application nodes it services. These periods tend to last for 10 -
 15 minutes before everything rights itself and the system goes back to
 normal.
 
 During these periods the server will report a spike in the outbound
 bandwidth (from about 1mbs to about 5mbs most recently), a huge spike in
 context switches / interrupts (normal peaks are around 2k/8k respectively,
 and during these periods they've gone to 15k/22k), and a load average of
 100+. CPU usage stays relatively low, but it's all system time reported,
 user time goes to zero. It doesn't seem to be disk related since we're
 running with a shared_buffers setting of 24G, which will fit just about our
 entire database into memory, and the IO transactions reported by the
 server, as well as the disk reads reported by Postgres stay consistently
 low.
 
 We've recently started tracking how long statements take to execute, and
 we're seeing some really odd numbers. A simple delete by primary key, for
 example, from a table that contains about 280,000 rows, reportedly took
 18h59m46.900s. An update by primary key in that same table was reported as
 7d 17h 58m 30.415s. That table is frequently accessed, but obviously those
 numbers don't seem reasonable at all.
 
 Some other changes we've made to postgresql.conf:
 
 synchronous_commit = off
 
 maintenance_work_mem = 1GB
 wal_level = hot_standby
 wal_buffers = 16MB
 
 max_wal_senders = 10
 
 wal_keep_segments = 5000
 
 checkpoint_segments = 128
 
 checkpoint_timeout = 30min
 
 checkpoint_completion_target = 0.9
 
 max_connections = 500
 
 The server is a Dell Poweredge R900 with 4 Xeon E7430 processors, 48GB of
 RAM, running Cent OS 6.3.
 
 So far we've tried disabling Transparent Huge Pages after I found a number
 of resources online that indicated similar interrupt/context switch issues,
 but it hasn't resolved the problem. I managed to catch it happening once and
 run a perf which showed:
 
 
 +  41.40%   48154  postmaster  0x347ba9 f 0x347ba9
 +   9.55%   10956  postmaster  0x2dc820 f
 set_config_option
 +   8.64%9946  postmaster  0x5a3d4  f writeListPage
 +   5.75%6609  postmaster  0x5a2b0  f
 ginHeapTupleFastCollect
 +   2.68%3084  postmaster  0x192483 f
 build_implied_join_equality
 +   2.61%2990  postmaster  0x187a55 f
 build_paths_for_OR
 +   1.86%2131  postmaster  0x794aa  f
 get_collation_oid
 +   1.56%1822  postmaster  0x5a67e  f
 ginHeapTupleFastInsert
 +   1.53%1766  postmaster  0x1929bc f
 distribute_qual_to_rels
 +   1.33%1558  postmaster  0x249671 f cmp_numerics
 
 I'm not sure what 0x347ba9 represents, or why it's an address rather than a
 method name.
 
 That's about the sum of it. Any help would be greatly appreciated and if
 you want any more information about our setup, please feel free to ask.
 
 Thanks,
 Dave

Hi Dave,

A load average of 100+ means that you have that many processes waiting to
run, yet you only have 16 CPUs. You really need to consider using a connection
pooler like pgbouncer to keep your connection count in the 16-32 range.
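
A minimal pgbouncer configuration for that kind of fan-in might look
something like the following (database name, paths and pool sizes are
placeholders, not tuned recommendations):

[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = *
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 500
default_pool_size = 24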

Regards,
Ken




[PERFORM] Intermittent hangs with 9.2

2013-09-10 Thread David Whittaker
Hi Andrew,


On Tue, Sep 10, 2013 at 11:26 AM, Andrew Dunstan and...@dunslane.netwrote:


 On 09/10/2013 11:04 AM, David Whittaker wrote:


 Hi All,

 I've been seeing a strange issue with our Postgres install for about a
 year now, and I was hoping someone might be able to help point me at the
 cause. At what seem like fairly random intervals Postgres will become
 unresponsive to the 3 application nodes it services. These periods tend to
 last for 10 - 15 minutes before everything rights itself and the system
 goes back to normal.

 During these periods the server will report a spike in the outbound
 bandwidth (from about 1mbs to about 5mbs most recently), a huge spike in
 context switches / interrupts (normal peaks are around 2k/8k respectively,
 and during these periods they've gone to 15k/22k), and a load average of
 100+. CPU usage stays relatively low, but it's all system time reported,
 user time goes to zero. It doesn't seem to be disk related since we're
 running with a shared_buffers setting of 24G, which will fit just about our
 entire database into memory, and the IO transactions reported by the
 server, as well as the disk reads reported by Postgres stay consistently
 low.

 We've recently started tracking how long statements take to execute, and
 we're seeing some really odd numbers. A simple delete by primary key, for
 example, from a table that contains about 280,000 rows, reportedly took
 18h59m46.900s. An update by primary key in that same table was reported as
 7d 17h 58m 30.415s. That table is frequently accessed, but obviously those
 numbers don't seem reasonable at all.

 Some other changes we've made to postgresql.conf:

 synchronous_commit = off

 maintenance_work_mem = 1GB
 wal_level = hot_standby
 wal_buffers = 16MB

 max_wal_senders = 10

 wal_keep_segments = 5000

 checkpoint_segments = 128

 checkpoint_timeout = 30min

 checkpoint_completion_target = 0.9

 max_connections = 500

 The server is a Dell Poweredge R900 with 4 Xeon E7430 processors, 48GB of
 RAM, running Cent OS 6.3.

 So far we've tried disabling Transparent Huge Pages after I found a
 number of resources online that indicated similar interrupt/context switch
 issues, but it hasn't resolved the problem. I managed to catch it happening
 once and run a perf which showed:

 +  41.40%   48154  postmaster  0x347ba9 f 0x347ba9
 +   9.55%   10956  postmaster  0x2dc820 f set_config_option
 +   8.64%9946  postmaster  0x5a3d4  f writeListPage
 +   5.75%6609  postmaster  0x5a2b0  f
 ginHeapTupleFastCollect
 +   2.68%3084  postmaster  0x192483 f
 build_implied_join_equality
 +   2.61%2990  postmaster  0x187a55 f build_paths_for_OR
 +   1.86%2131  postmaster  0x794aa  f get_collation_oid
 +   1.56%1822  postmaster  0x5a67e  f
 ginHeapTupleFastInsert
 +   1.53%1766  postmaster  0x1929bc f
 distribute_qual_to_rels
 +   1.33%    1558  postmaster  0x249671 f cmp_numerics

 I'm not sure what 0x347ba9 represents, or why it's an address rather than
 a method name.

 That's about the sum of it. Any help would be greatly appreciated and if
 you want any more information about our setup, please feel free to ask.



 I have seen cases like this with very high shared_buffers settings.

 24Gb for shared_buffers is quite high, especially on a 48Gb box. What
 happens if you dial that back to, say, 12Gb?


I'd be willing to give it a try.  I'd really like to understand what's
going on here though.  Can you elaborate on that?  Why would 24G of shared
buffers be too high in this case?  The machine is devoted entirely to PG,
so having PG use half of the available RAM to cache data doesn't feel
unreasonable.



 cheers

 andrew




Re: [PERFORM] Intermittent hangs with 9.2

2013-09-10 Thread Jeff Janes
On Tue, Sep 10, 2013 at 8:04 AM, David Whittaker d...@iradix.com wrote:

 Hi All,

 I've been seeing a strange issue with our Postgres install for about a
 year now, and I was hoping someone might be able to help point me at the
 cause. At what seem like fairly random intervals Postgres will become
 unresponsive to the 3 application nodes it services. These periods tend to
 last for 10 - 15 minutes before everything rights itself and the system
 goes back to normal.

 During these periods the server will report a spike in the outbound
 bandwidth (from about 1mbs to about 5mbs most recently), a huge spike in
 context switches / interrupts (normal peaks are around 2k/8k respectively,
 and during these periods they've gone to 15k/22k), and a load average of
 100+.


I'm curious about the spike in outbound network usage.  If the database is
hung and no longer responding to queries, what is being sent over the
network?  Can you snoop on that traffic?


 CPU usage stays relatively low, but it's all system time reported, user
 time goes to zero. It doesn't seem to be disk related since we're running
 with a shared_buffers setting of 24G, which will fit just about our entire
 database into memory, and the IO transactions reported by the server, as
 well as the disk reads reported by Postgres stay consistently low.

There have been reports that using very large shared_buffers can cause a
lot of contention issues in some kernels. The usual advice is not to set
shared_buffers above 8GB.  The operating system can use the rest of the
memory to cache for you.

Also, using a connection pooler and lowering the number of connections to
the real database has solved problems like this before.
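
If it helps, the settings being suggested in this thread would look
roughly like this in postgresql.conf (values are illustrative for a 48GB
box, not tested recommendations):

shared_buffers = 8GB            # down from 24GB; leave the rest to the OS page cache
effective_cache_size = 36GB     # planner hint: shared_buffers plus expected OS cache
max_connections = 100           # with a pooler such as pgbouncer in front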


 We‘ve recently started tracking how long statements take to execute, and
 we’re seeing some really odd numbers. A simple delete by primary key, for
 example, from a table that contains about 280,000 rows, reportedly took
 18h59m46.900s. An update by primary key in that same table was reported as
 7d 17h 58m 30.415s. That table is frequently accessed, but obviously those
 numbers don't seem reasonable at all.

How are you tracking those?  Is it log_min_duration_statement or something
else?

Cheers,

Jeff


[PERFORM] Intermittent hangs with 9.2

2013-09-10 Thread David Whittaker
Hi Ken,


On Tue, Sep 10, 2013 at 11:33 AM, k...@rice.edu k...@rice.edu wrote:

 On Tue, Sep 10, 2013 at 11:04:21AM -0400, David Whittaker wrote:
  Hi All,
 
  I've been seeing a strange issue with our Postgres install for about a
 year
  now, and I was hoping someone might be able to help point me at the
 cause.
  At what seem like fairly random intervals Postgres will become
 unresponsive
  to the 3 application nodes it services. These periods tend to last for
 10 -
  15 minutes before everything rights itself and the system goes back to
  normal.
 
  During these periods the server will report a spike in the outbound
  bandwidth (from about 1mbs to about 5mbs most recently), a huge spike in
  context switches / interrupts (normal peaks are around 2k/8k
 respectively,
  and during these periods they've gone to 15k/22k), and a load average of
  100+. CPU usage stays relatively low, but it's all system time reported,
  user time goes to zero. It doesn't seem to be disk related since we're
  running with a shared_buffers setting of 24G, which will fit just about
 our
  entire database into memory, and the IO transactions reported by the
  server, as well as the disk reads reported by Postgres stay consistently
  low.
 
  We've recently started tracking how long statements take to execute, and
  we're seeing some really odd numbers. A simple delete by primary key, for
  example, from a table that contains about 280,000 rows, reportedly took
  18h59m46.900s. An update by primary key in that same table was reported
 as
  7d 17h 58m 30.415s. That table is frequently accessed, but obviously
 those
  numbers don't seem reasonable at all.
 
  Some other changes we've made to postgresql.conf:
 
  synchronous_commit = off
 
  maintenance_work_mem = 1GB
  wal_level = hot_standby
  wal_buffers = 16MB
 
  max_wal_senders = 10
 
  wal_keep_segments = 5000
 
  checkpoint_segments = 128
 
  checkpoint_timeout = 30min
 
  checkpoint_completion_target = 0.9
 
  max_connections = 500
 
  The server is a Dell Poweredge R900 with 4 Xeon E7430 processors, 48GB of
  RAM, running Cent OS 6.3.
 
  So far we've tried disabling Transparent Huge Pages after I found a
 number
  of resources online that indicated similar interrupt/context switch
 issues,
  but it hasn't resolved the problem. I managed to catch it happening once
 and
  run a perf which showed:
 
 
  +  41.40%   48154  postmaster  0x347ba9 f 0x347ba9
  +   9.55%   10956  postmaster  0x2dc820 f
  set_config_option
  +   8.64%9946  postmaster  0x5a3d4  f writeListPage
  +   5.75%6609  postmaster  0x5a2b0  f
  ginHeapTupleFastCollect
  +   2.68%3084  postmaster  0x192483 f
  build_implied_join_equality
  +   2.61%2990  postmaster  0x187a55 f
  build_paths_for_OR
  +   1.86%2131  postmaster  0x794aa  f
  get_collation_oid
  +   1.56%1822  postmaster  0x5a67e  f
  ginHeapTupleFastInsert
  +   1.53%1766  postmaster  0x1929bc f
  distribute_qual_to_rels
  +   1.33%1558  postmaster  0x249671 f cmp_numerics
 
  I'm not sure what 0x347ba9 represents, or why it's an address rather
 than a
  method name.
 
  That's about the sum of it. Any help would be greatly appreciated and if
  you want any more information about our setup, please feel free to ask.
 
  Thanks,
  Dave

 Hi Dave,

 A load average of 100+ means that you have that many processes waiting to
 run yet you only have 16 cpus. You really need to consider using a
 connection
 pooler like pgbouncer to keep your connection count in the 16-32 range.


That would make sense if the issues corresponded to increased load, but
they don't. I understand that the load spike is caused by waiting
processes, but it doesn't seem to correspond to a transaction spike.  The
number of transactions per second appears to stay in line with normal usage
when these issues occur.  I do see an increase in postmaster processes when
it happens, but they don't seem to have entered a transaction yet.  Coupled
with the fact that CPU usage is all system time, and the context switch /
interrupt spikes, I feel like something must be going on behind the scenes
leading to these problems.  I'm just not sure what that something is.


 Regards,
 Ken



[PERFORM] Intermittent hangs with 9.2

2013-09-10 Thread David Whittaker
Hi Jeff

On Tue, Sep 10, 2013 at 1:44 PM, Jeff Janes jeff.ja...@gmail.com wrote:

 On Tue, Sep 10, 2013 at 8:04 AM, David Whittaker d...@iradix.com wrote:

 Hi All,

 I've been seeing a strange issue with our Postgres install for about a
 year now, and I was hoping someone might be able to help point me at the
 cause. At what seem like fairly random intervals Postgres will become
 unresponsive to the 3 application nodes it services. These periods tend to
 last for 10 - 15 minutes before everything rights itself and the system
 goes back to normal.

 During these periods the server will report a spike in the outbound
 bandwidth (from about 1mbs to about 5mbs most recently), a huge spike in
 context switches / interrupts (normal peaks are around 2k/8k respectively,
 and during these periods they've gone to 15k/22k), and a load average of
 100+.


 I'm curious about the spike in outbound network usage.  If the database is
 hung and no longer responding to queries, what is being sent over the
 network?  Can you snoop on that traffic?



It seems curious to me as well.  I don't know if one or a few of the pg
connections are streaming out data and somehow blocking the others in the
process, or whether the data is unrelated to pg.  If you can suggest a tool I
could use to monitor the data transfer continuously and get some type of
summary of what happened after the issue recurs, I'd appreciate it.
Otherwise, I'll try to get in and catch some specifics the next time it
happens.




 CPU usage stays relatively low, but it's all system time reported, user
 time goes to zero. It doesn't seem to be disk related since we're running
 with a shared_buffers setting of 24G, which will fit just about our entire
 database into memory, and the IO transactions reported by the server, as
 well as the disk reads reported by Postgres stay consistently low.

 There have been reports that using very large shared_buffers can cause a
 lot of contention issues in the kernel, for some kernels. The usual advice
 is not to set shared_buffers above 8GB.  The operating system can use the
 rest of the memory to cache for you.

 Also, using a connection pooler and lowering the number of connections to
 the real database has solved problems like this before.


We're going to implement both of these changes tonight.  I was going to go
with 12G for shared_buffers based on Andrew's suggestion, but maybe I'll go
down to 8 if that seems to be the magic number.  We're also going to
decrease the max connections from 500 to 100 and decrease the pooled
connections per server.




  We've recently started tracking how long statements take to execute,
 and we're seeing some really odd numbers. A simple delete by primary key,
 for example, from a table that contains about 280,000 rows, reportedly took
 18h59m46.900s. An update by primary key in that same table was reported as
 7d 17h 58m 30.415s. That table is frequently accessed, but obviously those
 numbers don't seem reasonable at all.

 How are you tracking those?  Is it log_min_duration_statement or
 something else?


We're using log_min_duration_statement = 1000, sending the log messages to
syslog, then analyzing with pg_badger.
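
For anyone wanting to replicate that setup, it amounts to roughly the
following in postgresql.conf plus a pgbadger run over the syslog output
(a sketch; the syslog facility, log_line_prefix and pgbadger options here
are assumptions, not the exact configuration described above):

log_min_duration_statement = 1000       # log statements slower than 1 second
log_destination = 'syslog'
syslog_facility = 'LOCAL0'
syslog_ident = 'postgres'
log_line_prefix = '%t [%p]: '

# then, for example:
#   pgbadger --format syslog -o report.html /var/log/messages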



 Cheers,

 Jeff