Hello.

Sometimes I see a strange slowdown on my PG 9.1 server: it looks like the
simplest queries which typically take 1ms or less (e.g. selection of a row
by its primary key) take 300ms or even more. It is related to all queries
within the connection, not the single one: once upon a time all fast
queries start to work 300 times slower. In a couple of minutes the effect
disappears. The machine is not overloaded, number of connections is not too
high, mtr shows no packet loss.

E.g. see the following log from a particular website page I've been
refreshing (times are in milliseconds, you may see that typically all 22
queries take near 55ms, but within 3 minutes - from 20:21 till 20:23 approx
- even simplest queries took more than 300ms instead of 1ms, so the total
time is many seconds).

web-01: [2012-11-20 20:20:57] web07725] total=250        cpu=210
  php=203        dbq=22      db=55ms
web-01: [2012-11-20 20:21:06] web07752] total=6254       cpu=220
  php=219        dbq=22      db=6035ms    <--
web-01: [2012-11-20 20:21:48] web07718] total=7506       cpu=220
  php=222        dbq=22      db=7284ms    <--
web-01: [2012-11-20 20:22:15] web07717] total=11143      cpu=224
  php=225        dbq=22      db=10918ms   <--
web-01: [2012-11-20 20:23:44] web07725] total=4402       cpu=224
  php=220        dbq=22      db=4182ms    <--
web-01: [2012-11-20 20:24:03] web07725] total=268        cpu=212
  php=211        dbq=22      db=57ms

The machine is not overloaded during all this period (I've been logging
"top" and "pstree" each 10 seconds, results are quite typical):

Nov 20 20:23:07 pg-m03 mon: [Tue Nov 20 20:23:07 MSK 2012]
Nov 20 20:23:07 pg-m03 mon: init-+-crond
Nov 20 20:23:07 pg-m03 mon:      |-master-+-pickup
Nov 20 20:23:07 pg-m03 mon:      |        `-qmgr
Nov 20 20:23:07 pg-m03 mon:      |-postmaster---32*[postmaster]
Nov 20 20:23:07 pg-m03 mon:
 |-screen---sh---su---bash---monitoring.sh-+-logger
Nov 20 20:23:07 pg-m03 mon:      |
`-monitoring.sh---pstree
Nov 20 20:23:07 pg-m03 mon:      |-sshd---sshd---perl
Nov 20 20:23:07 pg-m03 mon:      |-syslog-ng
Nov 20 20:23:07 pg-m03 mon:      |-udevd
Nov 20 20:23:07 pg-m03 mon:      |-vzctl---bash---mc---bash
Nov 20 20:23:07 pg-m03 mon:      `-xinetd
Nov 20 20:23:07 pg-m03 mon:
Nov 20 20:23:08 pg-m03 mon: top - 20:23:08 up 113 days,  3:29,  0 users,
 load average: 2.16, 2.10, 2.00
Nov 20 20:23:08 pg-m03 mon: Tasks:   1 total,   0 running,   1 sleeping,
0 stopped,   0 zombie
Nov 20 20:23:08 pg-m03 mon: Cpu(s): 23.8%us,  2.8%sy,  0.0%ni, 72.7%id,
 0.8%wa,  0.0%hi,  0.0%si,  0.0%st
Nov 20 20:23:08 pg-m03 mon: Mem:   1463616k total,  1097776k used,
365840k free,        0k buffers

Also I've been logging all queries which are active in the database each 10
seconds. During the slowdown the database was not overloaded, no strange
queries, except one thing: during the whole slowdown period two pg
processes was in "<IDLE> in transaction" state, and in 3 minutes (see 3rd
column - it's a "time from start of the query") these processes
disappeared, and the slowdown disappeared too:

                                     | duration        | query
Nov 20 20:22:05 pg-m03 mon:    21586 | 00:00:37.571057 | <IDLE> in
transaction
Nov 20 20:22:05 pg-m03 mon:    21567 | 00:00:37.557606 | <IDLE> in
transaction
Nov 20 20:22:36 pg-m03 mon:    21586 | 00:01:09.119639 | <IDLE> in
transaction
Nov 20 20:22:36 pg-m03 mon:    21567 | 00:01:09.106188 | <IDLE> in
transaction
Nov 20 20:22:47 pg-m03 mon:    21586 | 00:01:19.637888 | <IDLE> in
transaction
Nov 20 20:22:47 pg-m03 mon:    21567 | 00:01:19.624437 | <IDLE> in
transaction
Nov 20 20:22:57 pg-m03 mon:    21586 | 00:01:30.155492 | <IDLE> in
transaction
Nov 20 20:22:57 pg-m03 mon:    21567 | 00:01:30.142041 | <IDLE> in
transaction
Nov 20 20:24:21 pg-m03 mon:    21586 | 00:02:54.294799 | <IDLE> in
transaction
Nov 20 20:24:21 pg-m03 mon:    21567 | 00:02:54.281348 | <IDLE> in
transaction
Nov 20 20:24:31 - no more "<IDLE> in transaction" (seems they disappeared
after 3 minutes timeout?)

Maybe these are correlated? But how could it be that two "idle" processes
cause slowdown of other connections (note: not "timeout", but exactly -
slowdown from 1ms to 300ms)?

What could be the problem of all these things? Such slowdown typically
happens 2-3 times per day, no direct correlation with checkpoints, disks
loads or something else. It just starts to work slow even on simplest
queries, and then it restores normal operations.

Reply via email to