Re: [HACKERS] Stats Collector Oddity

2006-12-21 Thread Chris Browne
[EMAIL PROTECTED] (Tom Lane) writes:
 Chris Browne [EMAIL PROTECTED] writes:
 We're getting a bit of an anomaly relating to pg_stat_activity...
 ...
 That PID has been dead for several days, but this connection is marked
 as being open, still, after lo many days.

 This probably just means that the backend termination stats message
 got dropped due to heavy load.  That's expected behavior in all pre-8.2
 releases: the stats system was never intended to provide
 guaranteed-exactly-correct status.  PG 8.2 has reimplemented the
 pg_stat_activity view to make it more trustworthy.  (The other stuff is
 still probabilistic, but being just event counters, message loss isn't
 so obvious.)

That seems a *bit* surprising; the system wasn't expected to be under
particularly heavy load during the period in question; I would have
expected particularly light load.  No matter; there may have been
some brief heavy load to cause this.

There isn't any way, short of restarting the postmaster, to get rid of
that PID, is there?
-- 
cbbrowne,@,linuxdatabases.info
http://cbbrowne.com/info/linuxdistributions.html
High-level languages are a pretty good indicator that all else is
seldom equal. - Tim Bradshaw, comp.lang.lisp

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Stats Collector Oddity

2006-12-21 Thread Tom Lane
Chris Browne [EMAIL PROTECTED] writes:
 There isn't any way, short of restarting the postmaster, to get rid of
 that PID, is there?

The entry will get overwritten when that BackendId slot gets re-used,
so just starting enough concurrent backends should do it.  (Since
incoming backends always take the lowest free slot, the fact that the
dead entry has persisted awhile means that it must have a number higher
than your normal number of concurrent sessions ... which is evidence
in favor of the idea that it happened during a load spike ...)

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Stats Collector Oddity

2006-12-21 Thread Christopher Browne
A long time ago, in a galaxy far, far away, [EMAIL PROTECTED] (Tom Lane) wrote:
 Chris Browne [EMAIL PROTECTED] writes:
 There isn't any way, short of restarting the postmaster, to get rid of
 that PID, is there?

 The entry will get overwritten when that BackendId slot gets re-used,
 so just starting enough concurrent backends should do it.  (Since
 incoming backends always take the lowest free slot, the fact that the
 dead entry has persisted awhile means that it must have a number higher
 than your normal number of concurrent sessions ... which is evidence
 in favor of the idea that it happened during a load spike ...)

Cool.  I started up a nice little bunch of psql sessions in the
background, and then once they were all up, shut down my shell
session, thereby eliminating them.  And that did, indeed, clear out
that pg_stat_activity entry.

... And five minutes later, Nagios sent me message indicating that
node had recovered from having an ancient open connection.

I'll re-add a few gratuitous details here in the hopes that that makes
this easily findable if anyone else should search for the issue...

The Problem:
 - pg_stat_activity was reporting an elderly transaction in progress

 - that backend process wasn't running anymore

 - pg_stat_activity *was* reporting other legitimate activity; this
   was not the scenario where it had gotten deranged (normally due to
   excessive load)

 - Per Tom's comments, there evidently *was* some load spike where
   the closing of this particular connection did not get logged by
   the stats collector

The Solution: 

 - We needed to roll the stats collector through a bunch of its slots
   in order to clean the apparently-still-populated entry out.

 - Ran, in a shell:
 for i in `seq 100`; do
 psql 
 done

   That left 100 psql sessions in the background, all connected to the
   database backend.

 - Closed the shell.  That then HUPped the 100 psql sessions.

That got the offending pg_stat_activity entry cleared out.
-- 
let name=cbbrowne and tld=gmail.com in name ^ @ ^ tld;;
http://linuxdatabases.info/info/finances.html
Temporary tattoos  are a CRITICAL  ELEMENT  of our security strategy.
To suggest otherwise is sheer lunacy.  -- Reid Fleming, cDc

---(end of broadcast)---
TIP 6: explain analyze is your friend


[HACKERS] Stats Collector Oddity

2006-12-20 Thread Chris Browne
We're getting a bit of an anomaly relating to pg_stat_activity...

oxrstld=# SELECT * from pg_stat_activity where current_query  'IDLE';
   datid| datname | procpid | usesysid | usename | current_query |  
query_start  
+-+-+--+-+---+---
 1347729970 | oxrstld |  893094 |  122 | tldepp  | commit| 
2006-12-16 19:34:08.583978+00
(1 row)
oxrstld=# select version();
   version
--
 PostgreSQL 7.4.12 on powerpc-ibm-aix5.3.0.0, compiled by GCC gcc (GCC) 4.0.1
(1 row)

That PID has been dead for several days, but this connection is marked
as being open, still, after lo many days.

This has *traditionally* been a sign that the stats collector has been
clobbered, at which point pg_stat_activity becomes useless.  That's
not the case here; a less restricted query on pg_stat_activity shows
other more recent data that keeps changing, seemingly consistent with
system activity.

I can track most of the lifecycle of that PID; the connection was
established at 2006-12-16 18:46:38, and I see a few errors associated
with the PID (violations of unique constraints; usual business stuff).
I see nothing in the logs at around 19:34 which would suggest a reason
for any strange behaviour.  There are no records about that PID after
19:34...

I'd like to get rid of this entry, if I can; it's blowing up tests
that warn us about elderly transactions, causing a false positive...
-- 
output = reverse(moc.enworbbc @ enworbbc)
http://www3.sympatico.ca/cbbrowne/
Rules  of the  Evil Overlord  #212. I  will not  send  out battalions
composed wholly of robots or  skeletons against heroes who have qualms
about killing living beings.  http://www.eviloverlord.com/

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Stats Collector Oddity

2006-12-20 Thread Tom Lane
Chris Browne [EMAIL PROTECTED] writes:
 We're getting a bit of an anomaly relating to pg_stat_activity...
 ...
 That PID has been dead for several days, but this connection is marked
 as being open, still, after lo many days.

This probably just means that the backend termination stats message
got dropped due to heavy load.  That's expected behavior in all pre-8.2
releases: the stats system was never intended to provide
guaranteed-exactly-correct status.  PG 8.2 has reimplemented the
pg_stat_activity view to make it more trustworthy.  (The other stuff is
still probabilistic, but being just event counters, message loss isn't
so obvious.)

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match