PostgreSQL 6.4.2, compiled and installed on AIX 4.3.2 (a PowerPC
system).

Everything was going fairly well until this morning.  I noticed a
problem when I tried to do a routine query on my database this morning --
it's a front-page CGI that I've run dozens of times over the last few
days.  There was no response for several seconds, so I looked at the
database server.  Top said:

   PID USER     PRI NICE SIZE   RES STAT     TIME  CPU%     COMMAND
  7306 postgres  84   0 2864k 2428k  run  2:35:15 32.9/37.1 postgres
 43576 postgres  83   0 2412k 1536k  run  1:36:24 32.8/32.4 postgres
  7648 postgres  83   0 2252k 1796k  run  2:35:06 32.6/37.0 postgres

and "ps -fu postgres" showed 22 processes, most of them of the form

postgres 37180 27538   0 04:14:47      -  0:04 /usr/local/pgsql/bin/postgres daemon 
WWW.XXX.YYY.ZZZ sysinfo2 DELETE waiting 

(WWW.XXX.YYY.ZZZ is the primary IP address of the database server.)

I have a cron job that runs vacuum:

15 6 * * * . $HOME/.profile; echo 'vacuum;' | psql -d sysinfo2 >>vacuum.log 2>&1

At the tail of the vacuum.log file I had dozens of lines like this:

NOTICE:  Rel pg_type: TID 27/6: InsertTransactionInProgress 41947 - can't shrink 
relation
NOTICE:  Rel pg_type: TID 27/7: InsertTransactionInProgress 41947 - can't shrink 
relation
NOTICE:  Rel pg_type: TID 27/8: InsertTransactionInProgress 41947 - can't shrink 
relation

(Some were for pg_attribute, some for pg_class.)

There were many lines in the errlog file as well, but that got
clobbered when I stopped and restarted Postgres using my handy
startup/shutdown wrapper script:

-------------------------------------------------------------------------------
#!/bin/ksh

case "$1" in
        start)
                echo "Starting PostgreSQL: \c"
                su - postgres -c 'postmaster -i >errlog 2>&1' &
                echo "done."
                ;;
        stop)
                echo "Stopping PostgreSQL: \c"
                su postgres -c \
                        'kill `ps -fu postgres | grep -v grep |
                                grep postmaster | awk '\''{print $2}'\''`'
                echo "done."
                ;;
        *)
                echo "usage: $0 {start|stop}" 1>&2
                exit 1
                ;;
esac
-------------------------------------------------------------------------------

At that point the query I was trying worked again, but it looks like
some data that should have been updated this morning before the vacuum
are not there.  (One of the fields in one of the tables that are
updated is a timestamp.  Of the 27 updates, 10 succeeded -- the last of
them at 04:14:10.)

Are there any known reliability issues with Postgres 6.4.2?  Should I
schedule a daily or weekly restart of the postmaster through cron in
addition to the daily vacuum?

Perhaps some background detail would be useful.  This database should
see light activity most of the time (mostly queries, with perhaps a few
tiny updates throughout the day), but between 0400 and 0430 it gets hit
with a whole bunch (27, as noted above) of massive updates.  The
database holds configuration details for all the Unix systems we
support (copies of /etc/filesystems, network interfaces, SNA
configuration file, etc.), most of it in large objects (ugly).
Starting at around 0400, all the systems that we support send their
details to the database server, which updates the database with the new
reports (these are driven by crontab entries on the remote systems; but
the remote system clocks are not in sync, so incoming reports are
staggered a little bit).  It takes perl/Postgres about 10-15 seconds to
process an incoming report when I run one manually -- this includes
some non-database manipulations, of course.

Should I spread out the reports so that they don't hit the database
quite as hard?  I could stagger the report timing a bit more, so that
the load is spread more evenly, but it seems that Postgres must have a
bug -- it shouldn't have been so tangled up in the first place.

I can provide SQL and table schemas if anyone wants to see them....

(P.S. -- the "Report a Bug" link on the web page does not work.  It
causes my browser to display HTML in the frame, rather than a rendered
page.  This is with Netscape 4.5 under AIX, through two proxy servers
(the first is Apache, the second seems to be CERN httpd), but other pages
on postgresql.nextpath.com look fine.)

Reply via email to