By the way, in spite of my questions and concerns, I was *very* impressed by 
the recovery process.  I know it might seem like old hat to you guys to watch 
the WAL in action, and I know on a theoretical level it's supposed to work, but 
watching it recover 150 separate databases and find and fix a couple of
problems was very impressive.  It gives me great confidence that I made the 
right choice to use Postgres.

Richard Huxton wrote:
>>  2. Why didn't the database recover?  Why are there two processes
>>     that couldn't be killed?
>
> I'm guessing it didn't recover *because* there were two processes that
> couldn't be killed. Responsibility for that falls to the operating
> system. I've seen it most often with faulty drivers or hardware that's
> being communicated with/written to. However, see below.
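
For anyone who hits the same thing: a process that ignores even "kill -9" 
is almost always stuck in uninterruptible kernel sleep, shown as state "D" 
in ps. Here's a minimal, Linux-only C sketch of checking that via /proc -- 
my own illustration, nothing from Postgres -- where the pid argument is 
whatever ps reports for the stuck backend:

#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }

    char path[64];
    snprintf(path, sizeof path, "/proc/%s/stat", argv[1]);

    FILE *f = fopen(path, "r");
    if (f == NULL) {
        perror(path);
        return 1;
    }

    int pid;
    char comm[256], state;
    /* /proc/<pid>/stat begins: pid (comm) state ... */
    if (fscanf(f, "%d %255s %c", &pid, comm, &state) == 3) {
        printf("pid %d %s state=%c\n", pid, comm, state);
        if (state == 'D')
            puts("uninterruptible sleep: no signal, not even "
                 "SIGKILL, gets delivered");
    }
    fclose(f);
    return 0;
}

If the two stuck backends show "D" there, the kernel (a driver, the 
filesystem, or the disk) is what's holding them, which fits your diagnosis.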

It can't be a coincidence that these were the only two processes in a SELECT 
operation.  Does the server disable signals at critical points?

I'd make a wild guess that this is some sort of deadlock problem -- these two 
server processes have disabled signals during a critical section of a SELECT, 
and are waiting for something from the postmaster, but the postmaster is dead.
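
To make the question concrete: the standard way a Unix server holds off 
signals across a critical section is sigprocmask(). The sketch below is my 
own illustration of that mechanism, not PostgreSQL source. Note that SIGKILL 
can never be masked this way, so signal blocking alone can't explain a 
process that survives kill -9:

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void do_critical_work(void)
{
    /* Stand-in for work that must not be interrupted midway. */
    sleep(5);
}

int main(void)
{
    sigset_t block, saved;

    sigemptyset(&block);
    sigaddset(&block, SIGTERM);
    sigaddset(&block, SIGINT);

    /* Enter critical section: hold off SIGTERM/SIGINT. */
    sigprocmask(SIG_BLOCK, &block, &saved);

    do_critical_work();   /* a "kill <pid>" sent now stays pending */

    /* Leave critical section: pending signals are delivered here. */
    sigprocmask(SIG_SETMASK, &saved, NULL);

    puts("critical section finished");
    return 0;
}

A pending SIGTERM is delivered the instant the mask is restored, so a 
backend blocked like this should die as soon as it leaves the critical 
section -- unless it's waiting forever on something, as guessed above.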

This is an ordinary system with no known hardware problems: stock RH FC3 
kernel, stock PG 8.1.4, 4 GB of memory, and at the moment the database is 
running on a single SATA disk.  I'm worried that a production server can get 
into a state that requires manual intervention to recover.

Craig
