On Nov 28, 2005, at 1:46 PM, Tom Lane wrote:

James Robinson <[EMAIL PROTECTED]> writes:
backtrace of the sshd doesn't look good:

Stripped executable :-( ... you won't get much info there.  What of
the client at the far end of the ssh connection?  You should probably
assume that the blockage is there, rather than in a commonly used bit
of software like ssh.

Ok: cron fired off a bash running our script, which performs the backup + vacuuming on the backup-box side, and that script was at the point of driving vacuumdb:

sscadmin 20612 20610 0 Nov25 ?  00:00:00 bash /usr/local/bin/db_backup.sh
sscadmin 20622 20612 0 Nov25 ?  00:00:00 ssh -c blowfish [ ssh identity file + host edited out ] /usr/local/pgsql/bin/vacuumdb -U postgres --all --analyze --verbose

[ yes, verbose vacuum. Who knew that'd be the camel-breaking straw ??! ]



The line in the script invoking the ssh'd vacuumdb is:
--
# Vacuum all databases, storing log results.
$SSHCMD $DBHOST /usr/local/pgsql/bin/vacuumdb -U postgres --all --analyze --verbose >& $DATE/vacuum.log
--
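
(For completeness, the variables in that line are presumably set earlier in db_backup.sh; roughly like this sketch -- the host and identity file are placeholders here, since the real ones were edited out of the ps listing above:)
--
# Sketch only -- real host and identity file deliberately omitted.
DATE=`date +%Y-%m-%d_%H`                        # per-run dir for dump + logs
DBHOST=db.internal.example                      # placeholder hostname
SSHCMD="ssh -c blowfish -i /path/to/identity"   # placeholder identity file
mkdir -p $DATE
--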

Unfortunately the dir holding that date + hour's vacuum.log was swept away by the next day's activities.

The stuck bash is backtraceable:

(gdb) bt
#0  0xffffe410 in ?? ()
#1  0xbffff928 in ?? ()
#2  0x00000000 in ?? ()
#3  0xbffff918 in ?? ()
#4  0xb7ed1513 in __waitpid_nocancel () from /lib/tls/libc.so.6
#5  0x080935bf in default_tty_job_signals ()
#6  0x080949ca in wait_for ()
#7  0x0808acd7 in execute_command_internal ()
#8  0x0808a4f0 in execute_command ()
#9  0x0808241d in reader_loop ()
#10 0x08081364 in main ()
(gdb)
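
(For the record, that trace came from attaching gdb to the still-running shell -- something along these lines, with 20612 being the bash pid from the ps listing:)
--
# Attach to the stuck bash (pid 20612), dump its stack, then detach so
# the process is left exactly as it was.
gdb -p 20612
(gdb) bt
(gdb) detach
(gdb) quit
--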

Nothing unexpected there. Funny, though: file reports /bin/bash as stripped (as it does for /usr/bin/ssh and /usr/sbin/sshd), yet I get far better debugging support out of it. Could the difference be ssh / sshd's apparent multithreadedness:

gdb /usr/bin/ssh
GNU gdb 6.2.1
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "i586-suse-linux"...(no debugging symbols found)...Using host libthread_db library "/lib/tls/libthread_db.so.1".

(gdb) run localhost
Starting program: /usr/bin/ssh localhost
(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...[Thread debugging using libthread_db enabled]
[New Thread 1078408704 (LWP 29932)]
(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is f4:cd:bc:37:d7:08:bc:4f:04:91:45:9b:44:cf:d5:b9.
Are you sure you want to continue connecting (yes/no)?
Program received signal SIGINT, Interrupt.
[Switching to Thread 1078408704 (LWP 29932)]
0xffffe410 in ?? ()
(gdb) bt
#0  0xffffe410 in ?? ()
#1  0xbfffb9e8 in ?? ()
#2  0x00000001 in ?? ()
#3  0xbfffb3af in ?? ()
#4  0x402f94b3 in __read_nocancel () from /lib/tls/libc.so.6
#5  0x080742e1 in mkstemp64 ()
#6  0x080684c8 in error ()
#7  0x08054e91 in ?? ()
#8  0xbfffcac0 in ?? ()
#9  0x00000001 in ?? ()
#10 0x08096230 in ?? ()
#11 0xbfffcac0 in ?? ()
....

I know you have far better things to do than teach someone how to use gdb on multithreaded programs, but could a proper backtrace be salvageable from the ssh client? If you really care, that is; otherwise I'm off to kill that vacuumdb client.
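
(Partly answering my own question: gdb can walk every thread's stack in one pass, so attaching to the hung client and doing the following ought to be a start -- though without debug symbols for openssh the frames will presumably stay mostly anonymous:)
--
# Attach to the hung ssh client (pid 20622 per the ps listing) and get a
# backtrace for every thread, not just the one gdb happens to stop in.
gdb -p 20622
(gdb) thread apply all bt
(gdb) detach
--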




At this moment in time, should we kill off the offending processes
from Nov 25 -- starting from the client-most side all the way to the
vacuumdb process on the production server? The other vacuums would
probably then complete happily, and we'd be cool again, eh?

If you just want to get out of it, killing the vacuumdb should be the
least dangerous way to get out of the problem.  I'd suggest taking a
little more time to try to find out what's stuck though.

Given that the other culprits in play are bash running a straightforward shell-script line with output redirected to a plain file on a non-full filesystem, I'm leaning towards the odds that something in the sshd + tcp/ip + ssh-client portion of things went crazy. That's simply where more of the complexity lives -- bash is doing nothing but waiting for the ssh client to finish, and stdout / stderr were redirected to the filesystem before the bash child exec'd ssh. Even if the filesystem had somehow filled up during the run, one would expect the ssh client's writes to the file to error out rather than block. Funky.
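
(If it wedges again, a cheap way to see exactly what the ssh client is stuck on, without wrestling gdb, would be something like:)
--
# Which syscall is the hung ssh client (pid 20622) sitting in?  A write()
# that never returns vs. a select()/read() on the socket tells very
# different stories.
strace -p 20622

# Or ask the kernel directly, without attaching to the process at all:
cat /proc/20622/wchan ; echo
grep State /proc/20622/status
--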



I suppose we're darn lucky the process ultimately got gummed up on a
table that sees no traffic at all, eh? The lock that vacuum has
taken out on it would prevent at least some things from happening to
the table in question -- possibly even new inserts or updates?

No, since it's a lazy vacuum it's not going to block either reads or
writes.  Just other vacuums and DDL changes on the table.

Good to know.
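
(Side note, mostly for my own future reference: the locks a running vacuum holds can be eyeballed from any other session via pg_locks -- something like the following, where the database name is a placeholder. A lazy vacuum should show up holding ShareUpdateExclusiveLock on the table it's currently chewing on:)
--
# From another session; "somedb" is a placeholder database name.
psql -U postgres -d somedb -c "
  SELECT c.relname, l.mode, l.granted, l.pid
    FROM pg_locks l JOIN pg_class c ON c.oid = l.relation
   WHERE l.relation IS NOT NULL;"
--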


Could this potentially be alleviated in the future by a little code
reordering in vacuumdb or the postmaster -- completing work on the
current table before emitting output,

Wouldn't help.  I would imagine that by the time we got to this state,
the backend side is quite a few tables past the point where the end
client stopped accepting output.  You'd normally expect there to be
several Kb worth of buffering in the network channel, and the backend
isn't going to be blocked till that's *all* used up.

But when it does eventually block, albeit several tables after the fact, it wouldn't be holding any postgres-level locks. Since this was a lazy vacuum, though, only other lazy vacuums or DDL would have been held up; had some other poor chap been running a full vacuum, it could have been worse.
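
(Incidentally, that "several Kb worth of buffering" ought to be directly observable next time around: on the production box, the sshd leg of the chain piles up bytes in Send-Q once the far side stops draining it. The vacuumdb-to-backend leg won't show up this way if it runs over the Unix socket.)
--
# On the production box: once the backup side stops reading, the sshd
# connection's Send-Q (and the backend's too, if vacuumdb connected over
# TCP) keeps growing and then sticks.
netstat -tnp | egrep ':22|:5432'
--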


BTW, I suppose this was a VACUUM VERBOSE?  Without the verbose output,
you'd likely not have seen any problem ...

Yep --  see above.

I do believe we've agreed internally to separate the vacuuming bit from the backup bit and drive it from a separate cron'd script on the primary database box itself, which simplifies the system for this case. And with 8.1 we'll move to the autovacuum service, which should obliterate the need altogether. Either way, our vacuuming would happen in a far simpler system -- no remote-host networking involved at all.
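
(Roughly what we have in mind -- the schedule and log path here are placeholders, not our real setup:)
--
# Crontab entry on the database box itself: local vacuum, no ssh involved.
# m  h  dom mon dow  command
30 3 * * * /usr/local/pgsql/bin/vacuumdb -U postgres --all --analyze >/var/log/pg_vacuum.log 2>&1
--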

Thanks again for all of your time and superb support.

----
James Robinson
Socialserve.com

