Re: Unkillable processes?
Jerry Feldman [EMAIL PROTECTED] writes:
> On Sat, 25 Feb 2006 02:36:01 -0500 [EMAIL PROTECTED] wrote:
>> When I encounter processes which are unresponsive to kill -9, I find that this generally works:
>>     runlevel   # say the current runlevel is 3
>>     telinit 1
>>     telinit 3
> This will almost always work, especially with zombie processes. What you are doing is transitioning into single-user mode.

Ahm yeah, that's not usually an option on a production server :)

> Then of course going to run level 6 tends to cure all ills :-)

Yes, yes it does. I usually try to find what the problem is using ps and lsof and a variety of other tricks to fix things before trying runlevel 6. On a production system, it's sometimes better to just leave the system alone and wait if it's not causing any major problems, as a reboot is often more disruptive than anything else.

--
Seeya, Paul
___
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss
Re: Unkillable processes?
On 2/25/06, Michael ODonnell [EMAIL PROTECTED] wrote:
> The kernel has no notions such as run level and single-user mode
> So, to the extent that frobbing your run level has any effect at all on any wedged processes, it's a safe bet that it's a distant side effect of all the other flailing that's going on.

See also: Shotgun debugging.
http://www.catb.org/jargon/html/S/shotgun-debugging.html
"... the making of relatively undirected changes to software in the hope that a bug will be perturbed out of existence ..."

:-)

--
Ben
Re: Unkillable processes?
On Sat, 25 Feb 2006 02:36:01 -0500 [EMAIL PROTECTED] wrote:
> When I encounter processes which are unresponsive to kill -9, I find that this generally works:
>     runlevel   # say the current runlevel is 3
>     telinit 1
>     telinit 3

This will almost always work, especially with zombie processes. What you are doing is transitioning into single-user mode. Many times, if you are in run level 5 (GUI on most systems), transitioning to run level 3 (multi-user mode, no GUI) will also do some very good cleanup. Then of course going to run level 6 tends to cure all ills :-)

--
Jerry Feldman [EMAIL PROTECTED]
Boston Linux and Unix user group
http://www.blu.org PGP key id:C5061EA9
PGP Key fingerprint:053C 73EC 3AC1 5C44 3E14 9245 FB00 3ED5 C506 1EA9
Re: Unkillable processes?
On Feb 25, 2006, at 08:15, Jerry Feldman wrote:
> This will almost always work, especially with zombie processes. What you are doing is transitioning into single-user mode.

Any idea what the underlying mechanism is that makes this work?

-Bill
--
Bill McGonigle, Owner           Work: 603.448.4440
BFC Computing, LLC              Home: 603.448.1668
[EMAIL PROTECTED]               Cell: 603.252.2606
http://www.bfccomputing.com/    Page: 603.442.1833
Blog: http://blog.bfccomputing.com/
VCard: http://bfccomputing.com/vcard/bill.vcf
Re: Unkillable processes?
On Sat, 25 Feb 2006 10:40:19 -0500 Bill McGonigle [EMAIL PROTECTED] wrote:
> On Feb 25, 2006, at 08:15, Jerry Feldman wrote:
>> This will almost always work, especially with zombie processes. What you are doing is transitioning into single-user mode.
> Any idea what the underlying mechanism is that makes this work?

First, the kill(1) command calls the kill(2) system call with very little else.

In Unix and Linux there is effectively a process tree. The /sbin/init(8) command is always process number 1 and is The Mother of All Processes. When you boot into run level 1 (Single User Mode) there should be no daemons running, and no file systems mounted other than root, unless you manually mounted them. The only other user processes running are generally those associated with the superuser. (There will be some kernel processes running, but those show up in square brackets, e.g. [kthread].)

When your system transitions into multi-user mode (either run level 3 or 5), a number of daemon processes are executed, as well as some gettys (virtual terminals). When you start a process from your shell, that process inherits not only that shell's environment, but also its ownership as whoever is logged in. In GUI mode, you have the X server running, then a display manager, then the window manager, and when you cause a process to start it has an ancestry.

When you transition down to single user mode, the system kills everything. Most unkillable processes tend to be locked into an effective deadlock that is resolved when you get rid of some of their peers and ancestors. Transitioning to single user mode also shuts down the network.

However, there can be cases where a process is locked in an indefinite I/O wait. These are very rare, and even single user mode won't clear them. The most common indefinite I/O wait that I have seen is NFS, where the network or the NFS server gets hosed. In that case, though, going to single user will clear it.
--
Jerry Feldman [EMAIL PROTECTED]
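Jerry's description of process ancestry can be made concrete. Below is a minimal Python sketch, assuming a Linux system with /proc mounted: it reads each process's parent PID from /proc/<pid>/stat and walks up from a given PID toward init. The helper names read_ppid_map and ancestry are made up for illustration; pstree(1) gives a similar view interactively.

```python
import os


def read_ppid_map(proc_root="/proc"):
    """Map pid -> parent pid for every process visible under /proc (Linux only)."""
    ppids = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/mounts
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                data = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        # The command name is wrapped in parentheses and may contain spaces,
        # so split after the last ')': the next fields are state, ppid, ...
        fields = data.rsplit(")", 1)[1].split()
        ppids[int(entry)] = int(fields[1])
    return ppids


def ancestry(pid, ppids):
    """Return [pid, parent, grandparent, ...].

    Stops at init (whose parent pid is 0) or at a pid whose parent
    is not in the map.
    """
    chain = [pid]
    while pid in ppids and ppids[pid] not in (0, pid):
        pid = ppids[pid]
        chain.append(pid)
    return chain
```

For example, ancestry(os.getpid(), read_ppid_map()) lists the script's own chain of parents, which on a typical login ends at 1. Killing an ancestor is what the run-level change does wholesale; this just lets you see which ancestors a stuck process has.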
Re: Unkillable processes?
>> This will almost always work, especially with zombie processes. What you are doing is transitioning into single-user mode.
> Any idea what the underlying mechanism is that makes this work?

The kernel has no notions such as run level and single-user mode - that's all state that's managed strictly in user space - and there's nothing special or sacred about any of those modes. You can do everything in single-user mode that you can do in any other mode, provided you're prepared to do a bunch of stuff by hand that's normally automated. The different modes are simply different rules governing which things are automated.

Meanwhile, a process that's wedged or unresponsive is one that's unable to make progress until certain conditions (managed by/in the kernel) are met, such conditions most often being associated (as others have mentioned) with device state where, too often, we're at the mercy of badly coded drivers :-(

So, to the extent that frobbing your run level has any effect at all on any wedged processes, it's a safe bet that it's a distant side effect of all the other flailing that's going on. I'm not saying that such frobbing never leads to that Fresh Feeling(tm), just that the cause-effect relationship is anything but straightforward or reliable. It always seems like a defeat when I'm reduced to such superstitious measures because it feels so similar to the Therapeutic Reboots certain other OSes require.

BTW, the term zombie is usually reserved for the specific case of a process that has already exited but has not yet been reaped - its parent (or init, for orphans) has not yet done a wait() on it, which is necessary for its process slot to be freed. I don't think that's what's being discussed here.
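To see the zombie definition above in action, here is a small Linux-only Python sketch (it assumes /proc is mounted and that SIGCHLD is not being ignored). The child exits immediately, and until the parent calls waitpid() it shows up in state Z. Sending it SIGKILL is accepted but changes nothing, which is why kill -9 never "works" on a zombie. The function names are illustrative only.

```python
import os
import signal
import time


def proc_state(pid):
    """Return the single-letter state of a process from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # Skip past the parenthesised command name, which may itself contain spaces.
    return data.rsplit(")", 1)[1].split()[0]


def zombie_demo():
    """Fork a child that exits at once; show it lingers as a zombie until reaped."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately, skipping Python cleanup
    # Parent: give the child up to about a second to finish exiting.
    for _ in range(100):
        if proc_state(pid) == "Z":
            break
        time.sleep(0.01)
    before_kill = proc_state(pid)
    os.kill(pid, signal.SIGKILL)  # "kill -9" a zombie: no error, no effect
    after_kill = proc_state(pid)
    os.waitpid(pid, 0)  # the parent's wait() is what frees the process slot
    return before_kill, after_kill
```

On a Linux box, zombie_demo() should return ("Z", "Z"): the process is already dead, and only its parent can make it go away.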
Re: Unkillable processes?
When I encounter processes which are unresponsive to kill -9, I find that this generally works:

    runlevel   # say the current runlevel is 3
    telinit 1
    telinit 3

For some reason, init seems to clean them up when nothing else does.
Re: Unkillable processes?
On 2/18/06, Paul Lussier [EMAIL PROTECTED] wrote:
>> That means they're in uninterruptible sleep ... Bad hardware or buggy device drivers are the most common cause of a process stuck in this state.
> Or, an NFS server from which this system mounted a file system has gone off the net.

Oh, yah, that. I don't have to use NFS very often. Thankfully. :)

> ... any process which stats all mounted file systems (think df, ls, etc.) hangs and can't be killed.

Won't mounting the NFS filesystems with the soft,intr options prevent that from happening in the first place? (The "can't be killed" part, that is -- obviously, they'll still choke if they try to contact a dead server.)

--
Ben
Re: Unkillable processes?
Bill McGonigle [EMAIL PROTECTED] writes:
> On Feb 18, 2006, at 21:26, Paul Lussier wrote:
>> Once that's done, configure the NIC to use the exact same IP as the NFS server which suddenly disappeared.
> Did you try to bring up a virtual interface on the problem machine with the original server's IP? Just curious if this would work. If so, it might be near-trivial to script such a thing by divining the destination IP, export, etc. from mount, /proc, netstat, lsof and friends, and wrap it all up in an nfskill(8) program to deal with the interface, route, etc.

I didn't, but that's because the wedged client in this case was my primary NFS server, which was NFS mounting a file system from our 'dogfood' system. We (I should specify I actually had nothing to do with this...) took down our test system to move it without making sure all systems were properly unmounted, which resulted in wedging my primary NFS server. I don't see why this wouldn't have worked, though, and I intend to find out sometime this week using some of our test equipment. I'll let you know!

--
Seeya, Paul
Re: Unkillable processes?
On 2/19/06, Paul Lussier [EMAIL PROTECTED] wrote:
>> Won't mounting the NFS filesystems with the soft,intr options prevent that from happening in the first place?
> If you have the O'Reilly book for NFS and NIS ...

Let's assume I don't. Please explain.

In the past, when working with NFS, soft,intr got me the behavior I wanted -- problems not on my machine didn't hang my machine, and failures were propagated up to higher levels as I/O errors (which they are). I do recall increasing some timeout or other based on somebody's recommendation at the time.

Granted, I've never used NFS all that much. Most of my experience with it was at UNH, where hung NFS mounts were a source of frustration. Again granted, that environment (the Space Science Center) wasn't a pillar of best practices, so maybe the whole setup was just hosed.

I do recall that on a diskless workstation, soft,intr doesn't get you anything, since if NFS is out you've lost all your filesystems and you're hosed anyway (so you might as well wait forever), but diskless workstations are *so* passé. ;-)

(I do have a copy of the book *somewhere*, but I can't find it right now. I suspect it's in my box o' books at the office, which I will likely need an excavator to exhume.)

--
Ben
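If you want to know which mounts on a client are exposed to the hang described above, a minimal Python sketch can scan text in /proc/mounts format and list NFS mounts that are hard (the default) and were not mounted with intr. The function names are made up for illustration. Note that, per the nfs(5) man page, kernels from 2.6.25 on ignore intr and let SIGKILL interrupt a pending NFS operation anyway, so this check matters mostly on older systems like the Red Hat 9 box in this thread.

```python
def hard_nfs_mounts(mounts_text):
    """Return mount points of NFS mounts that are hard and lack 'intr'.

    mounts_text uses the /proc/mounts layout:
        device mountpoint fstype options dump pass
    'hard' is the NFS default, so anything not marked 'soft' counts as hard.
    """
    risky = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        _device, mountpoint, fstype, options = fields[:4]
        # Exact match, so the server-side 'nfsd' pseudo-filesystem is skipped.
        if fstype not in ("nfs", "nfs4"):
            continue
        opts = set(options.split(","))
        if "soft" not in opts and "intr" not in opts:
            risky.append(mountpoint)
    return risky


def hard_nfs_mounts_on_this_host(path="/proc/mounts"):
    """Run the check against the live mount table (Linux only)."""
    with open(path) as f:
        return hard_nfs_mounts(f.read())
```

Whether you then switch such mounts to soft,intr is the trade-off Ben and Paul are debating: soft mounts turn a dead server into I/O errors, which some applications handle badly.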
The dread "Stale NFS file handle" (was Unkillable processes?)
I have a devious, wave-the-dead-chicken voodoo hack that will sometimes save the day when you're suffering the dread "Stale NFS file handle" error. It only unwedges NFS clients, and only sometimes, but when you're foaming at the mouth with frustration it's a good trick to know. (Hey, Paul - you may be amused to know that I developed this while working at MCLX around the time you were managing the NFS servers... ;-)

The following explanation is presented from my perspective, where I'm trying to regain access to my NFS-mounted home directory, but it should be applicable in other contexts if you tweak the appropriate things:

ASSUME:
 - Home directory resides on serverBox:/aaa/bbb/ccc/modonnell
 - Home directory is normally mounted on client system(s) locally as /h/modonnell
 - serverBox has suffered some fault leading to the dread "Stale NFS file handle" error, preventing access to my home directory but also preventing unmount/remount ops.

HACK:
 - Create a new directory on the client machine: /tmp/staleHack
 - mount serverBox:/aaa/bbb/ccc/modonnell /tmp/staleHack
 - ls -la /tmp/staleHack
 - ls -la /h/modonnell
 - umount /tmp/staleHack
 - rmdir /tmp/staleHack

Home directory /h/modonnell should once again be accessible.
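The HACK steps above can be wrapped up for reuse. This is a minimal Python sketch of a hypothetical helper, stale_handle_hack, that builds the same command sequence for a given export and wedged mount point and, only when dry_run is False, runs it (which needs root for mount/umount). It adds nothing to the trick itself; it just saves retyping paths when you're foaming at the mouth.

```python
import subprocess


def stale_handle_hack(export, mountpoint, scratch="/tmp/staleHack", dry_run=True):
    """Build (and optionally run) the stale-handle command sequence.

    export     -- the server export, e.g. "serverBox:/aaa/bbb/ccc/modonnell"
    mountpoint -- the wedged local mount, e.g. "/h/modonnell"
    scratch    -- temporary directory for the second mount of the same export
    """
    commands = [
        ["mkdir", "-p", scratch],
        ["mount", export, scratch],  # mount the same export a second time
        ["ls", "-la", scratch],      # touch the fresh mount...
        ["ls", "-la", mountpoint],   # ...then the stale one
        ["umount", scratch],
        ["rmdir", scratch],
    ]
    if not dry_run:
        for cmd in commands:
            # Keep going on failure: ls on the stale mount may still error out.
            subprocess.run(cmd, check=False)
    return commands
```

With the default dry_run=True it only returns the list, so you can eyeball the commands before trusting them on a real client.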
Re: Unkillable processes?
Ben Scott [EMAIL PROTECTED] writes:
> On 2/17/06, Dan Coutu [EMAIL PROTECTED] wrote:
>> On a Red Hat 9 system I've encountered a situation where there are two processes that I cannot kill when using kill -9 (or any other value, for that matter.)
> Do a ps aux and note their status. It's D, right? That means they're in uninterruptible sleep -- waiting for a system call to finish something that cannot be interrupted. The D stood for driver or disk originally. Bad hardware or buggy device drivers are the most common cause of a process stuck in this state.

Or, an NFS server from which this system mounted a file system has gone off the net. I've seen this quite often. You've got some system which exports a file system via NFS, it goes down, and on all clients which NFS mount from this server, any process which stats all mounted file systems (think df, ls, etc.) hangs and can't be killed.

> The only thing you can do is wait or reboot the system.

This isn't entirely true, *especially* if the cause is as I just described. If it is in fact a down NFS server from which the client didn't properly unmount the file system before it went away, there is a cure. That cure is to bring the NFS server back online.

However, it may be that you can't bring *that* NFS server back online for some reason, at least not any time soon. In that case, find some other system, configure it as an NFS server, and export *something* under the same name as the file system which is wedged. Once that's done, configure the NIC to use the exact same IP as the NFS server which suddenly disappeared. This won't fool NFS entirely, but just enough that the client will get a 'stale NFS file handle' error and allow you to umount the file system. I recently used this trick in exactly this way.

If this isn't enough detail, let me know, and I'll send out the postmortem I mailed to my dev group (which isn't all that much more detailed than this e-mail :)

> If the syscalls ever complete, the kernel will immediately process the kill signals you sent, so those processes are dead, they just don't know it yet. :)

And that's what this NFS spoofing trick does. It essentially allows the processes enough room to complete and die :)

--
Seeya, Paul
Re: Unkillable processes?
On Friday, Feb 17th 2006 at 13:58 -0500, quoth Dan Coutu:
=Just to add more confusion to the mix, or maybe a useful clue, the system load
=average is about 4 but top shows 97% system idle time. Strange.

You've gotten good advice; I just wanted to add one thing. The load average is simply an average of the number of processes that are in the run queue over a specific period of time (on Linux, processes in uninterruptible sleep, state D, are counted as well). It just means that you have four processes which sit in the run queue and are either doing nothing or doing I/O which does not consume much CPU.

--
Time flies like the wind. Fruit flies like a banana. Stranger things have happened but none stranger than this. Does your driver's license say Organ Donor? Black holes are where God divided by zero. Listen to me! We are all individuals! What if this weren't a hypothetical question?
steveo at syslang.net
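The "load 4, CPU idle" puzzle can be reproduced with a small numerical model. As far as I know, Linux updates the 1-minute load average about every 5 seconds as an exponentially damped average of the number of runnable plus uninterruptible (D-state) tasks; the kernel does this in fixed-point arithmetic, so the floating-point Python sketch below is an approximation of the idea, not the kernel's code. The function names are illustrative.

```python
import math


def update_load(load, active, interval=5.0, period=60.0):
    """One step of an exponentially damped moving average.

    active   -- tasks counted at this sample (on Linux: runnable plus D-state)
    interval -- seconds between samples
    period   -- averaging window: 60 s for the 1-minute figure
    """
    decay = math.exp(-interval / period)
    return load * decay + active * (1.0 - decay)


def simulate(active_counts, interval=5.0, period=60.0):
    """Feed samples (one every `interval` seconds) through update_load, starting at 0."""
    load = 0.0
    for n in active_counts:
        load = update_load(load, n, interval, period)
    return load


# Four processes stuck in D state on an otherwise idle CPU, sampled every 5 s:
one_minute = simulate([4] * 12)    # 4 * (1 - e**-1), about 2.53
ten_minutes = simulate([4] * 120)  # within a fraction of a percent of 4.0
```

None of the four tasks uses any CPU, yet the 1-minute figure settles at 4, which matches what Dan is seeing with top reporting 97% idle.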
Re: Unkillable processes?
On Feb 18, 2006, at 21:26, Paul Lussier wrote:
> Once that's done, configure the NIC to use the exact same IP as the NFS server which suddenly disappeared.

Did you try to bring up a virtual interface on the problem machine with the original server's IP? Just curious if this would work. If so, it might be near-trivial to script such a thing by divining the destination IP, export, etc. from mount, /proc, netstat, lsof and friends, and wrap it all up in an nfskill(8) program to deal with the interface, route, etc.

-Bill
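Bill's nfskill(8) is hypothetical, but the "divining" step is easy to sketch. Assuming Linux /proc/mounts format and IPv4 servers, this Python fragment pulls the server name, export path and, when the kernel reports an addr= mount option, the server's IP for each NFS mount. It also builds the iproute2 command that would add that IP as an extra address on an interface (eth0 is an assumed default). Whether the client then talks to a stand-in server is exactly the untested question in this thread, and claiming another host's address can confuse the rest of the network.

```python
def nfs_servers(mounts_text):
    """Map each NFS mount point to (server, export, address).

    mounts_text uses the /proc/mounts layout. address comes from the
    'addr=' mount option when the kernel reports it, otherwise None.
    Assumes IPv4 or hostname device fields such as "server:/export".
    """
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        device, mountpoint, _fstype, options = fields[:4]
        server, _, export = device.partition(":")
        addr = None
        for opt in options.split(","):
            # 'clientaddr=' and 'mountaddr=' don't match this prefix.
            if opt.startswith("addr="):
                addr = opt[len("addr="):]
        result[mountpoint] = (server, export, addr)
    return result


def alias_command(addr, iface="eth0"):
    """iproute2 command that would add `addr` as an extra address on `iface`."""
    return ["ip", "addr", "add", f"{addr}/32", "dev", iface]
```

The rest of a real nfskill (starting an NFS server with a matching export, then removing the address again with `ip addr del`) is left out on purpose.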
Unkillable processes?
Okay, here's a strange one.

On a Red Hat 9 system I've encountered a situation where there are two processes that I cannot kill when using kill -9 (or any other value, for that matter). What's the deal with that? The only kind of process that I've ever run across that I could not kill was a zombie, and neither one of these is a zombie.

Just to add more confusion to the mix, or maybe a useful clue, the system load average is about 4 but top shows 97% system idle time. Strange.

Any ideas?

Dan
Re: Unkillable processes?
On Friday 17 February 2006 01:58 pm, Dan Coutu wrote:
> Okay, here's a strange one. On a Red Hat 9 system I've encountered a situation where there are two processes that I cannot kill when using kill -9 (or any other value, for that matter.)

The processes could be in an I/O lock, maybe trying to access an NFS share with hard locking and without the INTR option set?

> Just to add more confusion to the mix, or maybe a useful clue, the system load average is about 4 but top shows 97% system idle time. Strange.

That's suspicious, but I suppose not entirely impossible. Have you done anything like chkrootkit on it, just for kicks?

-Neil
Re: Unkillable processes?
On Fri, Feb 17, 2006 at 02:35:15PM -0500, Neil Schelly wrote:
> On Friday 17 February 2006 01:58 pm, Dan Coutu wrote:
>> Okay, here's a strange one. On a Red Hat 9 system I've encountered a situation where there are two processes that I cannot kill when using kill -9 (or any other value, for that matter.)
> The processes could be in an I/O lock, maybe trying to access an NFS share with hard locking and without the INTR option set?
>> Just to add more confusion to the mix, or maybe a useful clue, the system load average is about 4 but top shows 97% system idle time. Strange.
> That's suspicious, but I suppose not entirely impossible. Have you done anything like chkrootkit on it, just for kicks?

IO locks will cause the load to be high with a low idle time.

-Mark
Re: Unkillable processes?
On 2/17/06, Dan Coutu [EMAIL PROTECTED] wrote:
> On a Red Hat 9 system I've encountered a situation where there are two processes that I cannot kill when using kill -9 (or any other value, for that matter.)

Do a ps aux and note their status. It's D, right? That means they're in uninterruptible sleep -- waiting for a system call to finish something that cannot be interrupted. The D stood for driver or disk originally. Bad hardware or buggy device drivers are the most common cause of a process stuck in this state.

The only thing you can do is wait or reboot the system. If the syscalls ever complete, the kernel will immediately process the kill signals you sent, so those processes are dead, they just don't know it yet. :)

--
Ben
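Ben's ps aux check can be scripted. A minimal Python sketch, assuming the procps ps found on Red Hat systems: it lists processes whose STAT column begins with D, together with the WCHAN column, which names the kernel function the process is sleeping in and can hint at the culprit (an NFS or SCSI routine, say). The function names are made up for illustration.

```python
import subprocess


def d_state_processes(ps_output):
    """Parse `ps -eo pid,stat,wchan,comm` output; return rows whose state starts with D."""
    stuck = []
    for line in ps_output.splitlines()[1:]:  # first line is the column header
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue
        pid, stat, wchan, comm = fields
        if stat.startswith("D"):  # D = uninterruptible sleep; may carry flags like "D+"
            stuck.append((int(pid), stat, wchan, comm))
    return stuck


def d_state_processes_now():
    """Run ps on this host and return its D-state processes."""
    out = subprocess.run(
        ["ps", "-eo", "pid,stat,wchan,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    return d_state_processes(out)
```

If the same PIDs keep showing up with the same WCHAN over several minutes, that's the "dead but doesn't know it yet" case Ben describes, and kill -9 will only take effect once the underlying call returns.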
Re: Unkillable processes?
On Friday 17 February 2006 1:58 pm, Dan Coutu wrote:
> Okay, here's a strange one. On a Red Hat 9 system I've encountered a situation where there are two processes that I cannot kill when using kill -9 (or any other value, for that matter.) What's the deal with that? The only kind of process that I've ever run across that I could not kill was a zombie and neither one of these is a zombie. Just to add more confusion to the mix, or maybe a useful clue, the system load average is about 4 but top shows 97% system idle time. Strange.

We ran into the same type of thing on a RHEL 3.0 Update 4 system a few weeks ago. I think we decided that it was waiting on I/O that did not complete.

--
Jerry Feldman [EMAIL PROTECTED]
Re: Unkillable processes?
Ben Scott wrote:
> On 2/17/06, Dan Coutu [EMAIL PROTECTED] wrote:
>> On a Red Hat 9 system I've encountered a situation where there are two processes that I cannot kill when using kill -9 (or any other value, for that matter.)
> Do a ps aux and note their status. It's D, right? That means they're in uninterruptible sleep -- waiting for a system call to finish something that cannot be interrupted. The D stood for driver or disk originally. Bad hardware or buggy device drivers are the most common cause of a process stuck in this state. The only thing you can do is wait or reboot the system. If the syscalls ever complete, the kernel will immediately process the kill signals you sent, so those processes are dead, they just don't know it yet. :)
> -- Ben

Hmm, the I/O wait seems likely. We've been having trouble with an IOMega REV 10 disk autoloader ever since we bought the thing. We even swapped it out for a new one but still get flaky behavior. Maybe it's time to send the thing back...

Dan
Re: Unkillable processes?
On 2/17/06, Dan Coutu [EMAIL PROTECTED] wrote:
> Hmm, the I/O wait seems likely. We've been having trouble with an IOMega REV 10 disk autoloader ever since we bought the thing.

Yikes! I've never, ever encountered an IOMega product that didn't suck in some major way. No wonder it doesn't work. Maybe the kernel is just refusing to have anything to do with such a crummy product on principle.

--
Ben "Click of death" Scott
Re: Unkillable processes?
On Feb 17, 2006, at 15:55, Ben Scott wrote:
> I've never, ever encountered an IOMega product that didn't suck in some major way. No wonder it doesn't work.

Hey, my first Linux box ran off a 150MB Bernoulli drive hooked up to my Sound Blaster. That was before Iomega gave up on Bernoulli-effect media for the mass market, of course.

-Bill