Hello. Today I ran into the following situation.
While a print filter script was writing data to a USB printer (actually,
'cat data > /dev/usb/lp0' was running), an unexpected USB disconnect
happened (our printer is somewhat buggy and sometimes behaves as if it had
been disconnected; in such cases we usually restart it and it works again).
The result was that both the 'cat' and 'khubd' processes hung in a busy
loop (top showed each in state R using 100% CPU, and keventd/2 at 20%; this
was on a dual-Xeon system). It was not possible to kill the hung 'cat'
process. The system log got several hundred messages 'usb0: error -19
reading printer status'; it looks like the printk buffer then overflowed
and the messages stopped; restarting klogd produced some more copies of
the message.
The error message helped me identify the loop in the kernel where it
hung. It is in usblp_write(), in the following code:
	while (writecount < count) {
		if (!usblp->wcomplete) {
			...
		}
		down (&usblp->sem);
		if (!usblp->present) {
			up (&usblp->sem);
			return -ENODEV;
		}
		if (usblp->writeurb->status != 0) {
			if (usblp->quirks & USBLP_QUIRK_BIDIR) {
				if (!usblp->wcomplete)
					err("usblp%d: error %d writing to printer",
						usblp->minor, usblp->writeurb->status);
				err = usblp->writeurb->status;
			} else
				err = usblp_check_status(usblp, err);
			up (&usblp->sem);
			/* if the fault was due to disconnect, let khubd's
			 * call to usblp_disconnect() grab usblp->sem ...
			 */
			schedule ();
			continue;
		}
		...
	}
It looks like (!usblp->wcomplete) was false, (!usblp->present) was also
false, and (usblp->writeurb->status != 0) was true, so the code just spun
in this loop forever, ignoring any signals.
Since this was a production server running several user X sessions, I
tried to 'fix' the situation without a reboot by writing a tiny kernel
module that locates the 'usblp' object from that code and sets
'usblp->present' to false. When I insmoded it, the busy loop was indeed
broken and the 'cat' process at last got its SIGKILL (which more or less
confirms my guess about where the code hung), but khubd got an oops. Later
attempts to recover from the situation failed (rmmoding the usb modules
hung on semaphores, so I started to force the semaphores up by insmoding
code, but at some point I probably mistyped a binary address and the whole
system crashed).
Anyway, this looks like a bug in the code above. It's clear that a
busy loop is possible there. Maybe it should at least check for signals
after returning from schedule()?
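
To illustrate what I mean, here is a minimal userspace sketch of the retry
path, not a kernel patch. All model_* names are made up for this example;
in the driver the check would presumably be signal_pending(current), and
returning -EINTR is my assumption about what the write should return in
that case.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the fields of struct usblp used in the loop. */
struct model_usblp {
	bool present;     /* still true: the disconnect was never processed */
	int  urb_status;  /* models usblp->writeurb->status, stuck non-zero */
};

/* Hypothetical stand-in for the writing task and its pending signal. */
struct model_task {
	long passes;      /* loop iterations executed so far */
	long kill_after;  /* a signal counts as pending from this pass on */
};

/* Models signal_pending(current) for the task above. */
static bool model_signal_pending(const struct model_task *t)
{
	return t->passes >= t->kill_after;
}

/*
 * Models the retry path of the write loop. Returns a negative errno if the
 * loop is left, 0 if the URB status is clean, or 1 if max_passes iterations
 * went by without leaving the loop (i.e. it is busy-looping).
 */
static long model_write(struct model_usblp *dev, struct model_task *task,
			bool check_signals, long max_passes)
{
	assert(dev != NULL && task != NULL);

	while (task->passes < max_passes) {
		task->passes++;
		if (!dev->present)
			return -ENODEV;
		if (dev->urb_status != 0) {
			/* the driver calls schedule() here, then retries */
			if (check_signals && model_signal_pending(task))
				return -EINTR;  /* assumed return value */
			continue;
		}
		return 0;  /* submitting the next URB is not modelled */
	}
	return 1;  /* still spinning */
}
```

With present stuck at true and a non-zero URB status, calling model_write()
with check_signals == false never leaves the loop on its own, while with
check_signals == true it returns as soon as the signal is pending.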
Kernel 2.6.10 from debian package kernel-image-2.6.10-1-686-smp, version
2.6.10-6.
