On Sat, 8 Jan 2000, Brynn Rogers wrote:
> I have a Quad Pentium Pro running a buslogic BT-956C and an Adaptec 1542B
> (as a module). The aha1542 has only an external CD jukebox with a
> Toshiba XM-3401TA in it. After I mount the bad disk (which works
> fine), when I try to do an ls -R on that disk
> my system will lock up hard, and the cpu activity lights will show
> only one CPU active. I am running the redhat 6.0 kernel (2.2.5-15smp).
> Is there some bug in the driver that lets this scratched up CDROM lock
> the system?
Obviously the answer is yes, but offhand I don't know what it
might be. I went through and found several bugs in the 2.3 series when I
was running tests of my own using a scuffed up CDROM. Ultimately I got
it to the point where there were no bad side-effects at all, once I
cleaned up some of the problems I identified.
Suffice it to say that the code in 2.2 and in 2.3 is radically
different, and bugs that I found in 2.3 might not apply. Many of the bugs
that I found were specific to the new queueing code in 2.3, for example. That
being said, it shouldn't be *that* hard to bring the 2.2 kernel up to the
same level of reliability. FWIW I was also using a 1542.
I would like it if you could take the lead in trying to figure out
what is wrong, as I am kind of tied up with 2.3 related things. If need
be, I could build and test such a kernel, but I would rather not at the
moment. For one thing, my test machine is uniprocessor (although I tend
to build with SMP turned on, as this is better at catching bugs). My
dual-processor machine doesn't have the 1542. I will give you a couple of
pointers that I used when doing this stuff:
1) On my system, e2fsck on my partitions was painfully slow. In
cases where I suspected a possible system crash, I basically first brought
the system down to single-user mode. This is done with:
/sbin/telinit 1
One of the bugs I found in the 2.3 series kernel was that the error
handler thread shut down when the system went to single-user mode, because
init sends SIGTERM and SIGKILL to everything and the thread accepted both.
I believe that this bug still exists in the 2.2 series kernels. Without
an error handler thread, anyone attempting to access the cdrom will be
blocked forever. The patches that I am enclosing correct this situation.
2) Unmount all but /, and remount / as readonly. This protects
your / partition from damage, and speeds up the reboot.
3) Turn on logging. Usually something like:
echo "scsi log all" > /proc/scsi/scsi
will work, but only if the kernel was built with logging enabled
(CONFIG_SCSI_LOGGING). If it wasn't, you would need to rebuild the
kernel with that option turned on.
4) Torture the cdrom. You will get tons of messages on the
console indicating what is going on, and what the thing is trying to do.
At this point, note down everything left on the screen. Next, try
switching VC to see if this works. Also, try the Shift-scroll lock to see
if the system prints anything out. These two things are important as
they would give us information about what might be going on. In
particular, if somebody forgot to release a lock, it is easy to imagine
that the whole system will get badly wedged. If we ended up in an
interrupt handler trying to grab a held lock, then the Shift-scroll lock
won't work.
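To illustrate the signal mask problem from point 1, here is a minimal
userspace sketch in C. It is not kernel code: it assumes POSIX
sigprocmask() as a rough stand-in for what siginitsetinv() does to
current->blocked, and the helper names block_all_except(), is_pending()
and demo_shutdown_mask() are made up for this example. It blocks
everything except SIGHUP, then shows that a SIGTERM stays pending while a
SIGHUP still gets through.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Set by the SIGHUP handler; plays the role of "thread told to shut down". */
static volatile sig_atomic_t got_shutdown = 0;

static void on_shutdown(int signo)
{
    (void)signo;
    got_shutdown = 1;
}

/*
 * Block every signal except `allowed`.  Userspace analogue (an assumption
 * for illustration) of siginitsetinv(&current->blocked, sigmask(allowed)).
 * Unlike a kernel thread, a user process can never block SIGKILL/SIGSTOP;
 * sigprocmask() silently ignores the attempt.
 */
static int block_all_except(int allowed)
{
    sigset_t mask;

    if (sigfillset(&mask) != 0)
        return -1;
    if (sigdelset(&mask, allowed) != 0)
        return -1;
    return sigprocmask(SIG_SETMASK, &mask, NULL);
}

/* Returns 1 if `signo` was sent but is being held back by the mask. */
static int is_pending(int signo)
{
    sigset_t pend;

    if (sigpending(&pend) != 0)
        return -1;
    return sigismember(&pend, signo);
}

/* Returns 1 if SIGTERM was held back and SIGHUP was delivered. */
static int demo_shutdown_mask(void)
{
    struct sigaction sa;
    int term_held;

    sa.sa_handler = on_shutdown;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGHUP, &sa, NULL) != 0)
        return -1;
    if (block_all_except(SIGHUP) != 0)
        return -1;

    raise(SIGTERM);                 /* blocked: stays pending, no effect */
    term_held = is_pending(SIGTERM);
    raise(SIGHUP);                  /* not blocked: handler runs now */

    return term_held == 1 && got_shutdown == 1;
}
```

Calling demo_shutdown_mask() from a small main() should return 1 on a
POSIX system. Note that it leaves the mask in place, so SIGTERM remains
blocked for the rest of the process.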
> I have tried to upgrade my kernel to 2.2.14 which works fine, BUT the
> new kernel does not bring up my RAID-5 array of disks, I always have to
> go back to the stock redhat kernel. Anyone have any Idea what is up
> there?
No idea here.
-Eric
--- ./drivers/scsi/scsi.c.~1~ Tue Jan 4 13:12:21 2000
+++ ./drivers/scsi/scsi.c Sun Jan 9 13:47:25 2000
@@ -3040,7 +3040,7 @@
struct semaphore sem = MUTEX_LOCKED;
shpnt->eh_notify = &sem;
- send_sig(SIGKILL, shpnt->ehandler, 1);
+ send_sig(SIGHUP, shpnt->ehandler, 1);
down(&sem);
shpnt->eh_notify = NULL;
}
--- ./drivers/scsi/scsi_error.c.~1~ Mon Aug 9 15:04:40 1999
+++ ./drivers/scsi/scsi_error.c Sun Jan 9 13:49:25 2000
@@ -35,7 +35,19 @@
#include "hosts.h"
#include "constants.h"
-#define SHUTDOWN_SIGS (sigmask(SIGKILL)|sigmask(SIGINT)|sigmask(SIGTERM))
+/*
+ * We must always allow SHUTDOWN_SIGS. Even if we are not a module,
+ * the host drivers that we are using may be loaded as modules, and
+ * when we unload these, we need to ensure that the error handler thread
+ * can be shut down.
+ *
+ * Note - when we unload a module, we send a SIGHUP. We mustn't
+ * enable SIGTERM, as this is how the init shuts things down when you
+ * go to single-user mode. For that matter, init also sends SIGKILL,
+ * so we mustn't enable that one either. We use SIGHUP instead. Other
+ * options would be SIGPWR, I suppose.
+ */
+#define SHUTDOWN_SIGS (sigmask(SIGHUP))
#ifdef DEBUG
#define SENSE_TIMEOUT SCSI_TIMEOUT
@@ -1074,7 +1086,10 @@
}
else
{
- return FAILED;
+ /*
+ * No more retries - report this one back to upper level.
+ */
+ return SUCCESS;
}
}
@@ -1947,7 +1962,9 @@
current->fs = fs;
atomic_inc(&fs->count);
- siginitsetinv(&current->blocked, SHUTDOWN_SIGS);
+ if( host->loaded_as_module ) {
+ siginitsetinv(&current->blocked, SHUTDOWN_SIGS);
+ }
/*
@@ -1975,10 +1992,14 @@
* trying to unload a module.
*/
SCSI_LOG_ERROR_RECOVERY(1,printk("Error handler sleeping\n"));
- down_interruptible (&sem);
-
- if (signal_pending(current) )
- break;
+ if( host->loaded_as_module ) {
+ down_interruptible(&sem);
+
+ if (signal_pending(current))
+ break;
+ } else {
+ down(&sem);
+ }
SCSI_LOG_ERROR_RECOVERY(1,printk("Error handler waking up\n"));