Kurt Garloff, the current maintainer of the tmscsim driver, and
I have discussed the multi-LUN changer problem by e-mail.
I would like to report the current status of debugging.
(seltimeout in tmscsim is kept per target, not per LUN,
so it does not seem to be directly related to the problem.
Also, the reset loop that holds the lock and blocks IRQs was discussed
by Kurt and Eric elsewhere. This latter problem seems to disappear,
at least when we move to the new EH code. See below.)
We have theorised a few things, but the apparent difference between
tmscsim and BusLogic (which successfully handled the same MBR-7)
is the use of the OLD versus the NEW error-handling code:
tmscsim used the old eh while BusLogic used the new eh.
(Our theory is that the path taken by the old eh code
has not been looked at closely during recent kernel upgrades
and may have hidden problems.)
So Kurt produced a very preliminary patch against the tmscsim driver to
finally use the new eh.
I patched my driver and tested the operation.
Our initial guess that going with the new eh code
would lessen the problem I saw seems to be justified.
My test:
step 1. Mount the lun0 CD on /mnt1 (the free Solaris 7 Japanese edition CD).
        Mount the lun1 CD on /mnt2 (a free CD that came with a magazine).
step 2. Run ls -lR against /mnt1 in one window, and
        try to run ls -lR /mnt2 in another window.
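For anyone who wants to repeat step 2 without juggling two windows,
here is a rough sketch of how the two commands could be started at the
same time from a script. It is only an illustration, not what I actually
ran: the helper names run_one and run_concurrently are made up for this
sketch, and it assumes the CDs are already mounted as in step 1.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_one(cmd):
    """Run one command, discard its stdout, and return (exit code, stderr text).

    Hypothetical helper for this sketch; cmd is an argument list
    such as ["ls", "-lR", "/mnt1"].
    """
    done = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.PIPE)
    return done.returncode, done.stderr.decode(errors="replace")


def run_concurrently(commands):
    """Run all commands in parallel, one thread per command.

    Returns a list of (exit code, stderr text) tuples in the same order
    as the input, so the result for each mount point can be checked
    separately.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(commands))) as pool:
        return list(pool.map(run_one, commands))
```

For the test above, one would call
run_concurrently([["ls", "-lR", "/mnt1"], ["ls", "-lR", "/mnt2"]])
and compare the two exit codes. A non-zero code with I/O errors in the
stderr text for /mnt2 only would match what I describe below.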
Previously, in step 2, a few seconds after the second ls -lR began to run,
the MBR-7 changer went into a RESET loop and
I had to kill the ls processes with control-C.
(Even earlier, in 2.1.xy and 2.0.3z, the system locked up solid!,
but I digress.)
This time, with the very first attempt to use the new eh code in the
tmscsim driver, the long-running "ls -lR" against /mnt1 in step 2 ran
to completion!
Well, the second "ls -lR" against /mnt2
or even "find /mnt2 -type f -print | xargs egrep test"
produced I/O errors and such, but only the commands in the
second window were affected and the first "ls -lR /mnt1" kept
on running. Previously the continuous repetition of RESET
rendered the system hard to use until I killed the ls processes.
This is a major improvement.
So, I think we can safely say that the new EH code
seems to be the way to go (which, I believe, is what Eric has been
telling the driver authors for quite a long while).
Although a problem still remains with the handling of luns (after all, we
still see I/O errors and such on the second CD), I think
we have made great progress in a couple of days.
My guess (and I think this is also Kurt's guess) is that
the path taken by the old eh triggers bus resets too easily and
is not well structured to avoid excessive repetition. The new eh code
seems to be much more robust in this regard.
Now we have to find out why and where the further difference lies
between tmscsim and BusLogic.
MBR-7 is asynchronous whereas Brendan's CD unit turned out to be
synchronous. However, since the bus was essentially clear except
for MBR-7, this should not be much of a problem. Or should it?
MBR-7 seems to trigger hidden, dormant bugs, and I have
a faulty hard disk to test tmscsim with the new error handling code
once we solve this MBR-7 lun problem! :-)
Happy Hacking
Chiaki Ishikawa