Gerard,

Thanks for your reply.
It does look like the sym53c8xx driver works with the newer
completion code... at least when no errors happen. The
corruption problem I saw appears after some error recovery. I need
to explore this more, but bad things happen when a reset occurs
while requests are queued in a device. Those requests are incorrectly
failed, and the upper layers don't deal with that too well.

more below...

<>< Lance.


Gerard Roudier wrote:
> On Wed, 18 Aug 1999, D. Lance Robinson wrote:
> >
> > I am seeing stack overflows when doing rigorous error testing while
> > using the sym53c8xx driver. This would not happen if the driver was
> > using the newer scsi error handling code. The newer code uses a queue
> > with a bottom half driver and this will prevent the error cases that I'm
> > seeing.
> 

No more overflows once item (1) below is fixed, and the newer
completion/error handling code is used. But as stated, other problems
lurk.
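
To illustrate why a queue drained by a bottom half avoids the overflow,
here is a minimal user-space sketch. It is not the kernel code: struct cmd,
defer_done(), queue_command() and drain_completions() are hypothetical
stand-ins I made up for the example. The point is only that a completion
is recorded rather than processed on the caller's stack, so a command that
keeps failing is handled by a loop instead of nested calls.

```c
#include <stddef.h>

#define MAX_PENDING 64

/* Hypothetical stand-in for a SCSI command; not the kernel's structure. */
struct cmd {
    int id;
    int retries_left;
};

static struct cmd *pending[MAX_PENDING];
static size_t head, tail;   /* ring buffer of commands awaiting completion work */
static int completed;       /* commands finally given up on */

/* Called by the "driver": records the completion, never calls upward. */
static int defer_done(struct cmd *c)
{
    size_t next = (tail + 1) % MAX_PENDING;

    if (next == head)
        return -1;          /* queue full */
    pending[tail] = c;
    tail = next;
    return 0;
}

/* Stand-in for a driver that rejects every command (e.g. device gone). */
static void queue_command(struct cmd *c)
{
    defer_done(c);
}

/* Stand-in for the bottom half: drains the queue iteratively. */
static void drain_completions(void)
{
    while (head != tail) {
        struct cmd *c = pending[head];

        head = (head + 1) % MAX_PENDING;
        if (c->retries_left > 0) {
            c->retries_left--;
            queue_command(c);   /* resubmission only enqueues again */
        } else {
            completed++;
        }
    }
}
```

With this shape, a command that is rejected a thousand times in a row
costs a thousand loop iterations but no extra stack frames.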


> I know what the supposedly non-obsolete Linux scsi code is doing. I just
> haven't had time yet to try it.
> Note that the driver only calls scsi_done() from its entry points, so a
> stack overflow can only occur if the caller does wrong things, such as
> recursively calling driver entry points from scsi_done() or friends.
> 
> > I have tried to switch the driver over to the newer scsi error handling
> > code but have data corruption problems.  I have modified the SYM53C8XX
> > #define (in sym53c8xx.h) to include use_new_eh_code:1, and modified
> > sym53c8xx_queue_command() so it always returns 0.  Things appear to work
> > fine for a while, but in 1-2 minutes of testing, my test code detects
> > data corruption in one of its files. The cache gets corrupted. Every
> > corruption I have looked at seems to be file based... that is, the
> > information that should go at the beginning of a file is seen at the
> > beginning of a different file.

As stated above, there is no corruption of data from the scsi code.
However, it is incorrectly failing a request because it got tangled up
with a scsi bus reset. The problem is most likely in the newer error
code. Also, the upper sd or buffer layers are dealing with the failure
incorrectly, which is what looks like data corruption.

> 
> I wonder about the status of the use_new_eh_code. It seems that
> only eata, u14-34f, aha1542, gdth and qlogicfc are using it. Btw, eata and
> u14-34f seem to allow disabling this option from the boot command line.
> 
> > ------------------------
> > Notes on what I am doing to generate the stack overflow errors...
> >
> > I have modified the driver slightly to handle missing devices as will
> > happen if someone yanks a drive from the bus (while in an appropriate
> > scsi backplane). The driver was changed to set a flag (bad_select) when
> > a select timeout happens. All new commands to that device, other than
> > TEST UNIT READY, will be rejected for that drive. When a TEST UNIT READY
> > command is seen, the bad_select bit is cleared.
> 
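
For readers following along, the bad_select behaviour described above can
be sketched roughly as below. This is a simplified illustration, not the
actual driver change: struct target, on_select_timeout() and
accept_command() are hypothetical names, and the real driver keeps its
per-target state elsewhere. Only the two SCSI opcodes are standard values.

```c
#include <stdbool.h>

#define TEST_UNIT_READY 0x00   /* SCSI opcode for TEST UNIT READY */
#define READ_10         0x28   /* SCSI opcode for READ(10) */

/* Hypothetical per-target state; the real driver's structures differ. */
struct target {
    bool bad_select;
};

/* Called when a selection timeout is seen for this target. */
static void on_select_timeout(struct target *t)
{
    t->bad_select = true;
}

/* Returns true if the command may be sent on to the device. */
static bool accept_command(struct target *t, unsigned char opcode)
{
    if (opcode == TEST_UNIT_READY) {
        t->bad_select = false;  /* seeing a TUR clears the flag */
        return true;
    }
    return !t->bad_select;      /* everything else is rejected while set */
}
```
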
> > This all works for our situation, except when there is a backlog of
> > commands that are queued in the sd layer. When the command is rejected
> > in the sym53c8xx driver because of a previous bad_select, that command
> > is sent to the scsi done code which gets its way up to rw_intr, then
> > requeue_sd_request, then do_sd_request, back to requeue_sd_request, to
> > scsi_do_cmd, back to sym53c8xx_queue_command. If this command is also
> > rejected, the cycle continues about 12-16 times, at which point the
> > stack overflows and the system freezes.
> 
> I see the problem here. The command is not retried, but the SCSI code
> just recursively queues numerous commands that fail.
> 
> Damned uncontrolled recursions!!!!!!!!!
> 
> Note that the recursion between do_sd_request() and requeue_sd_request()
> seems way stupid to me, with or without new_eh_code. If this highly
> stupid recursion were fixed, then it should be possible to be aware of
> the offending recursion you described and make things right.
> 
> So, the right fix is:
> 
> 1) Remove the recursion do_sd_request()/requeue_sd_request() that, in my
>    opinion should never have existed.
> 2) Make the new stuff aware of the recursion from
>    queue_command()/rw_intr() and just do nothing in that situation.

Item (1) has been fixed. I'll post the changes when things get settled here.
Item (2) will allow the sd driver's queue to stop feeding the lower
layers. Maybe only for a short while, but it could get stuck.
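
To make the concern with item (2) concrete, here is a small user-space
sketch of a re-entrancy guard. It is only an illustration under
assumptions: request_fn() and queue_command_fails() are hypothetical
stand-ins for the sd request function and a driver that fails every
command synchronously, and the rerun_needed flag is one possible way to
avoid the stall, not something that exists in the current code.

```c
#include <stdbool.h>

static bool in_request_fn;       /* true while the request function runs */
static bool rerun_needed;        /* a nested call was suppressed */
static int  backlog;             /* requests waiting in the hypothetical queue */
static int  failed;              /* requests completed with an error */
static int  nesting, max_nesting; /* instrumentation: call depth of request_fn */

static void request_fn(void);

/* Stand-in for a driver that rejects the command at once and calls back
 * up into the request function, as the rw_intr() path would. */
static void queue_command_fails(void)
{
    failed++;
    request_fn();
}

static void request_fn(void)
{
    nesting++;
    if (nesting > max_nesting)
        max_nesting = nesting;

    if (in_request_fn) {
        /* Item (2): recursion detected, so do nothing here... */
        rerun_needed = true;     /* ...but remember to look at the queue again */
        nesting--;
        return;
    }

    in_request_fn = true;
    do {
        rerun_needed = false;
        while (backlog > 0) {
            backlog--;
            queue_command_fails();
        }
    } while (rerun_needed);      /* without this, the queue could stall */
    in_request_fn = false;
    nesting--;
}
```

In this sketch the call depth never exceeds two, however long the backlog
is. If the guard simply returned without setting a flag that the outer
call checks, nothing would restart the queue once new requests arrive,
which is the "stuck" case mentioned above.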

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
