Re: aha1542 vs. DAT vs. kernel 2.0.x

Eric Youngdale Tue, 16 May 2000 06:03:30 -0700

----- Original Message -----
From: Matthias Andree <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, May 16, 2000 5:33 AM
Subject: Re: aha1542 vs. DAT vs. kernel 2.0.x


> "Eric Youngdale" <[EMAIL PROTECTED]> writes:
>
> >     For what it is worth, I was testing the new queueing code in the 2.3
> > kernels using a 1542 and an old scratched up CDROM that gives lots of
sector
> > errors.  For this category of errors, eventually I started getting into
the
> > timeout code, and this would abort, and ultimately reset the bus.
> > Eventually device got into such a weird state that the error handling
code
> > took the device offline.  Once this took place, the machine was back to
> > normal (the root filesystem was on a SCSI disk on a different bus
> > (aic7xxx)).
>
> Sounds good so far, can that be back-ported to 2.2, and if so, who is
> doing it? I can't, I'm not a kernel hacker currently and too short of
> time to learn that.

    All of the changes are far too complicated to back-port.  There are
limited instances of bugs which I found as I was debugging things with the
above mentioned cdrom which could be trivially brought back:

    1) There is a race condition involving timeouts whereby command
completion and timeout handling could start nearly simultaneously leading to
data corruption in the mid-layer.  The solution was to watch the return
value of del_timer() during command completion - if the return value
indicates that the timer has fired, then the rest of command completion is
skipped.  If the timer has not fired, then we can be guaranteed that the
timeout handler won't be running.  The fix for this one is relatively
simple, and can probably be safely brought back to the 2.2 kernel.
    2) There were some bugs in the dispatch function that determined how to
treat various conditions.  These could also be safely back-ported.

I am enclosing a trivial (and untested) patch which corrects both of these
problems.  I can check it sometime later this week when I get some time.

-Eric

*** scsi.h.~1~ Tue Jan  4 13:12:20 2000
--- scsi.h Tue May 16 08:48:44 2000
***************
*** 593,598 ****
--- 593,606 ----
      unsigned           host_wait:1;
      unsigned           device_wait:1;

+     /*
+      * Used to indicate that a command which has timed out also
+      * completed normally.  Typically the completion function will
+      * do nothing but set this flag in this instance because the
+      * timeout handler is already running.
+      */
+     unsigned done_late:1;
+
      /* These variables are for the cdrom only. Once we have variable size
       * buffers in the buffer cache, they will go away. */
      int                this_count;
*** scsi.c.~1~ Wed May  3 20:16:44 2000
--- scsi.c Tue May 16 08:44:06 2000
***************
*** 1625,1635 ****
  void
  scsi_done (Scsi_Cmnd * SCpnt)
  {

    /*
     * We don't have to worry about this one timing out any more.
     */
!   scsi_delete_timer(SCpnt);

    /* Set the serial numbers back to zero */
    SCpnt->serial_number = 0;
--- 1625,1649 ----
  void
  scsi_done (Scsi_Cmnd * SCpnt)
  {
+   int tstatus;

+
    /*
     * We don't have to worry about this one timing out any more.
     */
!   tstatus = scsi_delete_timer(SCpnt);
!
!   /*
!    * If we are unable to remove the timer, it means that the command
!    * has already timed out.  In this case, we have no choice but to
!    * let the timeout function run, as we have no idea where in fact
!    * that function could really be.  It might be on another processor,
!    * etc, etc.
!    */
!   if (!tstatus) {
!     SCpnt->done_late = 1;
!     return;
!   }

    /* Set the serial numbers back to zero */
    SCpnt->serial_number = 0;
*** scsi_error.c.~1~ Tue May 16 08:41:48 2000
--- scsi_error.c Tue May 16 08:46:54 2000
***************
*** 114,119 ****
--- 114,121 ----
      SCset->eh_timeout.expires = jiffies + timeout;
      SCset->eh_timeout.function = (void (*)(unsigned long))complete;

+     SCset->done_late = 0;
+
      SCSI_LOG_ERROR_RECOVERY(5,printk("Adding timer for command %p at %d
(%p)\n", SCset, timeout, complete));

      add_timer(&SCset->eh_timeout);
***************
*** 127,133 ****
   *
   * Arguments:   SCset   - command that we are canceling timer for.
   *
!  * Returns:     Amount of time remaining before command would have timed
out.
   *
   * Notes: This should be turned into an inline function.
   */
--- 129,137 ----
   *
   * Arguments:   SCset   - command that we are canceling timer for.
   *
!  * Returns:     1 if we were able to detach the timer.  0 if we
!  *              blew it, and the timer function has already started
!  *              to run.
   *
   * Notes: This should be turned into an inline function.
   */
***************
*** 136,143 ****
  {
    int rtn;

!   rtn = jiffies - SCset->eh_timeout.expires;
!   del_timer(&SCset->eh_timeout);

    SCSI_LOG_ERROR_RECOVERY(5,printk("Clearing timer for command %p\n",
SCset));

--- 140,146 ----
  {
    int rtn;

!   rtn = del_timer(&SCset->eh_timeout);

    SCSI_LOG_ERROR_RECOVERY(5,printk("Clearing timer for command %p\n",
SCset));

***************
*** 321,326 ****
--- 324,343 ----
  STATIC
  void scsi_eh_done (Scsi_Cmnd * SCpnt)
  {
+   int     rtn;
+
+   /*
+    * If the timeout handler is already running, then just set the
+    * flag which says we finished late, and return.  We have no
+    * way of stopping the timeout handler from running, so we must
+    * always defer to it.
+    */
+   rtn = del_timer(&SCpnt->eh_timeout);
+   if (!rtn) {
+     SCpnt->done_late = 1;
+     return;
+   }
+
    SCpnt->request.rq_status = RQ_SCSI_DONE;

    SCpnt->owner = SCSI_OWNER_ERROR_HANDLER;
***************
*** 622,629 ****

          SCpnt->host->eh_action = NULL;

-  del_timer(&SCpnt->eh_timeout);
-
   /*
    * See if timeout.  If so, tell the host to forget about it.
    * In other words, we don't want a callback any more.
--- 639,644 ----
***************
*** 977,986 ****
         * When the low level driver returns DID_SOFT_ERROR,
         * it is responsible for keeping an internal retry counter
         * in order to avoid endless loops (DB)
         */
      case DID_SOFT_ERROR:
!       return NEEDS_RETRY;

      case DID_BUS_BUSY:
      case DID_PARITY:
      case DID_ERROR:
--- 992,1006 ----
         * When the low level driver returns DID_SOFT_ERROR,
         * it is responsible for keeping an internal retry counter
         * in order to avoid endless loops (DB)
+        *
+        * Actually this is a bug in this function here.  We should be
+        * mindful of the maximum number of retries specified and not
+        * get stuck in a loop.
         */
      case DID_SOFT_ERROR:
!       goto maybe_retry;

+
      case DID_BUS_BUSY:
      case DID_PARITY:
      case DID_ERROR:
***************
*** 1074,1080 ****
      }
    else
      {
!       return FAILED;
      }
  }

--- 1094,1103 ----
      }
    else
      {
!       /*
!        * No more retries - report this one back to upper level.
!        */
!       return SUCCESS;
      }
  }




-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
Re: aha1542 vs. DAT vs. kernel 2.0.x

Reply via email to