subject:"Debugging scsi abort handling \?"

Re: Debugging scsi abort handling ?

2014-08-29 Thread Hannes Reinecke


On 08/29/2014 06:39 AM, Finn Thain wrote:


On Thu, 28 Aug 2014, Hannes Reinecke wrote:


What might happen, though, that the command is already dead and gone by
the time you're calling -scsi_done() (if you call it after eh_abort).
So there might not _be_ a command upon which you can call -scsi_done()
to start with.

Hence any LLDD need to clear up any internal references after a call to
eh_XXX to ensure it doesn't call -scsi_done() an in invalid command.

So even if the LLDD returns 'FAILED' upon a call to eh_XXX it _still_
needs to clear up the internal reference.


This is a question that has been bothering me too. If the host's
eh_abort_cmd() method returns FAILED, it seems the mid-layer is liable to
re-issue the same command to the LLD (?)


No.
FAILED for any eh_abort_cmd() means that the TMF hasn't been sent.
So the midlayer escalates to the next EH step.
The command will only ever be re-issued once EH completes.


Either that or return 'FAILED' for any later eh_XXX function until the
internal references can be cleared up.


So if a command may or may not exist after eh_abort_handler() returns
control to the mid-layer (regardless of SUCCESS or FAILURE), then the LLD
has to be careful about keeping track of which commands were aborted, if
those commands are still in the process of cleanup when eh_abort_handler()
returns.


Yes.


It's hard to see how that can work when command pointers are only unique
while a command exists.

Which is why we have the EH callbacks, to give the LLDD a chance to 
clean up internal references.



In effect, this would mean that EH functions cannot return at all, until
the relevant command(s) are completely forgotten by the LLD; and that
means the LLD itself may have to escalate abort - device reset - bus
reset - etc instead of simply returning FAILED.

More often than not the LLDD has its own internal command structure, 
which reference the midlayer SCSI command structure via a pointer.

Just clearing that pointer will do the trick.

Take eg. lpfc:
It'll construct its internal command here:

lpfc_cmd = lpfc_get_scsi_buf(phba, ndlp);
if (lpfc_cmd == NULL) {
lpfc_rampdown_queue_depth(phba);

lpfc_printf_vlog(vport, KERN_INFO, LOG_FCP,
 0707 driver's buffer pool is empty, 
 IO busied\n);
goto out_host_busy;
}

/*
 * Store the midlayer's command structure for the
 * completion phase
 * and complete the command initialization.
 */
lpfc_cmd-pCmd  = cmnd;
lpfc_cmd-rdata = rdata;
lpfc_cmd-timeout = 0;
lpfc_cmd-start_time = jiffies;
cmnd-host_scribble = (unsigned char *)lpfc_cmd;

and then checks for the pointer upon command completion:

static void
lpfc_scsi_cmd_iocb_cmpl(struct lpfc_hba *phba, struct lpfc_iocbq 
*pIocbIn,

struct lpfc_iocbq *pIocbOut)
{
struct lpfc_scsi_buf *lpfc_cmd =
(struct lpfc_scsi_buf *) pIocbIn-context1;

[ .. ]
/* Sanity check on return of outstanding command */
if (!(lpfc_cmd-pCmd))
return;

But indeed, 'FAILED' is not very meaningful here, leaving the 
midlayer with no information about what happened to the command.


Personally I would like to enforce this meaning on the eh_XXX callbacks:
- upon each eh_XXX callback the LLDD clears any internal references
  to the command / command scope (ie eh_abort_cmd clears the
  references to the command, eh_lun_reset clears all internal
  references to commands to this ITL nexus etc.)
  This happens irrespective of the return code.
- The eh_XXX callback shall return 'FAILED' if the respective
  TMF (or equivalent) could not be initiated.
- The eh_XXX callback shall return 'SUCCESS' if the respective
  TMF (or equvalent) could be initiated.
- After each eh_XXX callback control for this command / command
  scope is transferred back to the midlayer; the LLDD shall not
  assume the associated command structures to remain valid after
  that point.

I'm tempted to enshrine this in the documentation;
that surely will help me during the EH cleanup.
And Hans will have some guidelines on how to design uas EH :-)

Cheers,

Hannes
--
Dr. Hannes Reinecke   zSeries  Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Debugging scsi abort handling ?

2014-08-29 Thread Paolo Bonzini

Il 29/08/2014 08:08, Hannes Reinecke ha scritto:

 No.
 FAILED for any eh_abort_cmd() means that the TMF hasn't been sent.
 So the midlayer escalates to the next EH step.
 The command will only ever be re-issued once EH completes.

Then the answer to Hans's question is yes.  It is legal to call
-scsi_done() after the eh_abort handler returns FAILED.

Paolo
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Debugging scsi abort handling ?

2014-08-29 Thread Finn Thain


On Fri, 29 Aug 2014, Hannes Reinecke wrote:

 On 08/29/2014 06:39 AM, Finn Thain wrote:
 
  On Thu, 28 Aug 2014, Hannes Reinecke wrote:
 
   What might happen, though, that the command is already dead and gone 
   by the time you're calling -scsi_done() (if you call it after 
   eh_abort). So there might not _be_ a command upon which you can call
   -scsi_done() to start with.
  
   Hence any LLDD need to clear up any internal references after a call 
   to eh_XXX to ensure it doesn't call -scsi_done() an in invalid 
   command.
  
   So even if the LLDD returns 'FAILED' upon a call to eh_XXX it 
   _still_ needs to clear up the internal reference.
 
  This is a question that has been bothering me too. If the host's 
  eh_abort_cmd() method returns FAILED, it seems the mid-layer is liable 
  to re-issue the same command to the LLD (?)
 
 No.
 FAILED for any eh_abort_cmd() means that the TMF hasn't been sent.

Makes sense, though it appears to contradict this advice about returning 
SUCCESS in some situations: 
http://marc.info/?l=linux-scsim=140923498632496w=2

 The command will only ever be re-issued once EH completes.

...

 
 But indeed, 'FAILED' is not very meaningful here, leaving the midlayer 
 with no information about what happened to the command.
 
 Personally I would like to enforce this meaning on the eh_XXX callbacks:
 - upon each eh_XXX callback the LLDD clears any internal references
   to the command / command scope (ie eh_abort_cmd clears the
   references to the command, eh_lun_reset clears all internal
   references to commands to this ITL nexus etc.)
   This happens irrespective of the return code.
 - The eh_XXX callback shall return 'FAILED' if the respective
   TMF (or equivalent) could not be initiated.
 - The eh_XXX callback shall return 'SUCCESS' if the respective
   TMF (or equvalent) could be initiated.
 - After each eh_XXX callback control for this command / command
   scope is transferred back to the midlayer; the LLDD shall not
   assume the associated command structures to remain valid after
   that point.

Perhaps that last constraint should be relaxed to After the final EH 
callback (whether implemented or unimplemented by the host), command / 
command scope is transferred back to the midlayer...

A more severe TMF is probably mandatory (e.g. bus reset) but if the driver 
author later added a milder one (e.g. bus device reset), your rule would 
mean that the existing handler would then operate under new constraints, 
which might cause surprises.

 [...] I'm tempted to enshrine this in the documentation;

It is helpful, thanks.

-- 
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Debugging scsi abort handling ?

2014-08-29 Thread Hannes Reinecke


On 08/29/2014 12:14 PM, Finn Thain wrote:


On Fri, 29 Aug 2014, Hannes Reinecke wrote:


On 08/29/2014 06:39 AM, Finn Thain wrote:


On Thu, 28 Aug 2014, Hannes Reinecke wrote:


What might happen, though, that the command is already dead and gone
by the time you're calling -scsi_done() (if you call it after
eh_abort). So there might not _be_ a command upon which you can call
-scsi_done() to start with.

Hence any LLDD need to clear up any internal references after a call
to eh_XXX to ensure it doesn't call -scsi_done() an in invalid
command.

So even if the LLDD returns 'FAILED' upon a call to eh_XXX it
_still_ needs to clear up the internal reference.


This is a question that has been bothering me too. If the host's
eh_abort_cmd() method returns FAILED, it seems the mid-layer is liable
to re-issue the same command to the LLD (?)


No.
FAILED for any eh_abort_cmd() means that the TMF hasn't been sent.


Makes sense, though it appears to contradict this advice about returning
SUCCESS in some situations:
http://marc.info/?l=linux-scsim=140923498632496w=2

Well, if the LLDD detects an invalid command (ie if it cannot find 
any internal command matching the midlayer command) that's an 
automatic success, obviously.


So we should rephrase things to:

- The eh_XXX callback shall return 'SUCCESS' if the respective
  TMF (or equvalent) could be initiated or if the matching command
  reference has already been completed by the LLDD. Otherwise
  the eh_XXX callback shall return 'FAILED'.


The command will only ever be re-issued once EH completes.


...



But indeed, 'FAILED' is not very meaningful here, leaving the midlayer
with no information about what happened to the command.

Personally I would like to enforce this meaning on the eh_XXX callbacks:
- upon each eh_XXX callback the LLDD clears any internal references
   to the command / command scope (ie eh_abort_cmd clears the
   references to the command, eh_lun_reset clears all internal
   references to commands to this ITL nexus etc.)
   This happens irrespective of the return code.
- The eh_XXX callback shall return 'FAILED' if the respective
   TMF (or equivalent) could not be initiated.
- The eh_XXX callback shall return 'SUCCESS' if the respective
   TMF (or equvalent) could be initiated.
- After each eh_XXX callback control for this command / command
   scope is transferred back to the midlayer; the LLDD shall not
   assume the associated command structures to remain valid after
   that point.


Perhaps that last constraint should be relaxed to After the final EH
callback (whether implemented or unimplemented by the host), command /
command scope is transferred back to the midlayer...


No, that's wrong.

By the time any eh_XXX callbacks are triggered control _is_ already 
back at the midlayer. IE the command timeout triggered and the block 
layer already set the REQ_ATOM_COMPLETED flag, short-circuiting any 
attempts to call -scsi_done().
So with the callbacks the midlayer actually informs the LLDD about a 
certain fact; there is nothing the LLDD can do to change ownership 
at that point.


(Correction: During the call of any eh_XXX callbacks control _is_ 
back at the LLDD, otherwise the callbacks would be pointless. It's

just that the LLDD shouldn't assume the command is valid _after_
any of the eh_XXX callbacks has terminated.)


A more severe TMF is probably mandatory (e.g. bus reset) but if the driver
author later added a milder one (e.g. bus device reset), your rule would
mean that the existing handler would then operate under new constraints,
which might cause surprises.


Well, _if_ we were to adopt this rule we obviously have to audit
existing LLDDs if the rule is followed, and tweak them if not.

Cheers,

Hannes
--
Dr. Hannes Reinecke   zSeries  Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Debugging scsi abort handling ?

2014-08-29 Thread Hans de Goede

Hi,

On 08/29/2014 12:30 PM, Hannes Reinecke wrote:
 On 08/29/2014 12:14 PM, Finn Thain wrote:

 On Fri, 29 Aug 2014, Hannes Reinecke wrote:

 On 08/29/2014 06:39 AM, Finn Thain wrote:

 On Thu, 28 Aug 2014, Hannes Reinecke wrote:

 What might happen, though, that the command is already dead and gone
 by the time you're calling -scsi_done() (if you call it after
 eh_abort). So there might not _be_ a command upon which you can call
 -scsi_done() to start with.

 Hence any LLDD need to clear up any internal references after a call
 to eh_XXX to ensure it doesn't call -scsi_done() an in invalid
 command.

 So even if the LLDD returns 'FAILED' upon a call to eh_XXX it
 _still_ needs to clear up the internal reference.

 This is a question that has been bothering me too. If the host's
 eh_abort_cmd() method returns FAILED, it seems the mid-layer is liable
 to re-issue the same command to the LLD (?)

 No.
 FAILED for any eh_abort_cmd() means that the TMF hasn't been sent.

 Makes sense, though it appears to contradict this advice about returning
 SUCCESS in some situations:
 http://marc.info/?l=linux-scsim=140923498632496w=2

 Well, if the LLDD detects an invalid command (ie if it cannot find any 
 internal command matching the midlayer command) that's an automatic success, 
 obviously.
 
 So we should rephrase things to:
 
 - The eh_XXX callback shall return 'SUCCESS' if the respective
   TMF (or equvalent) could be initiated or if the matching command
   reference has already been completed by the LLDD. Otherwise
   the eh_XXX callback shall return 'FAILED'.

Your talking about could be initiated, so that means that at this
point the abort does not yet have to be completed, do I get that
right? What should the LLDD then do when the abort finishes,
call eh_scsi_done on the cmnd ?

What about the abort never finishing (timeout), does the mid layer
track this, or should the LLDD do that?

Regards,

Hans
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Debugging scsi abort handling ?

2014-08-28 Thread Hannes Reinecke


On 08/25/2014 01:15 PM, Paolo Bonzini wrote:

Il 25/08/2014 12:28, Bart Van Assche ha scritto:


 From SPC-4: 7.5.8 Control mode page [ ... ] A task aborted status (TAS)
bit set to zero specifies that aborted commands shall be terminated by
the device server without any response to the application client. A TAS
bit set to one specifies that commands aborted by the actions of an I_T
nexus other than the I_T nexus on which the command was received shall
be completed with TASK ABORTED status (see SAM-5).


Note the aborted by the actions of an I_T nexus other than the I_T
nexus on which the command was received.

In practice, this means that TASK ABORTED should only happen if you use
the CLEAR TASK SET tmf and TST is not set to 001b (i.e. _not_ to per
I_T nexus) in the Control mode page.  It should never happen for a pen
drive.

Setting TASK ABORTED aside, the important part is that an abort can do
one of two things:

- complete the command, and then eh_abort should return after the driver
has noticed the completion and called the -scsi_done callback for the
Scsi_Cmnd*.

- abort the command, and then the driver should never call the
-scsi_done callback for the Scsi_Cmnd*.

In practice we rely on the latter behaviour; when -scsi_done is 
called while the command is under eh_abort _really bad things_

will happen.
As soon as eh_abort is called control is transferred back to the
SCSI midlayer, so any LLDD should never send completions for these
commands back to the midlayer.

Cheers,

Hannes
--
Dr. Hannes Reinecke   zSeries  Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Debugging scsi abort handling ?

2014-08-28 Thread Hannes Reinecke


On 08/26/2014 09:19 PM, Hans de Goede wrote:

Hi,

On 08/26/2014 08:34 PM, James Bottomley wrote:

On Tue, 2014-08-26 at 10:13 +0200, Hans de Goede wrote:

Hi,

On 08/25/2014 05:41 PM, James Bottomley wrote:

On Mon, 2014-08-25 at 13:39 +0200, Paolo Bonzini wrote:

Il 25/08/2014 13:26, Hans de Goede ha scritto:

Thanks Bart and Paolo, your insights into this are greatly
appreciated.

So with uas there are separate usb transaction for cmd, data
in, data out
and sense for each tag. At the time of abort, usually one of
data in / data
out and a sense usb transaction will be outstanding.

There already is logic in the driver to kill the data in / out
transactions
if a sense gets returned (usually with an error) before they
are done.

So if I'm reading this correctly, then on a successful abort,
the sense
transaction (if not already completed by the target) should be
cancelled as
it will never complete, correct ?


Yes.  More precisely, once the response IU comes back for the
abort,
there should be no more IUs for that task.  Figure 13
(Multiple Command
Example) in the UAS spec actually shows that.

At least, that's what it should be like in theory.  I suspect
firmware
bugs will abound in this area, but at least you shouldn't be
actively
expecting an IU for an aborted task.


Just a note on this.  The abort area in all devices is where we
have
scope for spec compliance issues.  Also in the old days abort
itself
could trigger a firmware crash on some devices (including
arrays).  The
problem is actually that the system is usually groaning because
it's out
of memory within the controller.  That actually means that
sending in
another task (the abort) causes greater pressure.  For this
reason, some
device drivers choose to skip the abort step and head straight
to reset.
For reset, you guarantee that all outstanding tasks for the unit
are
killed.


Hmm, I like this idea, given the finickiness of the abort path in
the uas
driver, and that:

1) We really have no proper way to test this
2) We already have some known issues there (we don't kill sense
urbs atm,
and I've a note somewhere about a double free on some corner
case
where an urb submit fails)
3)
3) In all the cases where I've managed to trip op an uas device
the only
thing which actually worked to recover things was doing a usb
reset
4) Aborts should not happen in the first place, so using a big
hammer
when they do is not really a big problem, instead we should
focus
on figuring out why they happen and fix the cause

I think that just dropping abort handling altogether is a good
idea actually,
so if there are no objections I'm just going to do that.

I can simply not set eh_abort_handler (aka set it to NULL), right ?


Just so you know, if you do this, error handling becomes much more
painful.  The abort handler can now handle aborting and retrying
without
pausing the host.  The reset definitely stops the host and causes
a big
and noticeable burp in processing.  However, there are hosts which
have
to do it.


I understand, but shouldn't aborts be something which rarely happens.
I guess that with a faulty drive they happen more often, but at this
point in time the uas driver's error handling problems are mostly with
dealing with timeouts during probing, which are likely caused by
the device not grokking some command we've send, and responding to
this in a bad manner (hanging mostly).

So I'm tempted to just remove the abort handling, and first focus on
getting uas stable under all conditions. Once it is stable we can
look into making it deal more graciously with errors.


Sounds like a reasonable plan.

[ .. ]


Wait, could this also be your abort problem?  In the running abort
handler, we could get a scenario where we abort a bunch of
commands at the same time.


I don't think so, the uas code does not return from eh_abort_handler
until the abort has either succeeded or timedout (for which a 3 second
timeout is used). AFAIK the scsi core will always issue multiple aborts
sequentially, right. And if the abort succeeds then the tag is free
again for the next abort. Only if an abort fails (times out, which is
what I've seen in all the error conditions which I've encountered so
far),
then will subsequent aborts, and the logical unit reset, fail due to
there not being a free tag to use for them.

If the logic for command/task abort and logical unit reset is 
similar then there is no point in implementing both.
So by what I've seen I would first implement target reset (and host 
reset :-), and worry about finer grained error control later on.


Cheers,

Hannes
--
Dr. Hannes Reinecke   zSeries  Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at

Re: Debugging scsi abort handling ?

2014-08-28 Thread Hans de Goede

Hi,

On 08/28/2014 02:10 PM, Hannes Reinecke wrote:
 On 08/26/2014 09:19 PM, Hans de Goede wrote:
 Hi,

 On 08/26/2014 08:34 PM, James Bottomley wrote:
 On Tue, 2014-08-26 at 10:13 +0200, Hans de Goede wrote:
 Hi,

 On 08/25/2014 05:41 PM, James Bottomley wrote:
 On Mon, 2014-08-25 at 13:39 +0200, Paolo Bonzini wrote:
 Il 25/08/2014 13:26, Hans de Goede ha scritto:
 Thanks Bart and Paolo, your insights into this are greatly
 appreciated.

 So with uas there are separate usb transaction for cmd, data
 in, data out
 and sense for each tag. At the time of abort, usually one of
 data in / data
 out and a sense usb transaction will be outstanding.

 There already is logic in the driver to kill the data in / out
 transactions
 if a sense gets returned (usually with an error) before they
 are done.

 So if I'm reading this correctly, then on a successful abort,
 the sense
 transaction (if not already completed by the target) should be
 cancelled as
 it will never complete, correct ?

 Yes.  More precisely, once the response IU comes back for the
 abort,
 there should be no more IUs for that task.  Figure 13
 (Multiple Command
 Example) in the UAS spec actually shows that.

 At least, that's what it should be like in theory.  I suspect
 firmware
 bugs will abound in this area, but at least you shouldn't be
 actively
 expecting an IU for an aborted task.

 Just a note on this.  The abort area in all devices is where we
 have
 scope for spec compliance issues.  Also in the old days abort
 itself
 could trigger a firmware crash on some devices (including
 arrays).  The
 problem is actually that the system is usually groaning because
 it's out
 of memory within the controller.  That actually means that
 sending in
 another task (the abort) causes greater pressure.  For this
 reason, some
 device drivers choose to skip the abort step and head straight
 to reset.
 For reset, you guarantee that all outstanding tasks for the unit
 are
 killed.

 Hmm, I like this idea, given the finickiness of the abort path in
 the uas
 driver, and that:

 1) We really have no proper way to test this
 2) We already have some known issues there (we don't kill sense
 urbs atm,
 and I've a note somewhere about a double free on some corner
 case
 where an urb submit fails)
 3)
 3) In all the cases where I've managed to trip op an uas device
 the only
 thing which actually worked to recover things was doing a usb
 reset
 4) Aborts should not happen in the first place, so using a big
 hammer
 when they do is not really a big problem, instead we should
 focus
 on figuring out why they happen and fix the cause

 I think that just dropping abort handling altogether is a good
 idea actually,
 so if there are no objections I'm just going to do that.

 I can simply not set eh_abort_handler (aka set it to NULL), right ?

 Just so you know, if you do this, error handling becomes much more
 painful.  The abort handler can now handle aborting and retrying
 without
 pausing the host.  The reset definitely stops the host and causes
 a big
 and noticeable burp in processing.  However, there are hosts which
 have
 to do it.

 I understand, but shouldn't aborts be something which rarely happens.
 I guess that with a faulty drive they happen more often, but at this
 point in time the uas driver's error handling problems are mostly with
 dealing with timeouts during probing, which are likely caused by
 the device not grokking some command we've send, and responding to
 this in a bad manner (hanging mostly).

 So I'm tempted to just remove the abort handling, and first focus on
 getting uas stable under all conditions. Once it is stable we can
 look into making it deal more graciously with errors.

 Sounds like a reasonable plan.
 
 [ .. ]

 Wait, could this also be your abort problem?  In the running abort
 handler, we could get a scenario where we abort a bunch of
 commands at the same time.

 I don't think so, the uas code does not return from eh_abort_handler
 until the abort has either succeeded or timedout (for which a 3 second
 timeout is used). AFAIK the scsi core will always issue multiple aborts
 sequentially, right. And if the abort succeeds then the tag is free
 again for the next abort. Only if an abort fails (times out, which is
 what I've seen in all the error conditions which I've encountered so
 far),
 then will subsequent aborts, and the logical unit reset, fail due to
 there not being a free tag to use for them.

 If the logic for command/task abort and logical unit reset is similar then 
 there is no point in implementing both.

The logic is different in that when an abort gets issued other commands may
(on paper) still complete normally, where as with a lun reset all commands
on that lun (and usually there is only one lun) are toast.

 So by what I've seen I would first implement target reset (and host reset :-)

With uas target == host, and that is already implemented and so far seems
to be the only thing which actually

Re: Debugging scsi abort handling ?

2014-08-28 Thread Hans de Goede

Hi,

On 08/28/2014 02:17 PM, Paolo Bonzini wrote:
 Il 28/08/2014 14:04, Hannes Reinecke ha scritto:

 Setting TASK ABORTED aside, the important part is that an abort can do
 one of two things:

 - complete the command, and then eh_abort should return after the driver
 has noticed the completion and called the -scsi_done callback for the
 Scsi_Cmnd*.

 - abort the command, and then the driver should never call the
 -scsi_done callback for the Scsi_Cmnd*.

 In practice we rely on the latter behaviour; when -scsi_done is called
 while the command is under eh_abort _really bad things_
 will happen.
 As soon as eh_abort is called control is transferred back to the
 SCSI midlayer, so any LLDD should never send completions for these
 commands back to the midlayer.
 
 No, this is wrong.  I think we have sorted it out a couple of months
 ago.  virtio-scsi for example (due to QEMU quirks) will do the former
 more often than not.
 
 Ignoring scsi_eh_done which is just as harmless, 

scsi_eh_done is new to me. Should I ever use that, and if so when ?

 -scsi_done does
 nothing more than calling blk_complete_request.  If the command is under
 abort, it has already been marked as complete by the block layer's
 timeout timer---see blk_rq_timed_out_timer and blk_rq_check_expired---or
 by blk_abort_request.
 
 Then, blk_complete_request will do nothing because its call to
 blk_mark_rq_complete returns true.
 
 All this, of course, as long as -scsi_done is called _before_ eh_abort
 returns.

What about calling scsi_done after eh_abort if eh_abort returned FAILED?

Regards,

Hans
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Debugging scsi abort handling ?

2014-08-28 Thread Martin Peschke

On Thu, 2014-08-28 at 14:04 +0200, Hannes Reinecke wrote:
 On 08/25/2014 01:15 PM, Paolo Bonzini wrote:
  - abort the command, and then the driver should never call the
  -scsi_done callback for the Scsi_Cmnd*.
 
 In practice we rely on the latter behaviour; when -scsi_done is 
 called while the command is under eh_abort _really bad things_
 will happen.
 As soon as eh_abort is called control is transferred back to the
 SCSI midlayer, so any LLDD should never send completions for these
 commands back to the midlayer.

Mmh, then there is a small race window starting at the point in time
when the abort function of an LLDD is called up to the point in time
when that abort function has taken the necessary measures to make sure
that scsi_done won't be called for a racing command completion , isn't
it?

Cheers
Martin

-- 
Linux on System z Development

IBM Deutschland Research  Development GmbH
Vorsitzende des Aufsichtsrats: Martina Koederitz
Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen
Registergericht: Amtsgericht Stuttgart, HRB 243294

--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Debugging scsi abort handling ?

2014-08-28 Thread Paolo Bonzini

Il 28/08/2014 14:26, Hans de Goede ha scritto:
  Then, blk_complete_request will do nothing because its call to
  blk_mark_rq_complete returns true.
  
  All this, of course, as long as -scsi_done is called _before_ eh_abort
  returns.
 What about calling scsi_done after eh_abort if eh_abort returned FAILED?

I invoke the fifth amendment. :)

Paolo
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Debugging scsi abort handling ?

2014-08-28 Thread Hans de Goede

Hi,

On 08/28/2014 02:33 PM, Paolo Bonzini wrote:
 Il 28/08/2014 14:26, Hans de Goede ha scritto:
 Then, blk_complete_request will do nothing because its call to
 blk_mark_rq_complete returns true.

 All this, of course, as long as -scsi_done is called _before_ eh_abort
 returns.
 What about calling scsi_done after eh_abort if eh_abort returned FAILED?
 
 I invoke the fifth amendment. :)

Although I appreciate the tongue in cheek answer, this was sort of a serious
question, as at the moment this may happen with the uas driver.

Regards,

Hans
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Debugging scsi abort handling ?

2014-08-28 Thread James Bottomley

On Thu, 2014-08-28 at 14:37 +0200, Hans de Goede wrote:
 Hi,
 
 On 08/28/2014 02:33 PM, Paolo Bonzini wrote:
  Il 28/08/2014 14:26, Hans de Goede ha scritto:
  Then, blk_complete_request will do nothing because its call to
  blk_mark_rq_complete returns true.
 
  All this, of course, as long as -scsi_done is called _before_ eh_abort
  returns.
  What about calling scsi_done after eh_abort if eh_abort returned FAILED?
  
  I invoke the fifth amendment. :)
 
 Although I appreciate the tongue in cheek answer, this was sort of a serious
 question, as at the moment this may happen with the uas driver.

The answer is no.  It's part of the internal mid layer functions you
don't need to know about.  You just call the command done functions and
try not to care what actual function they point to.

James


--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Debugging scsi abort handling ?

2014-08-28 Thread James Bottomley

On Thu, 2014-08-28 at 14:21 +0200, Hans de Goede wrote:
 Hi,
 
 On 08/28/2014 02:04 PM, Hannes Reinecke wrote:
  On 08/25/2014 01:15 PM, Paolo Bonzini wrote:
  Il 25/08/2014 12:28, Bart Van Assche ha scritto:
 
   From SPC-4: 7.5.8 Control mode page [ ... ] A task aborted status (TAS)
  bit set to zero specifies that aborted commands shall be terminated by
  the device server without any response to the application client. A TAS
  bit set to one specifies that commands aborted by the actions of an I_T
  nexus other than the I_T nexus on which the command was received shall
  be completed with TASK ABORTED status (see SAM-5).
 
  Note the aborted by the actions of an I_T nexus other than the I_T
  nexus on which the command was received.
 
  In practice, this means that TASK ABORTED should only happen if you use
  the CLEAR TASK SET tmf and TST is not set to 001b (i.e. _not_ to per
  I_T nexus) in the Control mode page.  It should never happen for a pen
  drive.
 
  Setting TASK ABORTED aside, the important part is that an abort can do
  one of two things:
 
  - complete the command, and then eh_abort should return after the driver
  has noticed the completion and called the -scsi_done callback for the
  Scsi_Cmnd*.
 
  - abort the command, and then the driver should never call the
  -scsi_done callback for the Scsi_Cmnd*.
 
  In practice we rely on the latter behaviour; when -scsi_done is called 
  while the command is under eh_abort _really bad things_
  will happen.
 
 Interesting, those very bad things may very well be exactly the things
 some uas users are seeing. But this sounds racy, I can stop a command
 from completing as the very first thing inside eh_abort, but I cannot
 stop it from completing when the scsi core is getting ready to call
 eh_abort, but eh_abort is not yet called.
 
 Is there some flag I should check before calling scsi_done to avoid this
 race? And if so which locks should I hold (and why does scsi_done not do
 this check itself) ?
 
  As soon as eh_abort is called control is transferred back to the
  SCSI midlayer, so any LLDD should never send completions for these
  commands back to the midlayer.
 
 I'm fine with not calling scsi_done from eh_abort, but I cannot guarantee
 that another thread will not complete the cmnd in the mean time before hand.

This is expected.  After error handling begins, the block layer ignores
all done callbacks for commands under EH.  You can get the situation
where we try to abort (or do other EH) on a command you think you've
already completed, but you just return success for that.

James


--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Debugging scsi abort handling ?

2014-08-28 Thread Hannes Reinecke


On 08/28/2014 02:37 PM, Hans de Goede wrote:

Hi,

On 08/28/2014 02:33 PM, Paolo Bonzini wrote:

Il 28/08/2014 14:26, Hans de Goede ha scritto:

Then, blk_complete_request will do nothing because its call to
blk_mark_rq_complete returns true.

All this, of course, as long as -scsi_done is called _before_ eh_abort
returns.

What about calling scsi_done after eh_abort if eh_abort returned FAILED?


I invoke the fifth amendment. :)


Although I appreciate the tongue in cheek answer, this was sort of a serious
question, as at the moment this may happen with the uas driver.


As mentioned earlier, as soon as SCSI EH is invoked control
is assumed to be transferred back to the SCSI midlayer.
How the midlayer interprets any return value from the various eh_XX
callbacks is immaterial to the LLDD.

So even if the eh_abort returns FAILED control is still with the SCSI 
midlayer, so the earlier statements apply.
IE the command will be short-circuited by the block layer anyway if 
-scsi_done() is called.


Cheers,

Hannes
--
Dr. Hannes Reinecke   zSeries  Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Debugging scsi abort handling ?

2014-08-28 Thread Hannes Reinecke


On 08/28/2014 02:31 PM, Martin Peschke wrote:

On Thu, 2014-08-28 at 14:04 +0200, Hannes Reinecke wrote:

On 08/25/2014 01:15 PM, Paolo Bonzini wrote:

- abort the command, and then the driver should never call the
-scsi_done callback for the Scsi_Cmnd*.


In practice we rely on the latter behaviour; when -scsi_done is
called while the command is under eh_abort _really bad things_
will happen.
As soon as eh_abort is called control is transferred back to the
SCSI midlayer, so any LLDD should never send completions for these
commands back to the midlayer.


Mmh, then there is a small race window starting at the point in time
when the abort function of an LLDD is called up to the point in time
when that abort function has taken the necessary measures to make sure
that scsi_done won't be called for a racing command completion , isn't
it?

No, that's fine. The timeout function is under control of the block 
layer, which sets an atomic flag on the request before calling the 
timeout function.
And -scsi_done() essentially just calls 'blk_complete_request();, which 
checks the very same flag before raising the block soft irq.


Cheers,

Hannes
--
Dr. Hannes Reinecke   zSeries  Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Debugging scsi abort handling ?

2014-08-28 Thread Paolo Bonzini

Il 28/08/2014 16:17, Hannes Reinecke ha scritto:

 As mentioned earlier, as soon as SCSI EH is invoked control
 is assumed to be transferred back to the SCSI midlayer.
 How the midlayer interprets any return value from the various eh_XX
 callbacks is immaterial to the LLDD.
 
 So even if the eh_abort returns FAILED control is still with the SCSI
 midlayer, so the earlier statements apply.
 IE the command will be short-circuited by the block layer anyway if
 -scsi_done() is called.

As I parsed it, the question is not whether the short-circuiting will
happen.  It's whether you will have use-after-free bugs or not if you
call -scsi_done() after eh_abort returns FAILED.

Paolo
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Debugging scsi abort handling ?

2014-08-28 Thread Hannes Reinecke


On 08/28/2014 04:56 PM, Paolo Bonzini wrote:

Il 28/08/2014 16:17, Hannes Reinecke ha scritto:



As mentioned earlier, as soon as SCSI EH is invoked control
is assumed to be transferred back to the SCSI midlayer.
How the midlayer interprets any return value from the various eh_XX
callbacks is immaterial to the LLDD.

So even if the eh_abort returns FAILED control is still with the SCSI
midlayer, so the earlier statements apply.
IE the command will be short-circuited by the block layer anyway if
-scsi_done() is called.


As I parsed it, the question is not whether the short-circuiting will
happen.  It's whether you will have use-after-free bugs or not if you
call -scsi_done() after eh_abort returns FAILED.

Paolo


No. Once eh_abort is called control is back with the SCSI midlayer.
(Read: REQ_ATOM_COMPLETE is set in req-atomic_flags).
So you can call -scsi_done() at your hearts content and nothing will 
happen.
What might happen, though, that the command is already dead and gone by 
the time you're calling -scsi_done() (if you call it after eh_abort).

So there might not _be_ a command upon which you can call -scsi_done()
to start with.

Hence any LLDD need to clear up any internal references after a call to 
eh_XXX to ensure it doesn't call -scsi_done() an in invalid command.


So even if the LLDD returns 'FAILED' upon a call to eh_XXX it _still_ 
needs to clear up the internal reference.
Either that or return 'FAILED' for any later eh_XXX function until the 
internal references can be cleared up.


Cheers,

Hannes
--
Dr. Hannes Reinecke   zSeries  Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: Debugging scsi abort handling ?

2014-08-28 Thread Elliott, Robert (Server Storage)

 -Original Message-
 From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
 ow...@vger.kernel.org] On Behalf Of Hannes Reinecke
 Sent: Thursday, 28 August, 2014 10:13 AM
 To: Paolo Bonzini; Hans de Goede; Bart Van Assche; SCSI development
 list
 Subject: Re: Debugging scsi abort handling ?

 On 08/28/2014 04:56 PM, Paolo Bonzini wrote:
  Il 28/08/2014 16:17, Hannes Reinecke ha scritto:

  As mentioned earlier, as soon as SCSI EH is invoked control
  is assumed to be transferred back to the SCSI midlayer.
  How the midlayer interprets any return value from the various
 eh_XX
  callbacks is immaterial to the LLDD.

  So even if the eh_abort returns FAILED control is still with the
 SCSI
  midlayer, so the earlier statements apply.
  IE the command will be short-circuited by the block layer anyway
 if
  -scsi_done() is called.

  As I parsed it, the question is not whether the short-circuiting
 will
  happen.  It's whether you will have use-after-free bugs or not if
 you
  call -scsi_done() after eh_abort returns FAILED.

  Paolo

 No. Once eh_abort is called control is back with the SCSI midlayer.
 (Read: REQ_ATOM_COMPLETE is set in req-atomic_flags).
 So you can call -scsi_done() at your hearts content and nothing will
 happen.
 What might happen, though, that the command is already dead and gone
 by
 the time you're calling -scsi_done() (if you call it after
 eh_abort).
 So there might not _be_ a command upon which you can call -
 scsi_done()
 to start with.

 Hence any LLDD need to clear up any internal references after a call
 to
 eh_XXX to ensure it doesn't call -scsi_done() an in invalid command.

 So even if the LLDD returns 'FAILED' upon a call to eh_XXX it _still_
 needs to clear up the internal reference.
 Either that or return 'FAILED' for any later eh_XXX function until
 the
 internal references can be cleared up.

Is the block layer prevented from issuing a new command with the
same tag before the error handling is finished?

---
Rob ElliottHP Server Storage

Re: Debugging scsi abort handling ?

2014-08-28 Thread Paolo Bonzini

Il 28/08/2014 17:50, Elliott, Robert (Server Storage) ha scritto:
 Is the block layer prevented from issuing a new command with the
 same tag before the error handling is finished?

Tags are chosen by the LLDs, so it's up to it to pick the right tags.

Paolo
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Debugging scsi abort handling ?

2014-08-28 Thread Christoph Hellwig

On Thu, Aug 28, 2014 at 05:54:50PM +0200, Paolo Bonzini wrote:
 Il 28/08/2014 17:50, Elliott, Robert (Server Storage) ha scritto:
  Is the block layer prevented from issuing a new command with the
  same tag before the error handling is finished?
 
 Tags are chosen by the LLDs, so it's up to it to pick the right tags.

When using scsi-mq or the older block level tagging implementation they
aren't.

--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Debugging scsi abort handling ?

2014-08-28 Thread Finn Thain


On Thu, 28 Aug 2014, James Bottomley wrote:

  I'm fine with not calling scsi_done from eh_abort, but I cannot 
  guarantee that another thread will not complete the cmnd in the mean 
  time before hand.
 
 This is expected.  After error handling begins, the block layer ignores 
 all done callbacks for commands under EH.

This thread seems like FAQ's to me (I was looking for answers myself). It 
would be nice if the EH docs would clearly state something like, calling 
scsi_done on a command being aborted is a no-op.

I found your post to this thread helpful,
http://www.spinics.net/lists/linux-scsi/msg77305.html
although it does raise more questions.

Given that eh_abort_handler() may return SUCCESS or FAILURE, what are the 
implications for cmd-result? I took a look at esp_scsi to see if that 
would shed some light.

That driver does the following in eh_abort_handler().
cmd-result = DID_ABORT  16;
cmd-scsi_done(cmd);
...
return SUCCESS;
but only in certain circumstances. In others, it will wait for an active 
command to complete normally. eh_abort_handler() then returns SUCCESS if 
the command completes normally and quickly, and FAILURE if not (which 
seems semantically strange to me).

In that driver, the command waited upon in eh_abort_handler() will never 
have result host byte == DID_ABORT.

None of which makes much sense to me. Is it permissible to have normal 
command completion (host byte == DID_OK) and have eh_abort_handler() 
return SUCCESS?

Conversely, is it okay to set_host_byte(cmd, DID_ABORT) and then 
return FAILURE from eh_abort_handler()?

It would be nice if the docs could state unequivocally a aborted command 
is presumed to have no meaningful result (Is this true? If not, which 
bytes matter?)

Another subtlety is that the abort handler is apparently expected to 
perform autosense for an aborted command (or wait for that to happen 
normally) and only return after this has taken place (rather than 
immediately killing the command).

 You can get the situation where we try to abort (or do other EH) on a 
 command you think you've already completed, but you just return success 
 for that.

In general, the LLD cannot distinguish between a command previously 
completed and a command it has never encountered. But it seems that 
eh_abort_handler() should return SUCCESS either way.

That seems like an unintuitive interpretation of successfully aborted so 
it perhaps this needs to be documented too. The drivers I've looked at 
(esp_scsi and ncr5380) don't do this. If asked to abort some command which 
they completed in the past, they would return failure.

-- 
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Debugging scsi abort handling ?

2014-08-28 Thread Finn Thain


On Thu, 28 Aug 2014, Hannes Reinecke wrote:

 What might happen, though, that the command is already dead and gone by 
 the time you're calling -scsi_done() (if you call it after eh_abort). 
 So there might not _be_ a command upon which you can call -scsi_done() 
 to start with.
 
 Hence any LLDD need to clear up any internal references after a call to 
 eh_XXX to ensure it doesn't call -scsi_done() an in invalid command.
 
 So even if the LLDD returns 'FAILED' upon a call to eh_XXX it _still_ 
 needs to clear up the internal reference.

This is a question that has been bothering me too. If the host's 
eh_abort_cmd() method returns FAILED, it seems the mid-layer is liable to 
re-issue the same command to the LLD (?)

 Either that or return 'FAILED' for any later eh_XXX function until the 
 internal references can be cleared up.

So if a command may or may not exist after eh_abort_handler() returns 
control to the mid-layer (regardless of SUCCESS or FAILURE), then the LLD 
has to be careful about keeping track of which commands were aborted, if 
those commands are still in the process of cleanup when eh_abort_handler() 
returns.

It's hard to see how that can work when command pointers are only unique 
while a command exists.

In effect, this would mean that EH functions cannot return at all, until 
the relevant command(s) are completely forgotten by the LLD; and that 
means the LLD itself may have to escalate abort - device reset - bus 
reset - etc instead of simply returning FAILED.

-- 
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: Debugging scsi abort handling ?

2014-08-28 Thread Elliott, Robert (Server Storage)

 -Original Message-
 From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
 ow...@vger.kernel.org] On Behalf Of Finn Thain
 Sent: Thursday, August 28, 2014 11:37 PM
 To: James Bottomley
 Cc: Hans de Goede; Hannes Reinecke; Paolo Bonzini; Bart Van Assche; SCSI
 development list
 Subject: Re: Debugging scsi abort handling ?

 On Thu, 28 Aug 2014, James Bottomley wrote:
...
 Another subtlety is that the abort handler is apparently expected to
 perform autosense for an aborted command (or wait for that to happen
 normally) and only return after this has taken place (rather than
 immediately killing the command).

Sending a REQUEST SENSE and expecting to get sense data for another
command only works in protocols like parallel SCSI that supported
contingent allegiance - a feature that is obsolete in the
SCSI architecture model.  In modern SCSI protocols, that command
just returns polled sense data like the status of background tasks,
not sense data related to any particular command.

James posted a patch a while back that bypasses sending that
command and misinterpreting the result.

  You can get the situation where we try to abort (or do other EH) on a
  command you think you've already completed, but you just return
  success for that.

 In general, the LLD cannot distinguish between a command previously
 completed and a command it has never encountered. But it seems that
 eh_abort_handler() should return SUCCESS either way.

 That seems like an unintuitive interpretation of successfully aborted
 so it perhaps this needs to be documented too. The drivers I've looked at
 (esp_scsi and ncr5380) don't do this. If asked to abort some command
 which they completed in the past, they would return failure.

That is how the SCSI architecture model defines ABORT TASK,
so shouldn't be too surprising here.

I wouldn't focus too much on those old parallel SCSI drivers; SAS, Fibre
Channel, iSCSI, and SRP drivers are better examples now.

---
Rob Elliott  HP Server Storage

--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Debugging scsi abort handling ?

2014-08-26 Thread Hans de Goede

Hi,

On 08/25/2014 05:41 PM, James Bottomley wrote:
 On Mon, 2014-08-25 at 13:39 +0200, Paolo Bonzini wrote:
 Il 25/08/2014 13:26, Hans de Goede ha scritto:
 Thanks Bart and Paolo, your insights into this are greatly appreciated.

 So with uas there are separate usb transaction for cmd, data in, data out
 and sense for each tag. At the time of abort, usually one of data in / data
 out and a sense usb transaction will be outstanding.

 There already is logic in the driver to kill the data in / out transactions
 if a sense gets returned (usually with an error) before they are done.

 So if I'm reading this correctly, then on a successful abort, the sense
 transaction (if not already completed by the target) should be cancelled as
 it will never complete, correct ?

 Yes.  More precisely, once the response IU comes back for the abort,
 there should be no more IUs for that task.  Figure 13 (Multiple Command
 Example) in the UAS spec actually shows that.

 At least, that's what it should be like in theory.  I suspect firmware
 bugs will abound in this area, but at least you shouldn't be actively
 expecting an IU for an aborted task.
 
 Just a note on this.  The abort area in all devices is where we have
 scope for spec compliance issues.  Also in the old days abort itself
 could trigger a firmware crash on some devices (including arrays).  The
 problem is actually that the system is usually groaning because it's out
 of memory within the controller.  That actually means that sending in
 another task (the abort) causes greater pressure.  For this reason, some
 device drivers choose to skip the abort step and head straight to reset.
 For reset, you guarantee that all outstanding tasks for the unit are
 killed.

Hmm, I like this idea, given the finickiness of the abort path in the uas
driver, and that:

1) We really have no proper way to test this
2) We already have some known issues there (we don't kill sense urbs atm,
   and I've a note somewhere about a double free on some corner case
   where an urb submit fails)
3)
3) In all the cases where I've managed to trip op an uas device the only
   thing which actually worked to recover things was doing a usb reset
4) Aborts should not happen in the first place, so using a big hammer
   when they do is not really a big problem, instead we should focus
   on figuring out why they happen and fix the cause

I think that just dropping abort handling altogether is a good idea actually,
so if there are no objections I'm just going to do that.

I can simply not set eh_abort_handler (aka set it to NULL), right ?

###

What about a logical unit reset ? Is that worth trying first? Or is that
likely to just trip up more firmware bugs (uas == usb == cheap == lots of
those) ?

Going straight to an usb device reset would significantly simplify the code,
but is not really friendly toward other usb drivers for a multi-function
uas device (of which I've seen no so far, but they are bound to happen).

So I guess that trying a logical unit reset first would probably be a good
idea.

I know that it works in non error conditions as I've inserted one (in my
local tree only) in the probe path to test task management functions in
general. It has never really been tested under error conditions since the
uas code can only run one task management function at a time (there is 1
tag reserved for task management functions) and in all error scenarios
I've been able to test so far, the abort timed out, so the
eh_device_reset_handler would always return FAILED (-EBUSY).

So showing my greeness wrt scsi again, what should I do with outstanding
scsi commands at this time. Can I simply cancel all outstanding usb transactions
(ie data in/out, sense) for all outstanding commands on the lun in question,
wait for the cancellations to complete, and then issue the logical unit reset ?

IOW will the lun have forgotten about any outstanding scsi commands after
a successful logical unit reset ?

And what should I do with regards to calling the scsi_done function for the
outstanding commands. I assume that when eh_device_reset_handler gets called,
the scsi_done function for any outstanding commands should not be called,
correct ?

And what about a race when a command completes after all just before
eh_device_reset_handler gets called, I assume that that is handled properly
by the scsi core ?

Regards,

Hans
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Debugging scsi abort handling ?

2014-08-26 Thread James Bottomley

On Tue, 2014-08-26 at 10:13 +0200, Hans de Goede wrote:
 Hi,
 
 On 08/25/2014 05:41 PM, James Bottomley wrote:
  On Mon, 2014-08-25 at 13:39 +0200, Paolo Bonzini wrote:
  Il 25/08/2014 13:26, Hans de Goede ha scritto:
  Thanks Bart and Paolo, your insights into this are greatly appreciated.
 
  So with uas there are separate usb transaction for cmd, data in, data out
  and sense for each tag. At the time of abort, usually one of data in / 
  data
  out and a sense usb transaction will be outstanding.
 
  There already is logic in the driver to kill the data in / out 
  transactions
  if a sense gets returned (usually with an error) before they are done.
 
  So if I'm reading this correctly, then on a successful abort, the sense
  transaction (if not already completed by the target) should be cancelled 
  as
  it will never complete, correct ?
 
  Yes.  More precisely, once the response IU comes back for the abort,
  there should be no more IUs for that task.  Figure 13 (Multiple Command
  Example) in the UAS spec actually shows that.
 
  At least, that's what it should be like in theory.  I suspect firmware
  bugs will abound in this area, but at least you shouldn't be actively
  expecting an IU for an aborted task.
  
  Just a note on this.  The abort area in all devices is where we have
  scope for spec compliance issues.  Also in the old days abort itself
  could trigger a firmware crash on some devices (including arrays).  The
  problem is actually that the system is usually groaning because it's out
  of memory within the controller.  That actually means that sending in
  another task (the abort) causes greater pressure.  For this reason, some
  device drivers choose to skip the abort step and head straight to reset.
  For reset, you guarantee that all outstanding tasks for the unit are
  killed.
 
 Hmm, I like this idea, given the finickiness of the abort path in the uas
 driver, and that:
 
 1) We really have no proper way to test this
 2) We already have some known issues there (we don't kill sense urbs atm,
and I've a note somewhere about a double free on some corner case
where an urb submit fails)
 3)
 3) In all the cases where I've managed to trip op an uas device the only
thing which actually worked to recover things was doing a usb reset
 4) Aborts should not happen in the first place, so using a big hammer
when they do is not really a big problem, instead we should focus
on figuring out why they happen and fix the cause
 
 I think that just dropping abort handling altogether is a good idea actually,
 so if there are no objections I'm just going to do that.
 
 I can simply not set eh_abort_handler (aka set it to NULL), right ?

Just so you know, if you do this, error handling becomes much more
painful.  The abort handler can now handle aborting and retrying without
pausing the host.  The reset definitely stops the host and causes a big
and noticeable burp in processing.  However, there are hosts which have
to do it.

 ###
 
 What about a logical unit reset ? Is that worth trying first? Or is that
 likely to just trip up more firmware bugs (uas == usb == cheap == lots of
 those) ?
 
 Going straight to an usb device reset would significantly simplify the code,
 but is not really friendly toward other usb drivers for a multi-function
 uas device (of which I've seen no so far, but they are bound to happen).
 
 So I guess that trying a logical unit reset first would probably be a good
 idea.

Yes, that's the eh_device_reset_handler() callback.

 I know that it works in non error conditions as I've inserted one (in my
 local tree only) in the probe path to test task management functions in
 general. It has never really been tested under error conditions since the
 uas code can only run one task management function at a time (there is 1
 tag reserved for task management functions)

Wait, could this also be your abort problem?  In the running abort
handler, we could get a scenario where we abort a bunch of commands at
the same time.

  and in all error scenarios
 I've been able to test so far, the abort timed out, so the
 eh_device_reset_handler would always return FAILED (-EBUSY).
 
 So showing my greeness wrt scsi again, what should I do with outstanding
 scsi commands at this time. Can I simply cancel all outstanding usb 
 transactions
 (ie data in/out, sense) for all outstanding commands on the lun in question,
 wait for the cancellations to complete, and then issue the logical unit reset 
 ?

For successful aborts, you can cancel the aborted commands (the device
should have forgotten about them).  For a successful device (lun) reset,
you can cancel all the commands on that LUN; for a target reset you
cancel all the commands on every LUN and so on.

 IOW will the lun have forgotten about any outstanding scsi commands after
 a successful logical unit reset ?

Yes.

 And what should I do with regards to calling the scsi_done function for the
 outstanding commands. I assume that

Re: Debugging scsi abort handling ?

2014-08-26 Thread Hans de Goede


Hi,

On 08/26/2014 08:34 PM, James Bottomley wrote:

On Tue, 2014-08-26 at 10:13 +0200, Hans de Goede wrote:

Hi,

On 08/25/2014 05:41 PM, James Bottomley wrote:

On Mon, 2014-08-25 at 13:39 +0200, Paolo Bonzini wrote:

Il 25/08/2014 13:26, Hans de Goede ha scritto:

Thanks Bart and Paolo, your insights into this are greatly appreciated.

So with uas there are separate usb transaction for cmd, data in, data out
and sense for each tag. At the time of abort, usually one of data in / data
out and a sense usb transaction will be outstanding.

There already is logic in the driver to kill the data in / out transactions
if a sense gets returned (usually with an error) before they are done.

So if I'm reading this correctly, then on a successful abort, the sense
transaction (if not already completed by the target) should be cancelled as
it will never complete, correct ?


Yes.  More precisely, once the response IU comes back for the abort,
there should be no more IUs for that task.  Figure 13 (Multiple Command
Example) in the UAS spec actually shows that.

At least, that's what it should be like in theory.  I suspect firmware
bugs will abound in this area, but at least you shouldn't be actively
expecting an IU for an aborted task.


Just a note on this.  The abort area in all devices is where we have
scope for spec compliance issues.  Also in the old days abort itself
could trigger a firmware crash on some devices (including arrays).  The
problem is actually that the system is usually groaning because it's out
of memory within the controller.  That actually means that sending in
another task (the abort) causes greater pressure.  For this reason, some
device drivers choose to skip the abort step and head straight to reset.
For reset, you guarantee that all outstanding tasks for the unit are
killed.


Hmm, I like this idea, given the finickiness of the abort path in the uas
driver, and that:

1) We really have no proper way to test this
2) We already have some known issues there (we don't kill sense urbs atm,
and I've a note somewhere about a double free on some corner case
where an urb submit fails)
3)
3) In all the cases where I've managed to trip op an uas device the only
thing which actually worked to recover things was doing a usb reset
4) Aborts should not happen in the first place, so using a big hammer
when they do is not really a big problem, instead we should focus
on figuring out why they happen and fix the cause

I think that just dropping abort handling altogether is a good idea actually,
so if there are no objections I'm just going to do that.

I can simply not set eh_abort_handler (aka set it to NULL), right ?


Just so you know, if you do this, error handling becomes much more
painful.  The abort handler can now handle aborting and retrying without
pausing the host.  The reset definitely stops the host and causes a big
and noticeable burp in processing.  However, there are hosts which have
to do it.


I understand, but shouldn't aborts be something which rarely happens.
I guess that with a faulty drive they happen more often, but at this
point in time the uas driver's error handling problems are mostly with
dealing with timeouts during probing, which are likely caused by
the device not grokking some command we've send, and responding to
this in a bad manner (hanging mostly).

So I'm tempted to just remove the abort handling, and first focus on
getting uas stable under all conditions. Once it is stable we can
look into making it deal more graciously with errors.





###

What about a logical unit reset ? Is that worth trying first? Or is that
likely to just trip up more firmware bugs (uas == usb == cheap == lots of
those) ?

Going straight to an usb device reset would significantly simplify the code,
but is not really friendly toward other usb drivers for a multi-function
uas device (of which I've seen no so far, but they are bound to happen).

So I guess that trying a logical unit reset first would probably be a good
idea.


Yes, that's the eh_device_reset_handler() callback.


I know that it works in non error conditions as I've inserted one (in my
local tree only) in the probe path to test task management functions in
general. It has never really been tested under error conditions since the
uas code can only run one task management function at a time (there is 1
tag reserved for task management functions)


Wait, could this also be your abort problem?  In the running abort
handler, we could get a scenario where we abort a bunch of commands at
the same time.


I don't think so, the uas code does not return from eh_abort_handler
until the abort has either succeeded or timedout (for which a 3 second
timeout is used). AFAIK the scsi core will always issue multiple aborts
sequentially, right. And if the abort succeeds then the tag is free
again for the next abort. Only if an abort fails (times out, which is
what I've seen in all the error conditions which I've encountered so far),

Re: Debugging scsi abort handling ?

2014-08-25 Thread Paolo Bonzini

Il 23/08/2014 16:52, Hans de Goede ha scritto:
 Hi All,
 
 Now that the UAS driver is no longer marked as CONFIG_BROKEN,
 I'm getting quite a few bug reports about issues with UAS drives.
 
 One if the issues is that there might be a number of bugs in the
 abort handling path, as I don't think that was ever tested properly.
 
 So I'm wondering is there a way to test the abort path with a real
 drive? E.G. submit some command which is known to take a significant
 amount of time, and then abort it right after submitting ?

You could have some luck with QEMU's UAS emulation.  If you set QEMU's
I/O throttling options low enough (e.g. 100KB/sec), and then use dd with
iflag=direct and a largish block size (a few MBs), the guest should
abort its I/O soon enough.

Paolo
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Debugging scsi abort handling ?

2014-08-25 Thread Hans de Goede

Hi,

On 08/25/2014 09:20 AM, Paolo Bonzini wrote:
 Il 23/08/2014 16:52, Hans de Goede ha scritto:
 Hi All,

 Now that the UAS driver is no longer marked as CONFIG_BROKEN,
 I'm getting quite a few bug reports about issues with UAS drives.

 One if the issues is that there might be a number of bugs in the
 abort handling path, as I don't think that was ever tested properly.

 So I'm wondering is there a way to test the abort path with a real
 drive? E.G. submit some command which is known to take a significant
 amount of time, and then abort it right after submitting ?
 
 You could have some luck with QEMU's UAS emulation.  If you set QEMU's
 I/O throttling options low enough (e.g. 100KB/sec), and then use dd with
 iflag=direct and a largish block size (a few MBs), the guest should
 abort its I/O soon enough.

Thanks for the suggestion, that is an interesting idea. The problem is
though, that qemu's uas implementation is modeled after the kernel
uas driver (including some bugs which I've ended up fixing at both ends),

I want to see how real hardware deals with abort commands (e.g. does it
only acknowledge the abort, or does it also sends a sense code for
the actual command).

Regards,

Hans
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Debugging scsi abort handling ?

2014-08-25 Thread Bart Van Assche

On 08/25/14 10:47, Hans de Goede wrote:
 I want to see how real hardware deals with abort commands (e.g. does it
 only acknowledge the abort, or does it also sends a sense code for
 the actual command).

The SCSI specs define whether a reply should be sent if a SCSI command
has been aborted. From SAM-5: 5.3.1 Status codes [ ... ] TASK ABORTED.
This status shall be returned when a command is aborted by a command or
task management function on another I_T nexus and the Control mode page
TAS bit is set to one (see 5.6).

5.6 Aborting commands - A command is aborted when a SCSI device
condition (see 6.3), command, or task management function causes
termination of the command prior to its completion by the device server.
After a command is aborted and TASK ABORTED status, if any, is returned,
the SCSI target device shall send no further requests or responses for
that command.

From SPC-4: 7.5.8 Control mode page [ ... ] A task aborted status (TAS)
bit set to zero specifies that aborted commands shall be terminated by
the device server without any response to the application client. A TAS
bit set to one specifies that commands aborted by the actions of an I_T
nexus other than the I_T nexus on which the command was received shall
be completed with TASK ABORTED status (see SAM-5).

Bart.
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Debugging scsi abort handling ?

2014-08-25 Thread Paolo Bonzini

Il 25/08/2014 12:28, Bart Van Assche ha scritto:
 
 From SPC-4: 7.5.8 Control mode page [ ... ] A task aborted status (TAS)
 bit set to zero specifies that aborted commands shall be terminated by
 the device server without any response to the application client. A TAS
 bit set to one specifies that commands aborted by the actions of an I_T
 nexus other than the I_T nexus on which the command was received shall
 be completed with TASK ABORTED status (see SAM-5).

Note the aborted by the actions of an I_T nexus other than the I_T
nexus on which the command was received.

In practice, this means that TASK ABORTED should only happen if you use
the CLEAR TASK SET tmf and TST is not set to 001b (i.e. _not_ to per
I_T nexus) in the Control mode page.  It should never happen for a pen
drive.

Setting TASK ABORTED aside, the important part is that an abort can do
one of two things:

- complete the command, and then eh_abort should return after the driver
has noticed the completion and called the -scsi_done callback for the
Scsi_Cmnd*.

- abort the command, and then the driver should never call the
-scsi_done callback for the Scsi_Cmnd*.

Paolo
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Debugging scsi abort handling ?

2014-08-25 Thread Hans de Goede

Hi,

On 08/25/2014 01:15 PM, Paolo Bonzini wrote:
 Il 25/08/2014 12:28, Bart Van Assche ha scritto:

 From SPC-4: 7.5.8 Control mode page [ ... ] A task aborted status (TAS)
 bit set to zero specifies that aborted commands shall be terminated by
 the device server without any response to the application client. A TAS
 bit set to one specifies that commands aborted by the actions of an I_T
 nexus other than the I_T nexus on which the command was received shall
 be completed with TASK ABORTED status (see SAM-5).
 
 Note the aborted by the actions of an I_T nexus other than the I_T
 nexus on which the command was received.
 
 In practice, this means that TASK ABORTED should only happen if you use
 the CLEAR TASK SET tmf and TST is not set to 001b (i.e. _not_ to per
 I_T nexus) in the Control mode page.  It should never happen for a pen
 drive.
 
 Setting TASK ABORTED aside, the important part is that an abort can do
 one of two things:
 
 - complete the command, and then eh_abort should return after the driver
 has noticed the completion and called the -scsi_done callback for the
 Scsi_Cmnd*.
 
 - abort the command, and then the driver should never call the
 -scsi_done callback for the Scsi_Cmnd*.

Thanks Bart and Paolo, your insights into this are greatly appreciated.

So with uas there are separate usb transaction for cmd, data in, data out
and sense for each tag. At the time of abort, usually one of data in / data
out and a sense usb transaction will be outstanding.

There already is logic in the driver to kill the data in / out transactions
if a sense gets returned (usually with an error) before they are done.

So if I'm reading this correctly, then on a successful abort, the sense
transaction (if not already completed by the target) should be cancelled as
it will never complete, correct ?

I wish the uas spec contained some more details on this, but it is very
vague wrt task management (well it is vague in general, task management just
is extra hard to test).

Regards,

Hans
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Debugging scsi abort handling ?

2014-08-25 Thread James Bottomley

On Mon, 2014-08-25 at 13:39 +0200, Paolo Bonzini wrote:
 Il 25/08/2014 13:26, Hans de Goede ha scritto:
  Thanks Bart and Paolo, your insights into this are greatly appreciated.
  
  So with uas there are separate usb transaction for cmd, data in, data out
  and sense for each tag. At the time of abort, usually one of data in / data
  out and a sense usb transaction will be outstanding.
  
  There already is logic in the driver to kill the data in / out transactions
  if a sense gets returned (usually with an error) before they are done.
  
  So if I'm reading this correctly, then on a successful abort, the sense
  transaction (if not already completed by the target) should be cancelled as
  it will never complete, correct ?
 
 Yes.  More precisely, once the response IU comes back for the abort,
 there should be no more IUs for that task.  Figure 13 (Multiple Command
 Example) in the UAS spec actually shows that.
 
 At least, that's what it should be like in theory.  I suspect firmware
 bugs will abound in this area, but at least you shouldn't be actively
 expecting an IU for an aborted task.

Just a note on this.  The abort area in all devices is where we have
scope for spec compliance issues.  Also in the old days abort itself
could trigger a firmware crash on some devices (including arrays).  The
problem is actually that the system is usually groaning because it's out
of memory within the controller.  That actually means that sending in
another task (the abort) causes greater pressure.  For this reason, some
device drivers choose to skip the abort step and head straight to reset.
For reset, you guarantee that all outstanding tasks for the unit are
killed.

James


--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Debugging scsi abort handling ?

2014-08-24 Thread Hans de Goede

Hi,

On 08/23/2014 05:42 PM, Douglas Gilbert wrote:
 On 14-08-23 10:52 AM, Hans de Goede wrote:
 Hi All,

 Now that the UAS driver is no longer marked as CONFIG_BROKEN,
 I'm getting quite a few bug reports about issues with UAS drives.

 One if the issues is that there might be a number of bugs in the
 abort handling path, as I don't think that was ever tested properly.

 So I'm wondering is there a way to test the abort path with a real
 drive? E.G. submit some command which is known to take a significant
 amount of time, and then abort it right after submitting ?
 
 Hi,
 You might experiment with starting a streaming read
 from one shell (e.g. with dd or ddpt) and while that
 is underway, from another shell issue a
 'sg_reset --device' and see what happens. If nothing then
 try 'sg_reset --target'.

 Aborting a single, long duration command from the user
 space (and only that command, assuming other processes might
 be using that disk) is not easy to do.

Thanks, I'll give sg_reset a try.

Regards,

Hans
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Debugging scsi abort handling ?

2014-08-24 Thread Hans de Goede

Hi,

On 08/23/2014 11:05 PM, James Bottomley wrote:
 On Sat, 2014-08-23 at 16:52 +0200, Hans de Goede wrote:
 Hi All,

 Now that the UAS driver is no longer marked as CONFIG_BROKEN,
 I'm getting quite a few bug reports about issues with UAS drives.

 One if the issues is that there might be a number of bugs in the
 abort handling path, as I don't think that was ever tested properly.
 
 Can you report the actual bugs and we'll try to take a look at them?

To be clear I believe there may be a bug or 2 in the uas.c abort code
paths, not in the scsi core or sd drivers.

But getting more eyes on these definitely makes sense. Should I CC
linux-scsi@vger on issues like this, or should I get the users
to file a bug at bugzilla.kernel.org (my own preference would be to
do the latter, as that keeps all info in a single place).

 
 So I'm wondering is there a way to test the abort path with a real
 drive? E.G. submit some command which is known to take a significant
 amount of time, and then abort it right after submitting ?
 
 This scenario can't really happen under the current eh, if by abort
 path, you mean the path where we abort the command by sending an abort
 TMF in error handling.

Yes that is the one I mean, some users of uas are seeing the 30 second
timeout kick in, and then most of the times the uas abort code does not
seem to actually abort, and a device reset is needed to resolve things.

This could mean 1 of 2 things:

1) the abort code in uas.c is no good
2) the device has actually locked up / crashed

Sometimes though the uas abort code does not just timeout on the abort,
and instead seems to go down in flames (kernel page fault), which seems
to indicate that even if 2 is the case here, that we still have an issue
in the uas abort code.

 The reason is that the command must timeout
 before we abort.  If you mean the path where the driver says it aborted
 the command and we have to retry, you can test that by returning
 DID_ABORT immediately in the queuecommand routine ... I use this to test
 some of the EH properties.  What you want to do is to modify the
 queuecommand to return abort on a small number of commands (say around
 5%) and then try normal operation.  This is what I used to test our
 submission and resubmission routines, but I haven't run it for a year or
 so.
 
 The final suggestion is that you need to make sure this patch is in
 their environment:
 
 commit c69e6f812bab0d5442b40e2f1bfbca48d40bc50b
 Author: James Bottomley jbottom...@parallels.com
 Date:   Thu Apr 10 13:36:11 2014 -0700
 
 [SCSI] More USB deadlock fixes
 
 The reason this may make a difference is that USB appears fragile to
 issuing commands before you complete a reset.

uas is only enabled in 3.15 and newer so that patch should be present.

Regards,

Hans

--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Debugging scsi abort handling ?

2014-08-24 Thread Christoph Hellwig

On Sun, Aug 24, 2014 at 10:46:57AM +0200, Hans de Goede wrote:
 To be clear I believe there may be a bug or 2 in the uas.c abort code
 paths, not in the scsi core or sd drivers.
 
 But getting more eyes on these definitely makes sense. Should I CC
 linux-scsi@vger on issues like this, or should I get the users
 to file a bug at bugzilla.kernel.org (my own preference would be to
 do the latter, as that keeps all info in a single place).

Please Cc linux-scsi.  I and some others definitively don't have the
pain and patience to deal with bugzilla.

 uas is only enabled in 3.15 and newer so that patch should be present.

There have been a few more important fixes since 3.15:

ac61d19559349e205dad7b5122b281419aa74a82 scsi: set correct completion code in 
scsi_send_eh_cmnd()
8922a908908ff947f1f211e07e2e97f1169ad3cb scsi_error: fix invalid setting of 
host byte
a33c070bced8b283e22e8dbae35177a033b810bf scsi_error: set DID_TIME_OUT correctly

make sure you have those as well.
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Debugging scsi abort handling ?

2014-08-23 Thread Hans de Goede

Hi All,

Now that the UAS driver is no longer marked as CONFIG_BROKEN,
I'm getting quite a few bug reports about issues with UAS drives.

One if the issues is that there might be a number of bugs in the
abort handling path, as I don't think that was ever tested properly.

So I'm wondering is there a way to test the abort path with a real
drive? E.G. submit some command which is known to take a significant
amount of time, and then abort it right after submitting ?

Thanks  Regards,

Hans
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Debugging scsi abort handling ?

2014-08-23 Thread Douglas Gilbert


On 14-08-23 10:52 AM, Hans de Goede wrote:

Hi All,

Now that the UAS driver is no longer marked as CONFIG_BROKEN,
I'm getting quite a few bug reports about issues with UAS drives.

One if the issues is that there might be a number of bugs in the
abort handling path, as I don't think that was ever tested properly.

So I'm wondering is there a way to test the abort path with a real
drive? E.G. submit some command which is known to take a significant
amount of time, and then abort it right after submitting ?


Hi,
You might experiment with starting a streaming read
from one shell (e.g. with dd or ddpt) and while that
is underway, from another shell issue a
'sg_reset --device' and see what happens. If nothing then
try 'sg_reset --target'.

Aborting a single, long duration command from the user
space (and only that command, assuming other processes might
be using that disk) is not easy to do.

Doug Gilbert


--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Debugging scsi abort handling ?

2014-08-23 Thread James Bottomley

On Sat, 2014-08-23 at 16:52 +0200, Hans de Goede wrote:
 Hi All,
 
 Now that the UAS driver is no longer marked as CONFIG_BROKEN,
 I'm getting quite a few bug reports about issues with UAS drives.
 
 One if the issues is that there might be a number of bugs in the
 abort handling path, as I don't think that was ever tested properly.

Can you report the actual bugs and we'll try to take a look at them?

 So I'm wondering is there a way to test the abort path with a real
 drive? E.G. submit some command which is known to take a significant
 amount of time, and then abort it right after submitting ?

This scenario can't really happen under the current eh, if by abort
path, you mean the path where we abort the command by sending an abort
TMF in error handling.  The reason is that the command must timeout
before we abort.  If you mean the path where the driver says it aborted
the command and we have to retry, you can test that by returning
DID_ABORT immediately in the queuecommand routine ... I use this to test
some of the EH properties.  What you want to do is to modify the
queuecommand to return abort on a small number of commands (say around
5%) and then try normal operation.  This is what I used to test our
submission and resubmission routines, but I haven't run it for a year or
so.

The final suggestion is that you need to make sure this patch is in
their environment:

commit c69e6f812bab0d5442b40e2f1bfbca48d40bc50b
Author: James Bottomley jbottom...@parallels.com
Date:   Thu Apr 10 13:36:11 2014 -0700

[SCSI] More USB deadlock fixes

The reason this may make a difference is that USB appears fragile to
issuing commands before you complete a reset.

James


 Thanks  Regards,
 
 Hans
 --
 To unsubscribe from this list: send the line unsubscribe linux-scsi in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 


--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

39 matches

Mail list logo