Re: [Lsf] Notes from the four separate IO track sessions at LSF/MM

2016-04-28 Thread James Bottomley
On Thu, 2016-04-28 at 16:19 +, Knight, Frederick wrote:
> There are multiple possible situations being intermixed in this
> discussion.  First, I assume you're talking only about random access
> devices (if you try transport level error recover on a sequential
> access device - tape or SMR disk - there are lots of additional
> complexities).

Tape figured prominently in the reset discussion.  Resetting beyond the
LUN can have a grave impact on long-running jobs (mostly on tapes).

> Failures can occur at multiple places:
> a) Transport layer failures that the transport layer is able to
> detect quickly;
> b) SCSI device layer failures that the transport layer never even
> knows about.
> 
> For (a) there are two competing goals.  If a port drops off the
> fabric and comes back again, should you be able to just recover and
> continue.  But how long do you wait during that drop?  Some devices
> use this technique to "move" a WWPN from one place to another.  The
> port drops from the fabric, and a short time later, shows up again
> (the WWPN moves from one physical port to a different physical port).
> There are FC driver layer timers that define the length of time
> allowed for this operation.  The goal is fast failover, but not too
> fast - because too fast will break this kind of "transparent
> failover".  This timer also allows for the "OH crap, I pulled the
> wrong cable - put it back in; quick" kind of stupid user bug.

I think we already have this sorted out with the dev_loss_tmo, which is
implemented in the transport.  It's the grace period we have before we
act on a path loss.
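
For reference, roughly where that grace period lives today: each FC
remote port exposes a dev_loss_tmo attribute in sysfs.  A minimal
sketch only; the rport IDs and values below are placeholders, not taken
from any particular setup:

# Show the current grace period (seconds) on every FC remote port.
for r in /sys/class/fc_remote_ports/rport-*; do
    printf '%s %s\n' "$r" "$(cat "$r/dev_loss_tmo")"
done

# Lengthen the grace period on one rport, e.g. to ride out a planned
# WWPN move (value chosen purely for illustration).
echo 60 > /sys/class/fc_remote_ports/rport-2:0-3/dev_loss_tmo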

> For (b) the transport never has a failure.  A LUN (or a group of
> LUNs) have an ALUA transition from one set of ports to a different
> set of ports.  Some of the LUNs on the port continue to work just
> fine, but others enter ALUA TRANSITION state so they can "move" to a
> different part of the hardware.  After the move completes, you now
> have different sets of optimized and non-optimized paths (or possible
> standby, or unavailable).  The transport will never even know this
> happened.  This kind of "failure" is handled by the SCSI layer
> drivers.

OK, so ALUA did come up as well, I just forgot.  Perhaps I should back
off a bit and give the historical reasons why dm became our primary
path failover system.  It's because for the first ~15 years of Linux we
had no separate transport infrastructure in SCSI (and, to be fair, T10
didn't either).  In fact, all SCSI drivers implemented their own
variants of transport handling.  This meant there was initial pressure
to make transport failover driver-specific, and the answer to that was
a resounding "hell no!", so dm (and md) became the de facto path
failover standard because there was nowhere else to put it.  The
transport infrastructure didn't really mature until 2006-2007, well
after this decision was made.  However, now that we have transport
infrastructure, the question of whether we can use it for path failover
isn't unreasonable.  If we abstract it correctly, it could become a
library usable by all our current transports, so we might only need a
single implementation.

For ALUA specifically (and other weird ALUA-like implementations), the
handling code actually sits in drivers/scsi/device_handler, so it could
also be used by the transport code to make path decisions.  The point
here is that even if we implement path failover at the transport level,
we have more information available for the failover decision than the
transport layer alone strictly knows.
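
For the curious, the attached handler is visible (and, on most kernels,
selectable) through sysfs.  A rough sketch only; the disk names are
examples and the exact attribute behaviour varies a little between
kernel versions:

# Show which SCSI device handler, if any, is attached to each disk;
# "alua" means scsi_dh_alua is tracking port group state for that LUN.
for d in /sys/block/sd*/device; do
    printf '%s %s\n' "$d" "$(cat "$d/dh_state" 2>/dev/null)"
done

# Attach the ALUA handler by hand if it was not attached automatically.
echo alua > /sys/block/sdb/device/dh_state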

James




Re: [Lsf] Notes from the four separate IO track sessions at LSF/MM

2016-04-28 Thread Bart Van Assche
Hello Fred,

Your feedback is very useful, but please note that in my e-mail I used
the phrase "transport layer" to refer to the code in the Linux kernel in
which the fast_io_fail_tmo functionality has been implemented. The
following commit message from 10 years ago explains why the
fast_io_fail_tmo and dev_loss_tmo mechanisms have been implemented:

---
commit 0f29b966d60e9a4f5ecff9f3832257b38aea4f13
Author: James Smart <james.sm...@emulex.com>
Date:   Fri Aug 18 17:33:29 2006 -0400

[SCSI] FC transport: Add dev_loss_tmo callbacks, and new fast_io_fail_tmo w/ callback

This patch adds the following functionality to the FC transport:

- dev_loss_tmo LLDD callback :
  Called to essentially confirm the deletion of an rport. Thus, it is
  called whenever the dev_loss_tmo fires, or when the rport is deleted
  due to other circumstances (module unload, etc).  It is expected that
  the callback will initiate the termination of any outstanding i/o on
  the rport.

- fast_io_fail_tmo and LLD callback:
  There are some cases where it may take a long while to truly determine
  device loss, but the system is in a multipathing configuration that if
  the i/o was failed quickly (faster than dev_loss_tmo), it could be
  redirected to a different path and completed sooner.

Many thanks to Mike Reed who cleaned up the initial RFC in support
of this post.
---
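
Both timeouts ended up as per-rport attributes on the FC transport
class.  As a rough sketch of the intended usage (rport ID and values
are placeholders): keep fast_io_fail_tmo well below dev_loss_tmo, so
outstanding I/O is failed back to dm-multipath quickly while the rport
itself still gets the longer grace period before it is torn down.

# Inspect and lower the fast-fail timeout on one remote port.
cat /sys/class/fc_remote_ports/rport-2:0-3/fast_io_fail_tmo
echo 5 > /sys/class/fc_remote_ports/rport-2:0-3/fast_io_fail_tmo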

Bart.

On 04/28/2016 09:19 AM, Knight, Frederick wrote:
> There are multiple possible situations being intermixed in this discussion.
> First, I assume you're talking only about random access devices (if you try
> transport level error recovery on a sequential access device - tape or SMR
> disk - there are lots of additional complexities).
> 
> Failures can occur at multiple places:
> a) Transport layer failures that the transport layer is able to detect 
> quickly;
> b) SCSI device layer failures that the transport layer never even knows about.
> 
> For (a) there are two competing goals.  If a port drops off the fabric and
> comes back again, should you be able to just recover and continue.  But how
> long do you wait during that drop?  Some devices use this technique to "move"
> a WWPN from one place to another.  The port drops from the fabric, and a
> short time later, shows up again (the WWPN moves from one physical port to a
> different physical port). There are FC driver layer timers that define the
> length of time allowed for this operation.  The goal is fast failover, but
> not too fast - because too fast will break this kind of "transparent 
> failover".
> This timer also allows for the "OH crap, I pulled the wrong cable - put it
> back in; quick" kind of stupid user bug.
> 
> For (b) the transport never has a failure.  A LUN (or a group of LUNs)
> have an ALUA transition from one set of ports to a different set of ports.
> Some of the LUNs on the port continue to work just fine, but others enter
> ALUA TRANSITION state so they can "move" to a different part of the hardware.
> After the move completes, you now have different sets of optimized and
> non-optimized paths (or possible standby, or unavailable).  The transport
> will never even know this happened.  This kind of "failure" is handled by
> the SCSI layer drivers.
> 
> There are other cases too, but these are the most common.
> 
>   Fred
> 
> -Original Message-
> From: lsf-boun...@lists.linux-foundation.org 
> [mailto:lsf-boun...@lists.linux-foundation.org] On Behalf Of Bart Van Assche
> Sent: Thursday, April 28, 2016 11:54 AM
> To: James Bottomley; Mike Snitzer
> Cc: linux-bl...@vger.kernel.org; l...@lists.linux-foundation.org; 
> device-mapper development; linux-scsi
> Subject: Re: [Lsf] Notes from the four separate IO track sessions at LSF/MM
> 
> On 04/28/2016 08:40 AM, James Bottomley wrote:
>> Well, the entire room, that's vendors, users and implementors
>> complained that path failover takes far too long.  I think in their
>> minds this is enough substance to go on.
> 
> The only complaints I heard about path failover taking too long came
> from people working on FC drivers. Aren't SCSI transport layer
> implementations expected to fail I/O after fast_io_fail_tmo expired
> instead of waiting until the SCSI error handler has finished? If so, why
> is it considered an issue that error handling for the FC protocol can
> take very long (hours)?
> 
> Thanks,
> 
> Bart.



Re: [Lsf] Notes from the four separate IO track sessions at LSF/MM

2016-04-28 Thread Laurence Oberman
Hello Folks,

We still get periodic complaints from our large customer base regarding
the long recovery times for dm-multipath.
Most of the time this happens when we have something like a switch
back-plane issue, or an issue where RSCNs are blocked coming back up the
fabric.  Corner cases still bite us often.

Most of the complaints come from customers seeing, for example, Oracle
cluster evictions, because all multipath I/O is blocked while we wait on
the mid-layer to finish recovery.

We have to tune eh_deadline, eh_timeout and fast_io_fail_tmo, but even
with those tuned we still have to wait on serial recovery, even when the
timeouts are set low.

Lately we have been living with:
eh_deadline=10
eh_timeout=5
fast_io_fail_tmo=10
leaving the default sd timeout at 30s (see the sysfs sketch below).
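
For completeness, a sketch of where those knobs live in sysfs (host,
rport and disk names are examples; in practice this is applied to every
path behind a map, and fast_io_fail_tmo is normally pushed down by
multipathd from multipath.conf):

echo 10 > /sys/class/scsi_host/host2/eh_deadline                   # eh_deadline=10
echo 5 > /sys/block/sdc/device/eh_timeout                          # eh_timeout=5
echo 10 > /sys/class/fc_remote_ports/rport-2:0-3/fast_io_fail_tmo  # fast_io_fail_tmo=10
cat /sys/block/sdc/device/timeout                                  # sd command timeout, default 30s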

So this continues to be an issue, and I can provide specific examples,
captured with the jammer, showing the serial recovery times here.

Thanks

Laurence Oberman
Principal Software Maintenance Engineer
Red Hat Global Support Services

- Original Message -
From: "Bart Van Assche" <bart.vanass...@sandisk.com>
To: "James Bottomley" <james.bottom...@hansenpartnership.com>, "Mike Snitzer" 
<snit...@redhat.com>
Cc: linux-bl...@vger.kernel.org, l...@lists.linux-foundation.org, 
"device-mapper development" <dm-de...@redhat.com>, "linux-scsi" 
<linux-scsi@vger.kernel.org>
Sent: Thursday, April 28, 2016 11:53:50 AM
Subject: Re: [Lsf] Notes from the four separate IO track sessions at LSF/MM

On 04/28/2016 08:40 AM, James Bottomley wrote:
> Well, the entire room, that's vendors, users and implementors
> complained that path failover takes far too long.  I think in their
> minds this is enough substance to go on.

The only complaints I heard about path failover taking too long came 
from people working on FC drivers. Aren't SCSI transport layer 
implementations expected to fail I/O after fast_io_fail_tmo expired 
instead of waiting until the SCSI error handler has finished? If so, why 
is it considered an issue that error handling for the FC protocol can 
take very long (hours)?

Thanks,

Bart.


RE: [Lsf] Notes from the four separate IO track sessions at LSF/MM

2016-04-28 Thread Knight, Frederick
There are multiple possible situations being intermixed in this discussion.  
First, I assume you're talking only about random access devices (if you try 
transport level error recovery on a sequential access device - tape or SMR disk
- there are lots of additional complexities).

Failures can occur at multiple places:
a) Transport layer failures that the transport layer is able to detect quickly;
b) SCSI device layer failures that the transport layer never even knows about.

For (a) there are two competing goals.  If a port drops off the fabric and 
comes back again, should you be able to just recover and continue.  But how 
long do you wait during that drop?  Some devices use this technique to "move" a 
WWPN from one place to another.  The port drops from the fabric, and a short 
time later, shows up again (the WWPN moves from one physical port to a 
different physical port). There are FC driver layer timers that define the 
length of time allowed for this operation.  The goal is fast failover, but not 
too fast - because too fast will break this kind of "transparent failover".  
This timer also allows for the "OH crap, I pulled the wrong cable - put it back 
in; quick" kind of stupid user bug.

For (b) the transport never has a failure.  A LUN (or a group of LUNs) have an 
ALUA transition from one set of ports to a different set of ports.  Some of the 
LUNs on the port continue to work just fine, but others enter ALUA TRANSITION 
state so they can "move" to a different part of the hardware.  After the move 
completes, you now have different sets of optimized and non-optimized paths (or 
possible standby, or unavailable).  The transport will never even know this 
happened.  This kind of "failure" is handled by the SCSI layer drivers.

There are other cases too, but these are the most common.

Fred

-Original Message-
From: lsf-boun...@lists.linux-foundation.org 
[mailto:lsf-boun...@lists.linux-foundation.org] On Behalf Of Bart Van Assche
Sent: Thursday, April 28, 2016 11:54 AM
To: James Bottomley; Mike Snitzer
Cc: linux-bl...@vger.kernel.org; l...@lists.linux-foundation.org; device-mapper 
development; linux-scsi
Subject: Re: [Lsf] Notes from the four separate IO track sessions at LSF/MM

On 04/28/2016 08:40 AM, James Bottomley wrote:
> Well, the entire room, that's vendors, users and implementors
> complained that path failover takes far too long.  I think in their
> minds this is enough substance to go on.

The only complaints I heard about path failover taking too long came 
from people working on FC drivers. Aren't SCSI transport layer 
implementations expected to fail I/O after fast_io_fail_tmo expired 
instead of waiting until the SCSI error handler has finished? If so, why 
is it considered an issue that error handling for the FC protocol can 
take very long (hours)?

Thanks,

Bart.


Re: [Lsf] Notes from the four separate IO track sessions at LSF/MM

2016-04-28 Thread Bart Van Assche

On 04/28/2016 08:40 AM, James Bottomley wrote:
> Well, the entire room, that's vendors, users and implementors
> complained that path failover takes far too long.  I think in their
> minds this is enough substance to go on.

The only complaints I heard about path failover taking too long came 
from people working on FC drivers. Aren't SCSI transport layer 
implementations expected to fail I/O after fast_io_fail_tmo expired 
instead of waiting until the SCSI error handler has finished? If so, why 
is it considered an issue that error handling for the FC protocol can 
take very long (hours)?


Thanks,

Bart.