Hi,

On Fri, 2010-08-06 at 15:57 +0100, Hannes Reinecke wrote: 
> Mike Christie wrote:
> > ccing Hannes from suse, because this looks like a SLES only bug.
> > 
> > Hey Hannes,
> > 
> > The user is using Linux 2.6.27 x86 based on SLES + Xen 3.4 (as dom0)
> > running a couple of RHEL 5.5 VMs. The underlying storage for these VMs
> > is iSCSI based via open-iscsi 2.0.870-26.6.1 and a DELL equallogic array.
> > 
> > 
> > On 08/05/2010 02:21 PM, Goncalo Gomes wrote:
> >> I've copied both the messages file from the host goncalog140 and the
> >> patched libiscsi.c. FWIW, I've also included the iscsid.conf. Find these
> >> files in the link below:
> >>
> >> http://promisc.org/iscsi/
> >>
> > 
> > It looks like this chunk from libiscsi.c:iscsi_queuecommand:
> > 
> >         case ISCSI_STATE_FAILED:
> >             reason = FAILURE_SESSION_FAILED;
> >             sc->result = DID_TRANSPORT_DISRUPTED << 16;
> >             break;
> > 
> > is causing IO errors.
> > 
> > You want to use something like DID_IMM_RETRY because it can be a long
> > time between the time the kernel marks the state as ISCSI_STATE_FAILED
> > until we start recovery and properly get all the device queues blocked,
> > so we can exhaust all the retries if we use DID_TRANSPORT_DISRUPTED.
> Yeah, I noticed.
> But the problem is that multipathing will stall during this time,
> ie no failover will occur and I/O will stall. Using DID_TRANSPORT_DISRUPTED
> will circumvent this and we can failover immediately.
> 
> Sadly I got additional bugreports about this so I think I'll have
> to revert it.

I applied and tested the change Mike Christie suggested. After the LUN
is rebalanced within the array I no longer see the IO errors, and the
setup now appears to be resilient to the EqualLogic LUN failover
process.
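
For anyone following along, the change boils down to swapping the host
byte set for ISCSI_STATE_FAILED in the chunk Mike quoted. Below is a
minimal userspace sketch of that idea, not the actual patch: the DID_*
values are what I believe include/scsi/scsi.h defines, and every
sketch_* name is a hypothetical stand-in rather than a libiscsi symbol.

```c
#include <assert.h>

/* Host byte codes; values assumed to match include/scsi/scsi.h. */
#define DID_IMM_RETRY            0x0c  /* retry without decrementing the retry count */
#define DID_TRANSPORT_DISRUPTED  0x0e  /* transport error disrupted execution */

/* Hypothetical stand-ins for the session states in libiscsi.h. */
enum sketch_state { SKETCH_STATE_LOGGED_IN, SKETCH_STATE_FAILED };

/* Mirrors what the kernel's host_byte() macro extracts from a result. */
static int sketch_host_byte(int result)
{
    return (result >> 16) & 0xff;
}

/*
 * Returns the value the patched switch would store in sc->result.
 * 0 here simply means "no error set, queue the command normally".
 */
static int sketch_result_for_state(enum sketch_state state)
{
    switch (state) {
    case SKETCH_STATE_FAILED:
        /* was: DID_TRANSPORT_DISRUPTED << 16 */
        return DID_IMM_RETRY << 16;
    default:
        return 0;
    }
}

/* Small self-check: a failed session must now report DID_IMM_RETRY. */
static void sketch_self_check(void)
{
    int result = sketch_result_for_state(SKETCH_STATE_FAILED);

    assert(sketch_host_byte(result) == DID_IMM_RETRY);
    assert(sketch_host_byte(result) != DID_TRANSPORT_DISRUPTED);
}
```

The point of the swap, as Mike explained, is that commands failed while
the session is in ISCSI_STATE_FAILED should not use up their retries
before recovery gets the device queues blocked.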

I'm attaching the dmesg log as a sanity check, in case anyone cares to
take a look.

> I have put some test kernels at
> 
> http://beta.suse.com/private/hare/sles11/iscsi

Do the test kernels at the URL above contain only the change from
DID_TRANSPORT_DISRUPTED to DID_IMM_RETRY, or is there more to it than
simply changing the result code? If the latter, would you be able to
upload the source RPMs or a unified patch containing the changes you
are staging? I'm looking for a more palatable way to test them, given I
have no SLES box lying around, but I will install one if need be.

Thanks,
-Goncalo.


device-mapper: multipath: version 1.0.5 loaded
device-mapper: multipath round-robin: version 1.0.0 loaded
device-mapper: table: 251:1: multipath: error getting device
device-mapper: ioctl: error adding target to table
device-mapper: table: 251:1: multipath: error getting device
device-mapper: ioctl: error adding target to table
Citrix Systems, Inc. -- Private Release Kernel
Private File Disclaimer The private files provided to you contain a preliminary 
code fix. These private files have been created and distributed to you to 
address your specific issue and provide Citrix with the feedback that your 
issue has been resolved or to provide further debugging information. These 
private files have had minimal in-house testing with no regression testing and 
may contain defects.  These private file(s) will only be supported until an 
official Hotfix has been provided or one is publicly available from the Citrix 
web site. Any private files that are provided to you are intended only for the 
use of the individual or entity to which this is addressed and distribution of 
these files or utilities is prohibited. CITRIX MAKES NO REPRESENTATIONS OR 
WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR 
PURPOSE WITH RESPECT TO THE PRIVATE FILES.  THE PRIVATE FILES ARE DELIVERED ON 
AN "AS IS" BASIS. YOU SHALL HAVE THE SOLE RESPONSIBILITY FOR ADEQUATE 
PROTECTION AND BACK-UP OF AN
Loading iSCSI transport class v2.0-870.
iscsi: registered transport (tcp)
scsi6 : iSCSI Initiator over TCP/IP
 connection1:0: detected conn error (1011)
scsi 6:0:0:0: Direct-Access     EQLOGIC  100E-00          4.3  PQ: 0 ANSI: 5
sd 6:0:0:0: [sdb] 209725440 512-byte hardware sectors: (107 GB/100 GiB)
sd 6:0:0:0: [sdb] Write Protect is off
sd 6:0:0:0: [sdb] Mode Sense: ad 00 00 00
sd 6:0:0:0: [sdb] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
sd 6:0:0:0: [sdb] 209725440 512-byte hardware sectors: (107 GB/100 GiB)
sd 6:0:0:0: [sdb] Write Protect is off
sd 6:0:0:0: [sdb] Mode Sense: ad 00 00 00
sd 6:0:0:0: [sdb] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
 sdb: sdb1
sd 6:0:0:0: [sdb] Attached SCSI disk
sd 6:0:0:0: Attached scsi generic sg1 type 0
tap_backend_changed: backend/tap/1/51712: created thread 9531
tap_blkif_schedule[9531]: starting
device vif1.0 entered promiscuous mode
xenbr0: port 2(vif1.0) entering forwarding state
blktap: event-channel 6
blktap: ring-ref 8
blktap: protocol 1 (x86_32-abi)
blkback: event-channel 7
blkback: ring-ref 9
blkback: protocol 1 (x86_32-abi)
 connection1:0: ping timeout of 15 secs expired, recv timeout 10, last rx 219697, last ping 220697, now 222197
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 session1: session recovery timed out after 144 secs
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: ping timeout of 15 secs expired, recv timeout 10, last rx 310947, last ping 311947, now 313447
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: dropping R2T itt 3 in recovery.
 connection1:0: dropping R2T itt 4 in recovery.
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: dropping R2T itt 58 in recovery.
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: dropping R2T itt 63 in recovery.
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: dropping R2T itt 66 in recovery.
 connection1:0: dropping R2T itt 71 in recovery.
 connection1:0: dropping R2T itt 70 in recovery.
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: dropping R2T itt 112 in recovery.
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
 connection1:0: detected conn error (1011)
