Hi,

On Fri, 2010-08-06 at 15:57 +0100, Hannes Reinecke wrote:
> Mike Christie wrote:
> > ccing Hannes from suse, because this looks like a SLES only bug.
> >
> > Hey Hannes,
> >
> > The user is using Linux 2.6.27 x86 based on SLES + Xen 3.4 (as dom0)
> > running a couple of RHEL 5.5 VMs. The underlying storage for these VMs
> > is iSCSI based via open-iscsi 2.0.870-26.6.1 and a DELL equallogic array.
> >
> >
> > On 08/05/2010 02:21 PM, Goncalo Gomes wrote:
> >> I've copied both the messages file from the host goncalog140 and the
> >> patched libiscsi.c. FWIW, I've also included the iscsid.conf. Find these
> >> files in the link below:
> >>
> >> http://promisc.org/iscsi/
> >>
> >
> > It looks like this chunk from libiscsi.c:iscsi_queuecommand:
> >
> >         case ISCSI_STATE_FAILED:
> >                 reason = FAILURE_SESSION_FAILED;
> >                 sc->result = DID_TRANSPORT_DISRUPTED << 16;
> >                 break;
> >
> > is causing IO errors.
> >
> > You want to use something like DID_IMM_RETRY because it can be a long
> > time between the time the kernel marks the state as ISCSI_STATE_FAILED
> > until we start recovery and properly get all the device queues blocked,
> > so we can exhaust all the retries if we use DID_TRANSPORT_DISRUPTED.
> Yeah, I noticed.
> But the problem is that multipathing will stall during this time,
> ie no failover will occur and I/O will stall. Using DID_TRANSPORT_DISRUPTED
> will circumvent this and we can failover immediately.
>
> Sadly I got additional bugreports about this so I think I'll have
> to revert it.
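For anyone following along, here is a rough user-space sketch of the trade-off discussed in the quoted text above. It is not the libiscsi.c code itself: the struct, the two fail_cmd_* helpers and the still_retryable() toy model are hypothetical names made up for illustration. The two DID_* values are the ones I believe the kernel's include/scsi/scsi.h defines, and the retry accounting is a simplified assumption about how the SCSI midlayer treats each code.

```c
/* Host-byte status codes; values assumed to match include/scsi/scsi.h. */
#define DID_IMM_RETRY            0x0c  /* retry without consuming a retry */
#define DID_TRANSPORT_DISRUPTED  0x0e  /* transport disrupted; each retry is counted */

/* The host byte lives in bits 16..23 of the result word. */
#define host_byte(result) (((result) >> 16) & 0xff)

/* Hypothetical stand-in for struct scsi_cmnd, just enough for this sketch. */
struct scsi_cmnd_sketch {
    int result;
};

/* What the quoted ISCSI_STATE_FAILED branch sets today. */
static void fail_cmd_original(struct scsi_cmnd_sketch *sc)
{
    sc->result = DID_TRANSPORT_DISRUPTED << 16;
}

/* The alternative Mike suggests for the same branch. */
static void fail_cmd_suggested(struct scsi_cmnd_sketch *sc)
{
    sc->result = DID_IMM_RETRY << 16;
}

/*
 * Toy model (an assumption, not kernel code): returns 1 if a command that
 * keeps failing with `result` is still retryable after `requeues` attempts,
 * given `allowed` retries. DID_IMM_RETRY never touches the counter; any
 * other code consumes one retry per attempt.
 */
static int still_retryable(int result, int allowed, int requeues)
{
    int retries = 0;

    for (int i = 0; i < requeues; i++) {
        if (host_byte(result) == DID_IMM_RETRY)
            continue;           /* retried for free */
        if (++retries > allowed)
            return 0;           /* retries exhausted: the I/O error is reported */
    }
    return 1;
}
```

With 5 allowed retries, the model reports a DID_TRANSPORT_DISRUPTED command as failed on the sixth requeue, while a DID_IMM_RETRY command stays retryable no matter how long recovery takes. That is the behaviour Mike describes, and also why Hannes worries that multipath failover would be delayed.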
I applied and tested the changes Mike Christie suggests. After the LUN is rebalanced within the array I no longer see the IO errors, and the setup now appears resilient to the EqualLogic LUN failover process. I'm attaching the dmesg log merely as a sanity check, if anyone cares to take a look.

> I have put some test kernels at
>
> http://beta.suse.com/private/hare/sles11/iscsi

Do the test kernels at the URL above contain the change of DID_TRANSPORT_DISRUPTED to DID_IMM_RETRY, or is there more to it than simply changing the result code? If the latter, would you be able to upload the source RPMs or a unified patch containing the changes you are staging? I'm looking for a more palatable way to test them, given I have no SLES box lying around, but I will install one if need be.

Thanks,
-Goncalo.

device-mapper: multipath: version 1.0.5 loaded
device-mapper: multipath round-robin: version 1.0.0 loaded
device-mapper: table: 251:1: multipath: error getting device
device-mapper: ioctl: error adding target to table
device-mapper: table: 251:1: multipath: error getting device
device-mapper: ioctl: error adding target to table
Citrix Systems, Inc. -- Private Release Kernel
Private File Disclaimer
The private files provided to you contain a preliminary code fix. These private files have been created and distributed to you to address your specific issue and provide Citrix with the feedback that your issue has been resolved or to provide further debugging information. These private files have had minimal in-house testing with no regression testing and may contain defects. These private file(s) will only be supported until an official Hotfix has been provided or one is publicly available from the Citrix web site. Any private files that are provided to you are intended only for the use of the individual or entity to which this is addressed and distribution of these files or utilities is prohibited.
CITRIX MAKES NO REPRESENTATIONS OR WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE WITH RESPECT TO THE PRIVATE FILES. THE PRIVATE FILES ARE DELIVERED ON AN "AS IS" BASIS. YOU SHALL HAVE THE SOLE RESPONSIBILITY FOR ADEQUATE PROTECTION AND BACK-UP OF AN
Loading iSCSI transport class v2.0-870.
iscsi: registered transport (tcp)
scsi6 : iSCSI Initiator over TCP/IP
connection1:0: detected conn error (1011)
scsi 6:0:0:0: Direct-Access EQLOGIC 100E-00 4.3 PQ: 0 ANSI: 5
sd 6:0:0:0: [sdb] 209725440 512-byte hardware sectors: (107 GB/100 GiB)
sd 6:0:0:0: [sdb] Write Protect is off
sd 6:0:0:0: [sdb] Mode Sense: ad 00 00 00
sd 6:0:0:0: [sdb] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
sd 6:0:0:0: [sdb] 209725440 512-byte hardware sectors: (107 GB/100 GiB)
sd 6:0:0:0: [sdb] Write Protect is off
sd 6:0:0:0: [sdb] Mode Sense: ad 00 00 00
sd 6:0:0:0: [sdb] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
sdb: sdb1
sd 6:0:0:0: [sdb] Attached SCSI disk
sd 6:0:0:0: Attached scsi generic sg1 type 0
tap_backend_changed: backend/tap/1/51712: created thread 9531
tap_blkif_schedule[9531]: starting
device vif1.0 entered promiscuous mode
xenbr0: port 2(vif1.0) entering forwarding state
blktap: event-channel 6
blktap: ring-ref 8
blktap: protocol 1 (x86_32-abi)
blkback: event-channel 7
blkback: ring-ref 9
blkback: protocol 1 (x86_32-abi)
connection1:0: ping timeout of 15 secs expired, recv timeout 10, last rx 219697, last ping 220697, now 222197
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
session1: session recovery timed out after 144 secs
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: ping timeout of 15 secs expired, recv timeout 10, last rx 310947, last ping 311947, now 313447
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: dropping R2T itt 3 in recovery.
connection1:0: dropping R2T itt 4 in recovery.
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: dropping R2T itt 58 in recovery.
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: dropping R2T itt 63 in recovery.
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: dropping R2T itt 66 in recovery.
connection1:0: dropping R2T itt 71 in recovery.
connection1:0: dropping R2T itt 70 in recovery.
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: dropping R2T itt 112 in recovery.
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1011)
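
As a sanity check on the two "ping timeout" lines in the log above, the small sketch below parses such a line and works out how many counter ticks correspond to one second. It assumes the last rx / last ping / now fields are kernel jiffies and that the gaps between them correspond to the recv and ping timeouts; the log itself does not state the units, so treat that as an assumption. The struct and function names are hypothetical.

```c
#include <stdio.h>

/* Fields of a "ping timeout" log line (names are illustrative). */
struct ping_timeout {
    int ping_tmo_secs;
    int recv_tmo_secs;
    unsigned long last_rx, last_ping, now;
};

/* Parse one log line; returns 1 if all five fields were read, else 0. */
static int parse_ping_timeout(const char *line, struct ping_timeout *pt)
{
    return sscanf(line,
                  "connection%*d:%*d: ping timeout of %d secs expired, "
                  "recv timeout %d, last rx %lu, last ping %lu, now %lu",
                  &pt->ping_tmo_secs, &pt->recv_tmo_secs,
                  &pt->last_rx, &pt->last_ping, &pt->now) == 5;
}

/* Counter ticks per second implied by a tick delta and a timeout in seconds. */
static unsigned long ticks_per_sec(unsigned long delta_ticks, int secs)
{
    return secs > 0 ? delta_ticks / (unsigned long)secs : 0;
}
```

For both lines in the log, last ping minus last rx is 1000 ticks against a 10 second recv timeout, and now minus last ping is 1500 ticks against a 15 second ping timeout. Under the assumptions above, that works out to 100 ticks per second in each case, so the timeouts appear to be firing exactly as configured.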