RE: detected conn error (1011)

2010-08-31 Thread Goncalo Gomes
Thanks Hannes and Mike,

Your help has been highly appreciated!

Cheers,
 -Goncalo.

-----Original Message-----
From: Hannes Reinecke [mailto:h...@suse.de] 
Sent: 31 August 2010 14:43
To: Goncalo Gomes
Cc: Mike Christie; open-iscsi@googlegroups.com; Shantanu Mehendale
Subject: Re: detected conn error (1011)

Goncalo Gomes wrote:
> Hi Hannes,
> 
> Thanks. The Citrix XenServer 5.6 distribution kernel is based on the 2.6.27
> tree of SLES 11. We add a few extra patches specific to Xen, dom0
> integration and some backports from upstream. To the best of my knowledge
> these additions don't touch the iSCSI layer, so from the iSCSI drivers'
> point of view I believe they are as pristine as the ones in the SuSE
> kernel. That's why we need the patch rather than the binaries: the
> binaries will probably mismatch the gcc version and/or the versioning that
> we use, e.g. 2.6.27.42-0.1.1.xs5.6.0.44.58xen. I do definitely appreciate
> your 'forward thinking' with regards to the issue, though!
> 
I just checked, and the resulting patch is indeed like you proposed:

diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
index 32b30f1..441ca8b 100644
--- a/drivers/scsi/libiscsi.c
+++ b/drivers/scsi/libiscsi.c
@@ -1336,9 +1336,6 @@ int iscsi_queuecommand(struct scsi_cmnd *sc, void (*done)(struct scsi_cmnd *))
          */
         switch (session->state) {
         case ISCSI_STATE_FAILED:
-                reason = FAILURE_SESSION_FAILED;
-                sc->result = DID_TRANSPORT_DISRUPTED << 16;
-                break;
         case ISCSI_STATE_IN_RECOVERY:
                 reason = FAILURE_SESSION_IN_RECOVERY;
                 sc->result = DID_IMM_RETRY << 16;
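
As a minimal, standalone sketch of what the hunk above does (not the kernel code itself): with its three lines removed, ISCSI_STATE_FAILED falls through to the ISCSI_STATE_IN_RECOVERY case, so commands queued in either state complete with DID_IMM_RETRY. The enum and function names below are hypothetical stand-ins, and the two host-byte values are assumed to match upstream include/scsi/scsi.h.

```c
/* Host-byte values, assumed to match upstream include/scsi/scsi.h. */
#define DID_IMM_RETRY            0x0c
#define DID_TRANSPORT_DISRUPTED  0x0e

/* Hypothetical, simplified stand-ins for the libiscsi session states. */
enum session_state {
    STATE_LOGGED_IN,
    STATE_FAILED,
    STATE_IN_RECOVERY,
};

/*
 * Mirrors the shape of the patched switch: STATE_FAILED has no body of
 * its own, so it falls through to STATE_IN_RECOVERY. Returns the SCSI
 * result word (host byte in bits 16-23), or 0 if the command may proceed.
 */
static int result_for_state(enum session_state state)
{
    switch (state) {
    case STATE_FAILED:          /* falls through after the patch */
    case STATE_IN_RECOVERY:
        return DID_IMM_RETRY << 16;
    default:
        return 0;
    }
}

/* Extracts the host byte, as the kernel's host_byte() macro does. */
static int host_byte_of(int result)
{
    return (result >> 16) & 0xff;
}
```

In this model, host_byte_of(result_for_state(STATE_FAILED)) yields DID_IMM_RETRY, the same as for STATE_IN_RECOVERY; before the patch the failed state would have produced DID_TRANSPORT_DISRUPTED instead.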

HTH,

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)

-- 
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To post to this group, send email to open-is...@googlegroups.com.
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/open-iscsi?hl=en.



RE: detected conn error (1011)

2010-08-31 Thread Goncalo Gomes
Hi Hannes,

Thanks. The Citrix XenServer 5.6 distribution kernel is based on the 2.6.27
tree of SLES 11. We add a few extra patches specific to Xen, dom0 integration
and some backports from upstream. To the best of my knowledge these additions
don't touch the iSCSI layer, so from the iSCSI drivers' point of view I believe
they are as pristine as the ones in the SuSE kernel. That's why we need the
patch rather than the binaries: the binaries will probably mismatch the gcc
version and/or the versioning that we use, e.g.
2.6.27.42-0.1.1.xs5.6.0.44.58xen. I do definitely appreciate your 'forward
thinking' with regards to the issue, though!

Thanks,
 -Goncalo.



-----Original Message-----
From: Hannes Reinecke [mailto:h...@suse.de] 
Sent: 30 August 2010 15:12
To: Goncalo Gomes
Cc: Mike Christie; open-iscsi@googlegroups.com; Shantanu Mehendale
Subject: Re: detected conn error (1011)

Goncalo Gomes wrote:
> Hi,
> 
> On Fri, 2010-08-06 at 15:57 +0100, Hannes Reinecke wrote: 
>> Mike Christie wrote:
>>> ccing Hannes from suse, because this looks like a SLES only bug.
>>>
>>> Hey Hannes,
>>>
>>> The user is using Linux 2.6.27 x86 based on SLES + Xen 3.4 (as dom0)
>>> running a couple of RHEL 5.5 VMs. The underlying storage for these VMs
>>> is iSCSI based via open-iscsi 2.0.870-26.6.1 and a DELL equallogic array.
>>>
>>>
>>> On 08/05/2010 02:21 PM, Goncalo Gomes wrote:
>>>> I've copied both the messages file from the host goncalog140 and the
>>>> patched libiscsi.c. FWIW, I've also included the iscsid.conf. Find these
>>>> files in the link below:
>>>>
>>>> http://promisc.org/iscsi/
>>>>
>>> It looks like this chunk from libiscsi.c:iscsi_queuecommand:
>>>
>>> case ISCSI_STATE_FAILED:
>>>         reason = FAILURE_SESSION_FAILED;
>>>         sc->result = DID_TRANSPORT_DISRUPTED << 16;
>>>         break;
>>>
>>> is causing IO errors.
>>>
>>> You want to use something like DID_IMM_RETRY because it can be a long
>>> time between the time the kernel marks the state as ISCSI_STATE_FAILED
>>> until we start recovery and properly get all the device queues blocked,
>>> so we can exhaust all the retries if we use DID_TRANSPORT_DISRUPTED.
>> Yeah, I noticed.
>> But the problem is that multipathing will stall during this time,
>> ie no failover will occur and I/O will stall. Using DID_TRANSPORT_DISRUPTED
>> will circumvent this and we can failover immediately.
>>
>> Sadly I got additional bugreports about this so I think I'll have
>> to revert it.
> 
> I applied and tested the changes Mike Christie suggests. After the LUN
> is rebalanced within the array I no longer see the IO errors and it
> appears the setup is now resilient to the equallogic LUN failover
> process.
> 
> I'm attaching the log from the dmesg merely for sanity check purposes,
> if anyone cares to take a look?
> 
>> I have put some test kernels at
>>
>> http://beta.suse.com/private/hare/sles11/iscsi
> 
> Do the test kernels in the url above contain the change of
> DID_TRANSPORT_DISRUPTED to DID_IMM_RETRY or is there more to it than
> simply changing the result code? If the latter, would you be able to
> upload the source rpms or a unified patch containing the changes you
> are staging? I'm looking for a more palatable way to test them, given I
> have no SLES box lying around, but will install one if needs be.
> 
Got me confused. How would you test the patch if not on a SLES box?
Presumably you would have to install the new kernel on the instance
you are planning to run the test on. Which for any sane setup would
have to be a SLES box. In which case you can just use the provided
kernel directly and save you the compilation step.

Am I missing something?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)




Re: detected conn error (1011)

2010-08-24 Thread Goncalo Gomes
Hi,

On Fri, 2010-08-06 at 15:57 +0100, Hannes Reinecke wrote: 
> Mike Christie wrote:
> > ccing Hannes from suse, because this looks like a SLES only bug.
> > 
> > Hey Hannes,
> > 
> > The user is using Linux 2.6.27 x86 based on SLES + Xen 3.4 (as dom0)
> > running a couple of RHEL 5.5 VMs. The underlying storage for these VMs
> > is iSCSI based via open-iscsi 2.0.870-26.6.1 and a DELL equallogic array.
> > 
> > 
> > On 08/05/2010 02:21 PM, Goncalo Gomes wrote:
> >> I've copied both the messages file from the host goncalog140 and the
> >> patched libiscsi.c. FWIW, I've also included the iscsid.conf. Find these
> >> files in the link below:
> >>
> >> http://promisc.org/iscsi/
> >>
> > 
> > It looks like this chunk from libiscsi.c:iscsi_queuecommand:
> > 
> > case ISCSI_STATE_FAILED:
> >         reason = FAILURE_SESSION_FAILED;
> >         sc->result = DID_TRANSPORT_DISRUPTED << 16;
> >         break;
> > 
> > is causing IO errors.
> > 
> > You want to use something like DID_IMM_RETRY because it can be a long
> > time between the time the kernel marks the state as ISCSI_STATE_FAILED
> > until we start recovery and properly get all the device queues blocked,
> > so we can exhaust all the retries if we use DID_TRANSPORT_DISRUPTED.
> Yeah, I noticed.
> But the problem is that multipathing will stall during this time,
> ie no failover will occur and I/O will stall. Using DID_TRANSPORT_DISRUPTED
> will circumvent this and we can failover immediately.
> 
> Sadly I got additional bugreports about this so I think I'll have
> to revert it.

I applied and tested the changes Mike Christie suggests. After the LUN
is rebalanced within the array I no longer see the IO errors and it
appears the setup is now resilient to the equallogic LUN failover
process.

I'm attaching the log from the dmesg merely for sanity check purposes,
if anyone cares to take a look?

> I have put some test kernels at
> 
> http://beta.suse.com/private/hare/sles11/iscsi

Do the test kernels in the url above contain the change of
DID_TRANSPORT_DISRUPTED to DID_IMM_RETRY or is there more to it than
simply changing the result code? If the latter, would you be able to
upload the source rpms or a unified patch containing the changes you
are staging? I'm looking for a more palatable way to test them, given I
have no SLES box lying around, but will install one if needs be.

Thanks,
-Goncalo.


device-mapper: multipath: version 1.0.5 loaded
device-mapper: multipath round-robin: version 1.0.0 loaded
device-mapper: table: 251:1: multipath: error getting device
device-mapper: ioctl: error adding target to table
device-mapper: table: 251:1: multipath: error getting device
device-mapper: ioctl: error adding target to table
Citrix Systems, Inc. -- Private Release Kernel
Private File Disclaimer The private files provided to you contain a preliminary 
code fix. These private files have been created and distributed to you to 
address your specific issue and provide Citrix with the feedback that your 
issue has been resolved or to provide further debugging information. These 
private files have had minimal in-house testing with no regression testing and 
may contain defects.  These private file(s) will only be supported until an 
official Hotfix has been provided or one is publicly available from the Citrix 
web site. Any private files that are provided to you are intended only for the 
use of the individual or entity to which this is addressed and distribution of 
these files or utilities is prohibited. CITRIX MAKES NO REPRESENTATIONS OR 
WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR 
PURPOSE WITH RESPECT TO THE PRIVATE FILES.  THE PRIVATE FILES ARE DELIVERED ON 
AN "AS IS" BASIS. YOU SHALL HAVE THE SOLE RESPONSIBILITY FOR ADEQUATE 
PROTECTION AND BACK-UP OF AN
Loading iSCSI transport class v2.0-870.
iscsi: registered transport (tcp)
scsi6 : iSCSI Initiator over TCP/IP
 connection1:0: detected conn error (1011)
scsi 6:0:0:0: Direct-Access EQLOGIC  100E-00  4.3  PQ: 0 ANSI: 5
sd 6:0:0:0: [sdb] 209725440 512-byte hardware sectors: (107 GB/100 GiB)
sd 6:0:0:0: [sdb] Write Protect is off
sd 6:0:0:0: [sdb] Mode Sense: ad 00 00 00
sd 6:0:0:0: [sdb] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
sd 6:0:0:0: [sdb] 209725440 512-byte hardware sectors: (107

RE: detected conn error (1011)

2010-08-06 Thread Goncalo Gomes
Hi Hannes,

Would you be able to send me a unified patch containing the changes included in 
the test kernels so I can rebuild the drivers with them and update you today?

For completeness, we are not running SLES, but rather the Citrix XenServer 5.6 
release which is based off of the Linux 2.6.27 tree of SLES. Also, for this 
specific controller we don't enable MPIO, but in most other arrays we do.

Thanks,
 -Goncalo.

-----Original Message-----
From: Hannes Reinecke [mailto:h...@suse.de] 
Sent: 06 August 2010 15:58
To: Mike Christie
Cc: open-iscsi@googlegroups.com; Goncalo Gomes
Subject: Re: detected conn error (1011)

Mike Christie wrote:
> ccing Hannes from suse, because this looks like a SLES only bug.
> 
> Hey Hannes,
> 
> The user is using Linux 2.6.27 x86 based on SLES + Xen 3.4 (as dom0)
> running a couple of RHEL 5.5 VMs. The underlying storage for these VMs
> is iSCSI based via open-iscsi 2.0.870-26.6.1 and a DELL equallogic array.
> 
> 
> On 08/05/2010 02:21 PM, Goncalo Gomes wrote:
>> I've copied both the messages file from the host goncalog140 and the
>> patched libiscsi.c. FWIW, I've also included the iscsid.conf. Find these
>> files in the link below:
>>
>> http://promisc.org/iscsi/
>>
> 
> It looks like this chunk from libiscsi.c:iscsi_queuecommand:
> 
> case ISCSI_STATE_FAILED:
>         reason = FAILURE_SESSION_FAILED;
>         sc->result = DID_TRANSPORT_DISRUPTED << 16;
>         break;
> 
> is causing IO errors.
> 
> You want to use something like DID_IMM_RETRY because it can be a long
> time between the time the kernel marks the state as ISCSI_STATE_FAILED
> until we start recovery and properly get all the device queues blocked,
> so we can exhaust all the retries if we use DID_TRANSPORT_DISRUPTED.
Yeah, I noticed.
But the problem is that multipathing will stall during this time,
ie no failover will occur and I/O will stall. Using DID_TRANSPORT_DISRUPTED
will circumvent this and we can failover immediately.

Sadly I got additional bugreports about this so I think I'll have
to revert it.

I have put some test kernels at

http://beta.suse.com/private/hare/sles11/iscsi

Can you test with them and check if this issue is solved?

Thanks.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)




Re: detected conn error (1011)

2010-08-05 Thread Goncalo Gomes
On Wed, 2010-08-04 at 21:51 -0500, Mike Christie wrote:
> conn error 1011 is generic. If this is occurring when the eql box is 
> rebalancing luns, it is a little different than above. With the above 
> problem we did not know why we got the error. With your situation we 
> sort of expect this. We should not be getting disk IO errors though.
> 
> When we get the logout request from the target, we send the logout 
> request, then basically handle the cleanup like if we got a connection 
> error. That is why you would see the conn error msg in this path. This 
> also means if this happened to the same IO 5 times, then you would see 
> the disk IO errors (scsi layer only lets us retry disk IO 5 times). But 
> if it just happened once, then the IO should be retried when we log into 
> the new portal and execute like normal.
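
The retry limit described above can be illustrated with a small standalone model. This is a sketch under stated assumptions, not the SCSI midlayer code: struct cmd, should_retry() and failures_survived() are hypothetical names, the budget of 5 comes from the text above, and the exact boundary (whether the error surfaces on the fifth or the sixth failure) is an assumption. It also assumes, per the discussion elsewhere in this thread, that DID_IMM_RETRY does not count against the budget while DID_TRANSPORT_DISRUPTED does.

```c
#include <stdbool.h>

/* Host-byte values, assumed to match upstream include/scsi/scsi.h. */
#define DID_IMM_RETRY            0x0c
#define DID_TRANSPORT_DISRUPTED  0x0e

/* Hypothetical, simplified command: only the retry bookkeeping is modelled. */
struct cmd {
    int retries;   /* retries consumed so far */
    int allowed;   /* retry budget; 5 for disk I/O per the text above */
};

/* Returns true if the command is retried, false if it fails with an I/O error. */
static bool should_retry(struct cmd *c, int host_byte)
{
    if (host_byte == DID_IMM_RETRY)
        return true;                    /* retried without touching the budget */
    /* Any other host byte in this model consumes one retry from the budget. */
    return ++c->retries <= c->allowed;
}

/* Counts how many consecutive failures one command survives, up to `attempts`. */
static int failures_survived(int host_byte, int attempts)
{
    struct cmd c = { .retries = 0, .allowed = 5 };
    int survived = 0;

    for (int i = 0; i < attempts; i++) {
        if (!should_retry(&c, host_byte))
            break;
        survived++;
    }
    return survived;
}
```

In this model a command that keeps completing with DID_TRANSPORT_DISRUPTED runs out of retries after the budget is used up, whereas one completing with DID_IMM_RETRY keeps being retried for as long as the failures continue.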

What would be the best way to identify how many retries have elapsed?

> Or are you using dm-multipath over iscsi? In that case you do not get 
> any retries, so we would expect to see that end_request: I/O error 
> message, but dm-multipath should just be retrying a new path or 
> internally queueing for whatever timeout value you had it use in 
> multipath.conf.

Multipath is not enabled at all. The equallogic array is active/passive
and we only have a view into one controller at any time, so we don't
make use of multipath at present.

> Could you send me the libiscsi.c file you patched?
> 
> Could you also send more of the log for either case? I want to see the 
> iscsid log info and any more of the kernel iscsi log info that you have. 
> I am looking for session recovery timed out messages and/or target 
> requested logout messages.

I've copied both the messages file from the host goncalog140 and the
patched libiscsi.c. FWIW, I've also included the iscsid.conf. Find these
files in the link below:

http://promisc.org/iscsi/

N.B.: the messages file contains spew from other instrumentation tests
(e.g. a dump_stack() call in scsi_transport_iscsi.c::iscsi_conn_error()).
The last set of tests, which I made available yesterday, patches only
libiscsi.c and, IIRC, iscsi_tcp.c; its output can be found around 17:50.

If required I can spin a new set of tests with different instrumentation
and/or collect different information, logs or tcpdumps, if that helps in
any way.

Thanks,
 -Goncalo.




detected conn error (1011)

2010-08-04 Thread Goncalo Gomes
I'm running a setup composed of: Linux 2.6.27 x86 based on SLES + Xen 3.4 (as 
dom0) running a couple of RHEL 5.5 VMs. The underlying storage for these VMs is 
iSCSI based via open-iscsi 2.0.870-26.6.1 and a DELL equallogic array.



Whenever the equallogic rebalances the LUNs between the controllers/ports, it
asks the initiator to log out and log in again to the new port/IP. If the
guests are idle, the following messages show up in the logs:



Aug  3 17:55:08 goncalog140 kernel:  connection1:0: detected conn error (1011)
Aug  3 17:55:09 goncalog140 kernel:  connection1:0: detected conn error (1011)
Aug  3 17:55:10 goncalog140 iscsid: connection1:0 is operational after recovery (1 attempts)



However, if one of the RHEL guests is busy performing IO, we end up having a 
few failed requests as well:



Aug  3 17:55:26 goncalog140 kernel:  connection1:0: dropping R2T itt 55 in recovery.
Aug  3 17:55:26 goncalog140 kernel:  connection1:0: detected conn error (1011)
Aug  3 17:55:26 goncalog140 kernel: sd 6:0:0:0: [sdb] Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK,SUGGEST_OK
Aug  3 17:55:26 goncalog140 kernel: end_request: I/O error, dev sdb, sector 533399
Aug  3 17:55:26 goncalog140 kernel: sd 6:0:0:0: [sdb] Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK,SUGGEST_OK
Aug  3 17:55:26 goncalog140 kernel: end_request: I/O error, dev sdb, sector 533751
Aug  3 17:55:27 goncalog140 kernel:  connection1:0: detected conn error (1011)
Aug  3 17:55:29 goncalog140 iscsid: connection1:0 is operational after recovery (1 attempts)



And as a side effect, the guest filesystem goes read-only. Googling around, 
I've found the following thread on this list which covers the same error I'm 
seeing in the logs:



http://groups.google.com/group/open-iscsi/browse_thread/thread/3a7a5db6e5020423/8e95febb6cf79f64?lnk=gst&q=conn+error#8e95febb6cf79f64



I've also compiled the drivers iscsi_tcp/libiscsi with the patch from Mike 
Christie taken from that thread which can be found in the link below:



http://groups.google.com/group/open-iscsi/attach/db552832995daaa7/trace-conn-error.patch?part=2&view=1



Is this a known issue? Is there anything else from a troubleshooting 
perspective that I could do?



I've uploaded the following files, in case someone would like to take a look:



Tcpdumps collected a couple of days ago in another reproduction/analysis of
the same bug (apologies, but I didn't get around to collecting new tcpdumps
with today's reproduction):



0tcpdump0947.pcap   162K   09:47 (GMT+1)   nothing occurred
1tcpdump0952.pcap   4.8M   09:52 (GMT+2)   problem occurred



Logs from today's reproduction of the issue with the patched drivers for 
additional backtracing:



vm-boot.txt                        2.7K   After VM creation
vm-lun-rebalance-no-effect.txt     3.1K   VM is idling, FS does not become read-only.
vm-lun-rebalance-fs-readonly.txt   3.3K   VM is dd'ing /dev/zero to iscsi based disk, FS becomes read-only.
guest-dmesg.txt                    14K    RHEL 5.3 with 2.6.18-194.8.1.el5xen (RHEL 5.5 kernel)



All these files can be found in the following link:



http://promisc.org/iscsi/



Any help would be greatly appreciated!



Cheers,

 -Goncalo.



