On 08/04/2010 04:12 PM, Goncalo Gomes wrote:
I'm running a setup composed of Linux 2.6.27 x86 based on SLES + Xen 3.4 (as dom0), running a couple of RHEL 5.5 VMs. The underlying storage for these VMs is iSCSI-based, via open-iscsi 2.0.870-26.6.1 and a Dell EqualLogic array.

Whenever the EqualLogic rebalances the LUNs between controllers/ports, it asks the initiator to log out and log back in to the new port/IP. If the guests are idle, the following messages show up in the logs:

Aug  3 17:55:08 goncalog140 kernel:  connection1:0: detected conn error (1011)
Aug  3 17:55:09 goncalog140 kernel:  connection1:0: detected conn error (1011)
Aug  3 17:55:10 goncalog140 iscsid: connection1:0 is operational after recovery (1 attempts)

However, if one of the RHEL guests is busy performing IO, we end up having a few failed requests as well:

Aug  3 17:55:26 goncalog140 kernel:  connection1:0: dropping R2T itt 55 in recovery.
Aug  3 17:55:26 goncalog140 kernel:  connection1:0: detected conn error (1011)
Aug  3 17:55:26 goncalog140 kernel: sd 6:0:0:0: [sdb] Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK,SUGGEST_OK
Aug  3 17:55:26 goncalog140 kernel: end_request: I/O error, dev sdb, sector 533399
Aug  3 17:55:26 goncalog140 kernel: sd 6:0:0:0: [sdb] Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK,SUGGEST_OK
Aug  3 17:55:26 goncalog140 kernel: end_request: I/O error, dev sdb, sector 533751
Aug  3 17:55:27 goncalog140 kernel:  connection1:0: detected conn error (1011)
Aug  3 17:55:29 goncalog140 iscsid: connection1:0 is operational after recovery (1 attempts)

And as a side effect, the guest filesystem goes read-only. Googling around, I found the following thread on this list, which covers the same error I'm seeing in the logs:

http://groups.google.com/group/open-iscsi/browse_thread/thread/3a7a5db6e5020423/8e95febb6cf79f64?lnk=gst&q=conn+error#8e95febb6cf79f64


Conn error 1011 is generic. If this is occurring when the EQL box is rebalancing LUNs, it is a little different from the thread above. With the problem above we did not know why we got the error; with your situation we sort of expect it. We should not be getting disk IO errors, though.

When we get the logout request from the target, we send our logout and then basically handle the cleanup as if we had hit a connection error. That is why you see the conn error message in this path. It also means that if this happened to the same IO 5 times, you would see the disk IO errors (the SCSI layer only lets us retry a disk IO 5 times). But if it happened just once, the IO should be retried when we log in to the new portal and should complete like normal.
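
For reference, that 5-retry limit is a hard-coded constant in the kernel's sd driver rather than something you can tune from userspace. Roughly, from a 2.6.27-era tree (quoting from memory, so treat the exact lines as approximate):

/* drivers/scsi/sd.h: the sd driver's per-command retry budget */
#define SD_MAX_RETRIES  5

/* drivers/scsi/sd.c, when each disk request is prepared; once a command has
   been retried this many times the failure is passed back up, which is where
   the end_request: I/O error messages above come from */
SCpnt->allowed = SD_MAX_RETRIES;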

Or are you using dm-multipath over iSCSI? In that case you do not get any retries, so we would expect to see that end_request: I/O error message, but dm-multipath should just be retrying on another path or queueing internally for whatever timeout you configured in multipath.conf.
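
If you are on dm-multipath, the queueing side of that is controlled by the usual multipath.conf settings. Just as a sketch (values here are only illustrative, not a recommendation for the EqualLogic):

defaults {
        # how often multipathd re-checks paths, in seconds
        polling_interval   5
        # with no usable path, queue IO for up to 12 checker intervals
        # (about 60 seconds here) before failing it back up; "queue" queues
        # forever and "fail" fails immediately
        no_path_retry      12
}

With something like that in place, a short rebalance/relogin window should be absorbed by multipath instead of surfacing as an IO error inside the guest.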
I've also compiled the iscsi_tcp/libiscsi drivers with the patch from Mike Christie from that thread, which can be found at the link below:

http://groups.google.com/group/open-iscsi/attach/db552832995daaa7/trace-conn-error.patch?part=2&view=1
Could you send me the libiscsi.c file you patched?

Could you also send more of the log for either case? I want to see the iscsid log info and any more of the kernel iSCSI log info that you have. I am looking for "session recovery timed out" and/or "target requested logout" messages.
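
If iscsid is not logging much by default, one way to get more detail for a test window (assuming you can stop the packaged daemon for a bit; the init script name below is a guess for SLES) is to run it in the foreground with debugging turned up:

  /etc/init.d/open-iscsi stop          # or however your SLES build stops iscsid
  iscsid -f -d 8 2>&1 | tee /tmp/iscsid-debug.log

Then trigger a rebalance on the array and send that file along with the matching chunk of /var/log/messages.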
