On 08/04/2010 04:12 PM, Goncalo Gomes wrote:
I'm running a setup composed of Linux 2.6.27 x86 based on SLES + Xen 3.4 (as dom0), running a couple of RHEL 5.5 VMs. The underlying storage for these VMs is iSCSI-based, via open-iscsi 2.0.870-26.6.1 and a Dell EqualLogic array.

Whenever the EqualLogic rebalances the LUNs between controllers/ports, it asks the initiator to log out and log back in to the new port/IP. If the guests are idle, the following messages show up in the logs:

Aug  3 17:55:08 goncalog140 kernel:  connection1:0: detected conn error (1011)
Aug  3 17:55:09 goncalog140 kernel:  connection1:0: detected conn error (1011)
Aug  3 17:55:10 goncalog140 iscsid: connection1:0 is operational after recovery (1 attempts)

However, if one of the RHEL guests is busy performing IO, we end up having a few failed requests as well:

Aug  3 17:55:26 goncalog140 kernel:  connection1:0: dropping R2T itt 55 in recovery.
Aug  3 17:55:26 goncalog140 kernel:  connection1:0: detected conn error (1011)
Aug  3 17:55:26 goncalog140 kernel: sd 6:0:0:0: [sdb] Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK,SUGGEST_OK
Aug  3 17:55:26 goncalog140 kernel: end_request: I/O error, dev sdb, sector 533399
Aug  3 17:55:26 goncalog140 kernel: sd 6:0:0:0: [sdb] Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK,SUGGEST_OK
Aug  3 17:55:26 goncalog140 kernel: end_request: I/O error, dev sdb, sector 533751
Aug  3 17:55:27 goncalog140 kernel:  connection1:0: detected conn error (1011)
Aug  3 17:55:29 goncalog140 iscsid: connection1:0 is operational after recovery (1 attempts)

And as a side effect, the guest filesystem goes read-only. Googling around, I found the following thread on this list, which covers the same error I'm seeing in the logs:

http://groups.google.com/group/open-iscsi/browse_thread/thread/3a7a5db6e5020423/8e95febb6cf79f64?lnk=gst&q=conn+error#8e95febb6cf79f64


Conn error 1011 is generic. If this is occurring when the EQL box is rebalancing LUNs, it is a little different from the thread above. With the problem above we did not know why we got the error; with your situation we sort of expect it. We should not be getting disk IO errors, though.

When we get the logout request from the target, we send our logout and then basically handle the cleanup as if we had hit a connection error. That is why you see the conn error message in this path. It also means that if this happened to the same IO 5 times, you would see the disk IO errors (the SCSI layer only lets us retry a disk IO 5 times). But if it happened just once, the IO should be retried when we log in to the new portal and should complete like normal.
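
For reference, that 5-retry limit is a hard-coded constant in the kernel's sd driver rather than something you can tune from userspace. Roughly, from a 2.6.27-era tree (quoting from memory, so treat the exact lines as approximate):

/* drivers/scsi/sd.h: the sd driver's per-command retry budget */
#define SD_MAX_RETRIES  5

/* drivers/scsi/sd.c, when each disk request is prepared; once a command has
   been retried this many times the failure is passed back up, which is where
   the end_request: I/O error messages above come from */
SCpnt->allowed = SD_MAX_RETRIES;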

Or are you using dm-multipath over iSCSI? In that case you do not get any retries, so we would expect to see that end_request: I/O error message, but dm-multipath should just be retrying on another path or queueing internally for whatever timeout you configured in multipath.conf.
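
If you are on dm-multipath, the queueing side of that is controlled by the usual multipath.conf settings. Just as a sketch (values here are only illustrative, not a recommendation for the EqualLogic):

defaults {
        # how often multipathd re-checks paths, in seconds
        polling_interval   5
        # with no usable path, queue IO for up to 12 checker intervals
        # (about 60 seconds here) before failing it back up; "queue" queues
        # forever and "fail" fails immediately
        no_path_retry      12
}

With something like that in place, a short rebalance/relogin window should be absorbed by multipath instead of surfacing as an IO error inside the guest.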
I've also compiled the iscsi_tcp/libiscsi drivers with the patch from Mike Christie from that thread, which can be found at the link below:

http://groups.google.com/group/open-iscsi/attach/db552832995daaa7/trace-conn-error.patch?part=2&view=1
Could you send me the libiscsi.c file you patched?

Could you also send more of the log for either case? I want to see the iscsid log info and any more of the kernel iSCSI log info that you have. I am looking for "session recovery timed out" and/or "target requested logout" messages.
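
If iscsid is not logging much by default, one way to get more detail for a test window (assuming you can stop the packaged daemon for a bit; the init script name below is a guess for SLES) is to run it in the foreground with debugging turned up:

  /etc/init.d/open-iscsi stop          # or however your SLES build stops iscsid
  iscsid -f -d 8 2>&1 | tee /tmp/iscsid-debug.log

Then trigger a rebalance on the array and send that file along with the matching chunk of /var/log/messages.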
