Re: connection1:0 detected conn error (1011) open-iscsi 2.0-871 w/ CentOS 2.6.18-53

2010-07-26 Thread Sean S
Thanks for the patch Mike. Below is the output from a failure when
running with the patch. Any thoughts?

[f8bf0876] iscsi_conn_failure+0x10/0x69 [libiscsi]
[f9bf202d] iscsi_eh_abort+0x2f1/0x406 [libiscsi]
[f885d378] __scsi_try_to_abort_cmd+0x19/0x1a [scsi_mod]
[f885e85d] scsi_error_handler+0x24d/0x422 [scsi_mod]
[c041f7ea] complete+0x2b/0x3d
[f885e610] scsi_error_handler+0x0/0x422 [scsi_mod]
[c0435f65] kthread+0xc0/0xeb
[c0435ea5] kthread+0x0/0xeb
[c0405c3b] kernel_thread_helper+0x7/0x10
===
connection1:0 detected conn error (1011)
[f8bf0876] iscsi_conn_failure+0x10/0x69 [libiscsi]
[f8bf22fc] iscsi_eh_target_reset+0xbb/0x218 [libiscsi]
[c0605967] _spin_lock_bh+0x8/0x18
[f8bf0f78] iscsi_eh_device_reset+0x1c5/0x1cf [libiscsi]
[c054a6dd] get_device+0xe/0x14
[f885d764] scsi_try_host_reset+0x3a/0x99 [scsi_mod]
[f885e0e3] scsi_eh_ready_devs+0x302/0x3e2 [scsi_mod]
[f885e8dd] scsi_error_handler+0x2cd/0x422 [scsi_mod]
[c041f7ea] complete+0x2b/0x3d
[f885e610] scsi_error_handler+0x0/0x422 [scsi_mod]
[c0435f65] kthread+0xc0/0xeb
[c0435ea5] kthread+0x0/0xeb
[c0405c3b] kernel_thread_helper+0x7/0x10
===
session1: session recovery timed out after 400 secs
sd 0:0:0:0: scsi: Device offlined - not ready after error recovery
sd 0:0:0:0: scsi: Device offlined - not ready after error recovery
sd 0:0:0:0: scsi: Device offlined - not ready after error recovery
sd 0:0:0:0: scsi: Device offlined - not ready after error recovery
sd 0:0:0:0: scsi: Device offlined - not ready after error recovery
sd 0:0:0:0: scsi: Device offlined - not ready after error recovery
sd 0:0:0:0: scsi: Device offlined - not ready after error recovery
sd 0:0:0:0: scsi: Device offlined - not ready after error recovery
sd 0:0:0:0: scsi: Device offlined - not ready after error recovery
sd 0:0:0:0: scsi: Device offlined - not ready after error recovery
sd 0:0:0:0: scsi: Device offlined - not ready after error recovery
sd 0:0:0:0: SCSI error: return code = 0x0002
end_request: I/O error, dev sda, sector 14283149

On Jul 13, 10:34 pm, Mike Christie micha...@cs.wisc.edu wrote:
 Could you run with the attached patch? It just prints out a little more
 info. When we get the conn error, it will print out a message if it is
 due to the target dropping the connection, and it will print out a stack
 trace so we can see exactly what piece of code is throwing the error.

 On 07/13/2010 09:33 PM, Sean S wrote:

  Nothing else in the log from iscsid. No mention of a failed reconnect,
  although the only log I'm really able to access after the failure is dmesg.
  Since I'm running root on iscsi, I couldn't get to /var/log/messages,
  which might have been a little more verbose. What sort of network problems

 Yeah, by default the iscsid messages go there. iscsid should be spitting
 out a "cannot connect" message with $some_error_value_or_string that would
 help tell us why we cannot reach the target anymore.

  might cause this? The network in this situation is a simple gigE
  switch with about 3 or 4 systems on it. The target and initiator are
  on the same subnet, nothing fancy. Is there some additional debug
  you'd recommend turning on? Any tips or tricks when running with a
  root iscsi drive?

 Not that I can think of at the iscsi layer.



  Curiously, if I physically disconnect the ethernet cable from the initiator
  while running, all I/O access is correctly paused without returning I/O
  errors. If I then reconnect before the 400s is up, things go back to
  normal. I don't, however, see the "detected conn error (1011)" message
  in this situation. Not sure if that really means anything.

 You should see the conn error 1011 message if:

 1. you have nops on and they time out, which causes us to log that error.

 2. the network layer figures out there is a problem and notifies us. It
 is possible that you pull a cable and plug it back in before the network
 layer throws an error.

 3. there is an iscsi driver or protocol error. In this case we should
 relogin quickly.

  trace-conn-error.patch

-- 
You received this message because you are subscribed to the Google Groups 
open-iscsi group.
To post to this group, send email to open-is...@googlegroups.com.
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/open-iscsi?hl=en.



Re: connection1:0 detected conn error (1011) open-iscsi 2.0-871 w/ CentOS 2.6.18-53

2010-07-26 Thread Mike Christie

On 07/26/2010 04:36 PM, Sean S wrote:

 Thanks for the patch Mike. Below is the output from a failure when
 running with the patch. Any thoughts?

 [f8bf0876] iscsi_conn_failure+0x10/0x69 [libiscsi]
 [f9bf202d] iscsi_eh_abort+0x2f1/0x406 [libiscsi]
 [f885d378] __scsi_try_to_abort_cmd+0x19/0x1a [scsi_mod]
 [f885e85d] scsi_error_handler+0x24d/0x422 [scsi_mod]
 [c041f7ea] complete+0x2b/0x3d
 [f885e610] scsi_error_handler+0x0/0x422 [scsi_mod]
 [c0435f65] kthread+0xc0/0xeb
 [c0435ea5] kthread+0x0/0xeb
 [c0405c3b] kernel_thread_helper+0x7/0x10

Each scsi command has a timeout (see /sys/block/sdX/device/timeout). The
above dump shows that a scsi command timed out. This causes the scsi
layer to have the driver, iscsi_tcp in this case, try to abort the
command. It looks like the abort timed out too, so the iscsi layer
decided to escalate error handling (eh) and failed the iscsi
session/connection.
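A minimal sketch for checking the per-command timeout mentioned above, assuming the usual Linux sysfs layout where /sys/block/<dev>/device/timeout holds the timeout in seconds as a plain integer. The function name and the `sysfs_root` parameter are hypothetical, added only so the function can be pointed at a test directory instead of the live /sys.

```python
from pathlib import Path


def read_scsi_cmd_timeout(dev: str, sysfs_root: str = "/sys") -> int:
    """Return the SCSI command timeout in seconds for a disk such as 'sda'.

    Assumes <sysfs_root>/block/<dev>/device/timeout exists and contains a
    single integer, as on typical Linux systems.
    """
    path = Path(sysfs_root) / "block" / dev / "device" / "timeout"
    return int(path.read_text().strip())
```

On a live initiator, `read_scsi_cmd_timeout("sda")` reads /sys/block/sda/device/timeout. As root, writing a larger integer to that same file gives a slow target more time before the abort and escalation path seen in the trace is triggered.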




 session1: session recovery timed out after 400 secs


The iscsi layer tried to log back in for the full recovery/replacement
timeout (400 seconds here), but could not.
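The recovery/replacement timeout is normally configured in open-iscsi's iscsid.conf. A minimal sketch, assuming the usual `key = value` format and the setting name `node.session.timeo.replacement_timeout`, that extracts it from config text. The parser is simplified (it skips `#` comments and blank lines only), the fallback of 120 seconds is an assumed stock default, and values in iscsid.conf are typically copied into node records at discovery, so the effective value for an existing session may differ.

```python
def parse_iscsid_conf(text: str) -> dict:
    """Parse 'key = value' lines into a dict.

    Blank lines and lines starting with '#' are skipped; everything after
    the first '=' is the value. A simplified sketch, not a full parser.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def replacement_timeout(text: str, default: int = 120) -> int:
    """Return node.session.timeo.replacement_timeout in seconds.

    Falls back to `default` (assumed stock value) if the key is absent.
    """
    value = parse_iscsid_conf(text).get("node.session.timeo.replacement_timeout")
    return int(value) if value is not None else default


sample = """
# Time to wait for session re-establishment before failing SCSI commands
node.session.timeo.replacement_timeout = 400
node.conn[0].timeo.noop_out_interval = 5
"""
print(replacement_timeout(sample))  # 400
```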


Did you see anything from iscsid about why it could not log in? iscsid 
writes to /var/log/messages by default.





 sd 0:0:0:0: scsi: Device offlined - not ready after error recovery



Because the replacement/recovery timeout fired, the iscsi layer decided
it was time to give up and told the scsi layer the disks are not
recoverable, and so we see these messages:




 sd 0:0:0:0: scsi: Device offlined - not ready after error recovery
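Once the scsi layer offlines a disk, its sysfs `state` attribute reflects that. A minimal sketch, assuming /sys/block/<dev>/device/state contains a single word such as `running` or `offline`; the function names and the `sysfs_root` parameter are hypothetical, the latter added only for testing.

```python
from pathlib import Path


def scsi_device_state(dev: str, sysfs_root: str = "/sys") -> str:
    """Return the SCSI device state string, e.g. 'running' or 'offline'."""
    path = Path(sysfs_root) / "block" / dev / "device" / "state"
    return path.read_text().strip()


def is_offlined(dev: str, sysfs_root: str = "/sys") -> bool:
    """True if the scsi layer has taken the device offline."""
    return scsi_device_state(dev, sysfs_root) == "offline"
```

An offlined device does not necessarily come back on its own after a later re-login; root can typically write `running` to the same file to try to bring it back once the path to the target is healthy again.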



Does the session/connection ever re-login? You would see a message in
/var/log/messages like "connection X:Y is operational after recovery (Z
attempts)".
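To answer that from a saved copy of /var/log/messages or dmesg output, a minimal sketch that pulls out conn errors, recovery timeouts and successful re-logins. The regexes assume the message shapes quoted in this thread; exact wording can vary between open-iscsi and kernel versions, so treat the patterns as assumptions.

```python
import re

# Assumed message shapes, taken from the log lines in this thread.
CONN_ERROR = re.compile(r"connection(\d+):(\d+) detected conn error \((\d+)\)")
TIMED_OUT = re.compile(r"session(\d+): session recovery timed out after (\d+) secs")
RECOVERED = re.compile(
    r"connection(\d+):(\d+) is operational after recovery \((\d+) attempts?\)"
)


def summarize(log_text: str) -> dict:
    """Collect conn errors, recovery timeouts and successful re-logins."""
    summary = {"conn_errors": [], "timeouts": [], "recoveries": []}
    for line in log_text.splitlines():
        if m := CONN_ERROR.search(line):
            summary["conn_errors"].append((f"{m[1]}:{m[2]}", int(m[3])))
        elif m := TIMED_OUT.search(line):
            summary["timeouts"].append((int(m[1]), int(m[2])))
        elif m := RECOVERED.search(line):
            summary["recoveries"].append((f"{m[1]}:{m[2]}", int(m[3])))
    return summary


sample = """connection1:0 detected conn error (1011)
session1: session recovery timed out after 400 secs
sd 0:0:0:0: scsi: Device offlined - not ready after error recovery"""
print(summarize(sample))
# {'conn_errors': [('1:0', 1011)], 'timeouts': [(1, 400)], 'recoveries': []}
```

A timeout entry with an empty `recoveries` list suggests the re-login never succeeded within the replacement timeout.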


On the target box check out /var/log/messages. Is the target even up 
still? Did it segfault?

