Re: question regarding offlined device

2009-02-20 Thread Ulrich Windl

On 19 Feb 2009 at 9:48, Jesse Butler wrote:

 
 
 I am trying to troubleshoot why a connection is popping up and down,  
 and finally staying down, with a Linux RHEL 5.2 Open iSCSI / iSER  
 initiator.

Hi!

Maybe that's related: With SLES10 SP1 (x86_64) I have two iSCSI LUNs that are 
reachable over 16 paths each, and the networ connections use two dedicated 
separate VLANs. As it's a test at the moment, there's only one initiator and 
two 
iSCSCI gateways connected to that switch. And there is no traffic (the 
filesystem 
is a RAID1 that is not mounted). Still I'm seeing message like this:

Feb 19 01:39:50 rkdvmso1 iscsid: connection198:0 is operational after recovery (
1 attempts)
Feb 19 01:39:51 rkdvmso1 kernel: device-mapper: dm-multipath: Failing path 8:32.
Feb 19 01:39:51 rkdvmso1 multipathd: sdc: tur checker reports path is down
Feb 19 01:39:51 rkdvmso1 multipathd: checker failed path 8:32 in map EVA-2_iSCSI
-1
Feb 19 01:39:51 rkdvmso1 multipathd: EVA-2_iSCSI-1: remaining active paths: 11
Feb 19 01:39:56 rkdvmso1 multipathd: sdc: tur checker reports path is up
Feb 19 01:39:56 rkdvmso1 multipathd: 8:32: reinstated
Feb 19 01:39:56 rkdvmso1 multipathd: EVA-2_iSCSI-1: remaining active paths: 12
Feb 19 07:49:30 rkdvmso1 iscsid: connection215:0 is operational after recovery (
1 attempts)
Feb 19 07:49:30 rkdvmso1 kernel: device-mapper: dm-multipath: Failing path 65:48
.
Feb 19 07:49:30 rkdvmso1 multipathd: sdt: tur checker reports path is down
Feb 19 07:49:30 rkdvmso1 multipathd: checker failed path 65:48 in map EVA-1_iSCS
I-1
Feb 19 07:49:30 rkdvmso1 multipathd: EVA-1_iSCSI-1: remaining active paths: 11
Feb 19 07:49:35 rkdvmso1 multipathd: sdt: tur checker reports path is up
Feb 19 07:49:35 rkdvmso1 multipathd: 65:48: reinstated
Feb 19 07:49:35 rkdvmso1 multipathd: EVA-1_iSCSI-1: remaining active paths: 12
Feb 19 08:53:15 rkdvmso1 iscsid: connection208:0 is operational after recovery (
1 attempts)
Feb 19 08:53:16 rkdvmso1 kernel: device-mapper: dm-multipath: Failing path 8:192
.
Feb 19 08:53:16 rkdvmso1 multipathd: sdm: tur checker reports path is down
Feb 19 08:53:16 rkdvmso1 multipathd: checker failed path 8:192 in map EVA-2_iSCS
I-1
Feb 19 08:53:16 rkdvmso1 multipathd: EVA-2_iSCSI-1: remaining active paths: 11
Feb 19 08:53:21 rkdvmso1 multipathd: sdm: tur checker reports path is up
Feb 19 08:53:21 rkdvmso1 multipathd: 8:192: reinstated
Feb 19 08:53:21 rkdvmso1 multipathd: EVA-2_iSCSI-1: remaining active paths: 12
Feb 19 14:09:40 rkdvmso1 iscsid: connection211:0 is operational after recovery (
1 attempts)
Feb 19 14:09:41 rkdvmso1 multipathd: sdd: tur checker reports path is down
Feb 19 14:09:41 rkdvmso1 multipathd: checker failed path 8:48 in map EVA-2_iSCSI
-1
Feb 19 14:09:41 rkdvmso1 kernel: device-mapper: dm-multipath: Failing path 8:48.
Feb 19 14:09:46 rkdvmso1 multipathd: EVA-2_iSCSI-1: remaining active paths: 11
Feb 19 14:09:46 rkdvmso1 multipathd: sdd: tur checker reports path is up
Feb 19 14:09:46 rkdvmso1 multipathd: 8:48: reinstated
Feb 19 14:09:46 rkdvmso1 multipathd: EVA-2_iSCSI-1: remaining active paths: 12
Feb 19 19:33:42 rkdvmso1 iscsid: connection220:0 is operational after recovery (
1 attempts)
Feb 19 19:33:43 rkdvmso1 multipathd: sdi: tur checker reports path is down
Feb 19 19:33:43 rkdvmso1 multipathd: checker failed path 8:128 in map EVA-1_iSCS
I-1
Feb 19 19:33:43 rkdvmso1 kernel: device-mapper: dm-multipath: Failing path 8:128
.
Feb 19 19:33:49 rkdvmso1 multipathd: EVA-1_iSCSI-1: remaining active paths: 11
Feb 19 19:33:49 rkdvmso1 multipathd: sdi: tur checker reports path is up
Feb 19 19:33:49 rkdvmso1 multipathd: 8:128: reinstated
Feb 19 19:33:49 rkdvmso1 multipathd: EVA-1_iSCSI-1: remaining active paths: 12
Feb 20 01:43:56 rkdvmso1 iscsid: connection210:0 is operational after recovery (
1 attempts)
Feb 20 01:43:57 rkdvmso1 multipathd: sdc: tur checker reports path is down
Feb 20 01:43:57 rkdvmso1 multipathd: checker failed path 8:32 in map EVA-2_iSCSI
-1
Feb 20 01:43:57 rkdvmso1 kernel: device-mapper: dm-multipath: Failing path 8:32.
Feb 20 01:43:57 rkdvmso1 multipathd: EVA-2_iSCSI-1: remaining active paths: 11
Feb 20 01:44:02 rkdvmso1 multipathd: sdc: tur checker reports path is up
Feb 20 01:44:02 rkdvmso1 multipathd: 8:32: reinstated
Feb 20 01:44:02 rkdvmso1 multipathd: EVA-2_iSCSI-1: remaining active paths: 12
[...]

I have an odd feeling about that.

Regards,
Ulrich


 
 I see various references to host reset, and finally one looks like  
 the following. It says it succeeded, but this time rather than IO  
 continuing, I see the Device offlined - not ready after error  
 recovery.
 
 Do we have any idea what is happening here based upon this console  
 output?  What is host reset meant to do, and can we tell how it  
 failed?
 
 Thanks
 /jb
 
 
 
 Feb  7 11:20:05 nws-bur-25-48 kernel: iser:  
 iscsi_iser_ib_conn_lookup:no conn exists for
 eph Feb  7 11:20:05 nws-bur-25-48 kernel: iser:  
 iser_connect:connecting to: 192.168.1.5, port
   0xbc0c
 Feb  7 

Re: question regarding offlined device

2009-02-20 Thread Konrad Rzeszutek

 Maybe that's related: With SLES10 SP1 (x86_64) I have two iSCSI LUNs that are 
 reachable over 16 paths each, and the networ connections use two dedicated 
 separate VLANs. As it's a test at the moment, there's only one initiator and 
 two 
 iSCSCI gateways connected to that switch. And there is no traffic (the 
 filesystem 
 is a RAID1 that is not mounted). Still I'm seeing message like this:
 
 Feb 19 01:39:50 rkdvmso1 iscsid: connection198:0 is operational after 
 recovery (
 1 attempts)
 Feb 19 07:49:30 rkdvmso1 iscsid: connection215:0 is operational after 
 recovery (
 1 attempts)
 Feb 19 08:53:15 rkdvmso1 iscsid: connection208:0 is operational after 
 recovery (
 1 attempts)
 Feb 19 14:09:40 rkdvmso1 iscsid: connection211:0 is operational after 
 recovery (
 1 attempts)
 Feb 19 19:33:42 rkdvmso1 iscsid: connection220:0 is operational after 
 recovery (
 1 attempts)
 Feb 20 01:43:56 rkdvmso1 iscsid: connection210:0 is operational after 
 recovery (
 1 attempts)

It looks as if each path gets bumped. Is there anything related in the target 
around those times? Did you see a pattern over a long time of which path and 
what
time they are being bumped?

Also, are there any backup being done over that time-frame that would saturate
the switches?

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
open-iscsi group.
To post to this group, send email to open-iscsi@googlegroups.com
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/open-iscsi
-~--~~~~--~~--~--~---



Re: question regarding offlined device

2009-02-20 Thread Ulrich Windl

On 20 Feb 2009 at 10:19, Konrad Rzeszutek wrote:

 
  Maybe that's related: With SLES10 SP1 (x86_64) I have two iSCSI LUNs that 
  are 
  reachable over 16 paths each, and the networ connections use two dedicated 
  separate VLANs. As it's a test at the moment, there's only one initiator 
  and two 
  iSCSCI gateways connected to that switch. And there is no traffic (the 
  filesystem 
  is a RAID1 that is not mounted). Still I'm seeing message like this:
  
  Feb 19 01:39:50 rkdvmso1 iscsid: connection198:0 is operational after 
  recovery (
  1 attempts)
  Feb 19 07:49:30 rkdvmso1 iscsid: connection215:0 is operational after 
  recovery (
  1 attempts)
  Feb 19 08:53:15 rkdvmso1 iscsid: connection208:0 is operational after 
  recovery (
  1 attempts)
  Feb 19 14:09:40 rkdvmso1 iscsid: connection211:0 is operational after 
  recovery (
  1 attempts)
  Feb 19 19:33:42 rkdvmso1 iscsid: connection220:0 is operational after 
  recovery (
  1 attempts)
  Feb 20 01:43:56 rkdvmso1 iscsid: connection210:0 is operational after 
  recovery (
  1 attempts)
 
 It looks as if each path gets bumped. Is there anything related in the target 
 around those times? Did you see a pattern over a long time of which path and 
 what
 time they are being bumped?
 
 Also, are there any backup being done over that time-frame that would saturate
 the switches?

Hi!

I'll have to investigate a bit deeper. If it's not the network, it could be FC 
or 
the target hardware itself. There are some backups actually running, some of 
those 
I'm responsible for, and others I'm not responsible for, but they use the same 
infrastructure, and even others that only use the FC ISL (Inter-Switch-Links) 
for 
data transport. However we were believing that FC guarantees maximum delays and 
no 
packet drops, so I dobt that it's the FC, but who knows? I'll have a deeper 
look, 
comparing it to the backup times I know...

Regards,
Ulrich



 
  



--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
open-iscsi group.
To post to this group, send email to open-iscsi@googlegroups.com
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/open-iscsi
-~--~~~~--~~--~--~---



Re: question regarding offlined device

2009-02-20 Thread Konrad Rzeszutek

On Fri, Feb 20, 2009 at 05:07:49PM +0100, Ulrich Windl wrote:
 
 On 20 Feb 2009 at 10:19, Konrad Rzeszutek wrote:
 
  
   Maybe that's related: With SLES10 SP1 (x86_64) I have two iSCSI LUNs that 
   are 
   reachable over 16 paths each, and the networ connections use two 
   dedicated 
   separate VLANs. As it's a test at the moment, there's only one initiator 
   and two 
   iSCSCI gateways connected to that switch. And there is no traffic (the 
   filesystem 
   is a RAID1 that is not mounted). Still I'm seeing message like this:
   
   Feb 19 01:39:50 rkdvmso1 iscsid: connection198:0 is operational after 
   recovery (
   1 attempts)
   Feb 19 07:49:30 rkdvmso1 iscsid: connection215:0 is operational after 
   recovery (
   1 attempts)
   Feb 19 08:53:15 rkdvmso1 iscsid: connection208:0 is operational after 
   recovery (
   1 attempts)
   Feb 19 14:09:40 rkdvmso1 iscsid: connection211:0 is operational after 
   recovery (
   1 attempts)
   Feb 19 19:33:42 rkdvmso1 iscsid: connection220:0 is operational after 
   recovery (
   1 attempts)
   Feb 20 01:43:56 rkdvmso1 iscsid: connection210:0 is operational after 
   recovery (
   1 attempts)
  
  It looks as if each path gets bumped. Is there anything related in the 
  target 
  around those times? Did you see a pattern over a long time of which path 
  and what
  time they are being bumped?
  
  Also, are there any backup being done over that time-frame that would 
  saturate
  the switches?
 
 Hi!
 
 I'll have to investigate a bit deeper. If it's not the network, it could be 
 FC or 
 the target hardware itself. There are some backups actually running, some of 
 those 

Those are iSCSI errors. How does FC figure in this? Is FC used for backups? If 
so, then
that shouldn't be an issue as the FC and network fabric are seperate.

 I'm responsible for, and others I'm not responsible for, but they use the 
 same 
 infrastructure, and even others that only use the FC ISL (Inter-Switch-Links) 
 for 
 data transport. However we were believing that FC guarantees maximum delays 
 and no 
 packet drops, so I dobt that it's the FC, but who knows? I'll have a deeper 
 look, 

You will get errors when the FC packets get dropped and can't be re-sent.

 comparing it to the backup times I know...
 
 Regards,
 Ulrich
 
 
 
  
   
 
 
 
 

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
open-iscsi group.
To post to this group, send email to open-iscsi@googlegroups.com
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/open-iscsi
-~--~~~~--~~--~--~---



Re: question regarding offlined device

2009-02-20 Thread Jesse Butler

thanks very much, mike.  gets me onto the right path, i think...

On Feb 19, 2009, at 12:19 PM, Mike Christie wrote:


 /iJesse Butler wrote:

 I am trying to troubleshoot why a connection is popping up and down,
 and finally staying down, with a Linux RHEL 5.2 Open iSCSI / iSER
 initiator.

 I see various references to host reset, and finally one looks like
 the following. It says it succeeded, but this time rather than IO
 continuing, I see the Device offlined - not ready after error
 recovery.

 Do we have any idea what is happening here based upon this console
 output?  What is host reset meant to do, and can we tell how it
 failed?


 Each scsi command has a timeout. You can see it in
 /sys/block/sdX/device/timeout. If a command does not complete with  
 that
 time, the scsi layer fires it's error handler, which basically asks  
 the
 driver to:

 1. abort the taask.
 2. if 1 fails, reset the lu
 3. if 2 fails, reset the bus (iscsi does not do this).
 4. if 3 fails, reset the host.

 (in newer kernels there is a 2.5 where you can reset the target).

 Software iscsi has a weird implementation where it does a host per
 session, and for the host reset we just logout the session and try to
 log in. We should to a target reset, but we do not currently due to  
 bugs.

 If we get to #4 and that fails then the scsi layer will offline the  
 devices.

 If any of 1-4 is successful in fixing the problem, the scsi layer will
 send some commands to test it out. It will normally send a TUR. If
 eventually get to #4 and the reset succeeds but the TUR fails, then
 devices will be offlined.

 So for some reason

 1. commands are taking a long time and are timing out. I think the
 default in RHEL is 60 seconds.

 2. For some reason the transport seems fine. We can login and out.

 3. For some reason the TUR to test the reset is failing.

 If you do not have a scsi disk you can enable lots of scsi layer
 debugging by doing

 echo -1   /proc/sys/dev/scsi/logging_level

 if you have other scsi or data disks in the system you probably want
 less debugging or it will be a mess.

 


--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
open-iscsi group.
To post to this group, send email to open-iscsi@googlegroups.com
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/open-iscsi
-~--~~~~--~~--~--~---



question regarding offlined device

2009-02-19 Thread Jesse Butler


I am trying to troubleshoot why a connection is popping up and down,  
and finally staying down, with a Linux RHEL 5.2 Open iSCSI / iSER  
initiator.

I see various references to host reset, and finally one looks like  
the following. It says it succeeded, but this time rather than IO  
continuing, I see the Device offlined - not ready after error  
recovery.

Do we have any idea what is happening here based upon this console  
output?  What is host reset meant to do, and can we tell how it  
failed?

Thanks
/jb



Feb  7 11:20:05 nws-bur-25-48 kernel: iser:  
iscsi_iser_ib_conn_lookup:no conn exists for
eph Feb  7 11:20:05 nws-bur-25-48 kernel: iser:  
iser_connect:connecting to: 192.168.1.5, port
  0xbc0c
Feb  7 11:20:05 nws-bur-25-48 kernel: iser: iser_cma_handler:event 0  
conn f6f5f640 id f3a81c00
Feb  7 11:20:05 nws-bur-25-48 kernel: iser: iser_cma_handler:event 2  
conn f6f5f640 id f3a81c00
Feb  7 11:20:05 nws-bur-25-48 kernel: iser:  
iser_create_ib_conn_res:setting conn f6f5f640
  cma_id f3a81c00: fmr_pool f6f5f740 qp f5d22380
Feb  7 11:20:06 nws-bur-25-48 kernel: iser: iscsi_iser_ep_poll:ib conn  
f6f5f640 rc = 0
Feb  7 11:20:06 nws-bur-25-48 kernel: iser: iser_cma_handler:event 9  
conn f6f5f640 id f3a
81c00
Feb  7 11:20:06 nws-bur-25-48 kernel: iser: iscsi_iser_ep_poll:ib conn  
f6f5f640 rc = 1
Feb  7 11:20:06 nws-bur-25-48 kernel: iser:  
iscsi_iser_conn_bind:binding iscsi conn f34a8
16c to iser_conn f6f5f640
Feb  7 11:20:09 nws-bur-25-48 kernel:  session1: host reset succeeded
Feb  7 11:20:10 nws-bur-25-48 iscsid: connection1:0 is operational  
after recovery (1 atte
mpts)
Feb  7 11:20:29 nws-bur-25-48 kernel: sd 9:0:0:0: scsi: Device  
offlined - not ready after
  error recovery
Feb  7 11:20:29 nws-bur-25-48 kernel: sd 9:0:0:0: SCSI error: return  
code = 0x0002
Feb  7 11:20:29 nws-bur-25-48 kernel: end_request: I/O error, dev sdc,  
sector 1058848
Feb  7 11:20:29 nws-bur-25-48 kernel: sd 9:0:0:0: rejecting I/O to  
offline device
Feb  7 11:20:29 nws-bur-25-48 last message repeated 4 times
Feb  7 11:20:29 nws-bur-25-48 kernel: sd 9:0:0:0: SCSI error: return  
code = 0x0001
Feb  7 11:20:29 nws-bur-25-48 kernel: end_request: I/O error, dev sdc,  
sector 1056800
Feb  7 11:20:29 nws-bur-25-48 kernel: sd 9:0:0:0: rejecting I/O to  
offline device
Feb  7 11:20:31 nws-bur-25-48 last message repeated 370 times
Feb  7 19:31:43 nws-bur-25-48 restorecond: Will not restore a file  
with more than one
hard linkd f3a81c00
Feb  7 11:20:05 nws-bur-25-48 kernel: iser:d 9:0:0:0: rejecting I/O to  
offline device
Feb  7 11:20:29 nws-bur-25-48 last message repeated 4 times
Feb  7 11:20:29 nws-bur-25-48 kernel: sd 9:0:0:0: SCSI error: return  
code =
0x0001
Fe f6f5f640 id f3a81c00
Feb  7 11:20:06 nws-bur-25-48 kernel: iser: iscsi_iser_ep_poll:ib conn  
f6f5f640 rc =
1
Feb  7 11:20:06 9ws-bur-25-48 kernel: iser:  
iscsi_iser_conn_bind:binding iscsi conn
f34a816c to iser_conn f6f5f640
Feb  7 11:20:09 nws-bur-25r-onn f34a816c to iser_conn f6f5f640
Feb  7 11:20:09 nws-bur-25 hur-25-48 iscsid:05 nws-bur-25-48 kernel:  
iser(/etc/resolv.conf) Invalid argument


--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
open-iscsi group.
To post to this group, send email to open-iscsi@googlegroups.com
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/open-iscsi
-~--~~~~--~~--~--~---



Re: question regarding offlined device

2009-02-19 Thread Mike Christie

/iJesse Butler wrote:
 
 I am trying to troubleshoot why a connection is popping up and down,  
 and finally staying down, with a Linux RHEL 5.2 Open iSCSI / iSER  
 initiator.
 
 I see various references to host reset, and finally one looks like  
 the following. It says it succeeded, but this time rather than IO  
 continuing, I see the Device offlined - not ready after error  
 recovery.
 
 Do we have any idea what is happening here based upon this console  
 output?  What is host reset meant to do, and can we tell how it  
 failed?
 

Each scsi command has a timeout. You can see it in 
/sys/block/sdX/device/timeout. If a command does not complete with that 
time, the scsi layer fires it's error handler, which basically asks the 
driver to:

1. abort the taask.
2. if 1 fails, reset the lu
3. if 2 fails, reset the bus (iscsi does not do this).
4. if 3 fails, reset the host.

(in newer kernels there is a 2.5 where you can reset the target).

Software iscsi has a weird implementation where it does a host per 
session, and for the host reset we just logout the session and try to 
log in. We should to a target reset, but we do not currently due to bugs.

If we get to #4 and that fails then the scsi layer will offline the devices.

If any of 1-4 is successful in fixing the problem, the scsi layer will 
send some commands to test it out. It will normally send a TUR. If 
eventually get to #4 and the reset succeeds but the TUR fails, then 
devices will be offlined.

So for some reason

1. commands are taking a long time and are timing out. I think the 
default in RHEL is 60 seconds.

2. For some reason the transport seems fine. We can login and out.

3. For some reason the TUR to test the reset is failing.

If you do not have a scsi disk you can enable lots of scsi layer 
debugging by doing

echo -1   /proc/sys/dev/scsi/logging_level

if you have other scsi or data disks in the system you probably want 
less debugging or it will be a mess.

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
open-iscsi group.
To post to this group, send email to open-iscsi@googlegroups.com
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/open-iscsi
-~--~~~~--~~--~--~---