Re: question regarding offlined device
On 19 Feb 2009 at 9:48, Jesse Butler wrote:

> I am trying to troubleshoot why a connection is popping up and down, and
> finally staying down, with a Linux RHEL 5.2 Open-iSCSI / iSER initiator.

Hi!

Maybe that's related: with SLES10 SP1 (x86_64) I have two iSCSI LUNs that are reachable over 16 paths each, and the network connections use two dedicated, separate VLANs. As it's a test at the moment, there's only one initiator and two iSCSI gateways connected to that switch. And there is no traffic (the filesystem is a RAID1 that is not mounted). Still I'm seeing messages like this:

Feb 19 01:39:50 rkdvmso1 iscsid: connection198:0 is operational after recovery (1 attempts)
Feb 19 01:39:51 rkdvmso1 kernel: device-mapper: dm-multipath: Failing path 8:32.
Feb 19 01:39:51 rkdvmso1 multipathd: sdc: tur checker reports path is down
Feb 19 01:39:51 rkdvmso1 multipathd: checker failed path 8:32 in map EVA-2_iSCSI-1
Feb 19 01:39:51 rkdvmso1 multipathd: EVA-2_iSCSI-1: remaining active paths: 11
Feb 19 01:39:56 rkdvmso1 multipathd: sdc: tur checker reports path is up
Feb 19 01:39:56 rkdvmso1 multipathd: 8:32: reinstated
Feb 19 01:39:56 rkdvmso1 multipathd: EVA-2_iSCSI-1: remaining active paths: 12
Feb 19 07:49:30 rkdvmso1 iscsid: connection215:0 is operational after recovery (1 attempts)
Feb 19 07:49:30 rkdvmso1 kernel: device-mapper: dm-multipath: Failing path 65:48.
Feb 19 07:49:30 rkdvmso1 multipathd: sdt: tur checker reports path is down
Feb 19 07:49:30 rkdvmso1 multipathd: checker failed path 65:48 in map EVA-1_iSCSI-1
Feb 19 07:49:30 rkdvmso1 multipathd: EVA-1_iSCSI-1: remaining active paths: 11
Feb 19 07:49:35 rkdvmso1 multipathd: sdt: tur checker reports path is up
Feb 19 07:49:35 rkdvmso1 multipathd: 65:48: reinstated
Feb 19 07:49:35 rkdvmso1 multipathd: EVA-1_iSCSI-1: remaining active paths: 12
Feb 19 08:53:15 rkdvmso1 iscsid: connection208:0 is operational after recovery (1 attempts)
Feb 19 08:53:16 rkdvmso1 kernel: device-mapper: dm-multipath: Failing path 8:192.
Feb 19 08:53:16 rkdvmso1 multipathd: sdm: tur checker reports path is down
Feb 19 08:53:16 rkdvmso1 multipathd: checker failed path 8:192 in map EVA-2_iSCSI-1
Feb 19 08:53:16 rkdvmso1 multipathd: EVA-2_iSCSI-1: remaining active paths: 11
Feb 19 08:53:21 rkdvmso1 multipathd: sdm: tur checker reports path is up
Feb 19 08:53:21 rkdvmso1 multipathd: 8:192: reinstated
Feb 19 08:53:21 rkdvmso1 multipathd: EVA-2_iSCSI-1: remaining active paths: 12
Feb 19 14:09:40 rkdvmso1 iscsid: connection211:0 is operational after recovery (1 attempts)
Feb 19 14:09:41 rkdvmso1 multipathd: sdd: tur checker reports path is down
Feb 19 14:09:41 rkdvmso1 multipathd: checker failed path 8:48 in map EVA-2_iSCSI-1
Feb 19 14:09:41 rkdvmso1 kernel: device-mapper: dm-multipath: Failing path 8:48.
Feb 19 14:09:46 rkdvmso1 multipathd: EVA-2_iSCSI-1: remaining active paths: 11
Feb 19 14:09:46 rkdvmso1 multipathd: sdd: tur checker reports path is up
Feb 19 14:09:46 rkdvmso1 multipathd: 8:48: reinstated
Feb 19 14:09:46 rkdvmso1 multipathd: EVA-2_iSCSI-1: remaining active paths: 12
Feb 19 19:33:42 rkdvmso1 iscsid: connection220:0 is operational after recovery (1 attempts)
Feb 19 19:33:43 rkdvmso1 multipathd: sdi: tur checker reports path is down
Feb 19 19:33:43 rkdvmso1 multipathd: checker failed path 8:128 in map EVA-1_iSCSI-1
Feb 19 19:33:43 rkdvmso1 kernel: device-mapper: dm-multipath: Failing path 8:128.
Feb 19 19:33:49 rkdvmso1 multipathd: EVA-1_iSCSI-1: remaining active paths: 11
Feb 19 19:33:49 rkdvmso1 multipathd: sdi: tur checker reports path is up
Feb 19 19:33:49 rkdvmso1 multipathd: 8:128: reinstated
Feb 19 19:33:49 rkdvmso1 multipathd: EVA-1_iSCSI-1: remaining active paths: 12
Feb 20 01:43:56 rkdvmso1 iscsid: connection210:0 is operational after recovery (1 attempts)
Feb 20 01:43:57 rkdvmso1 multipathd: sdc: tur checker reports path is down
Feb 20 01:43:57 rkdvmso1 multipathd: checker failed path 8:32 in map EVA-2_iSCSI-1
Feb 20 01:43:57 rkdvmso1 kernel: device-mapper: dm-multipath: Failing path 8:32.
Feb 20 01:43:57 rkdvmso1 multipathd: EVA-2_iSCSI-1: remaining active paths: 11
Feb 20 01:44:02 rkdvmso1 multipathd: sdc: tur checker reports path is up
Feb 20 01:44:02 rkdvmso1 multipathd: 8:32: reinstated
Feb 20 01:44:02 rkdvmso1 multipathd: EVA-2_iSCSI-1: remaining active paths: 12
[...]

I have an odd feeling about that.

Regards,
Ulrich

> I see various references to host reset, and finally one looks like the
> following. It says it succeeded, but this time rather than IO continuing,
> I see the "Device offlined - not ready after error recovery". Do we have
> any idea what is happening here based upon this console output? What is
> host reset meant to do, and can we tell how it failed?
>
> Thanks
> /jb
>
> Feb 7 11:20:05 nws-bur-25-48 kernel: iser: iscsi_iser_ib_conn_lookup:no conn exists for eph
> Feb 7 11:20:05 nws-bur-25-48 kernel: iser: iser_connect:connecting to: 192.168.1.5, port 0xbc0c
> [...]
Re: question regarding offlined device
> Maybe that's related: with SLES10 SP1 (x86_64) I have two iSCSI LUNs that
> are reachable over 16 paths each, and the network connections use two
> dedicated, separate VLANs. As it's a test at the moment, there's only one
> initiator and two iSCSI gateways connected to that switch. And there is no
> traffic (the filesystem is a RAID1 that is not mounted). Still I'm seeing
> messages like this:
>
> Feb 19 01:39:50 rkdvmso1 iscsid: connection198:0 is operational after recovery (1 attempts)
> Feb 19 07:49:30 rkdvmso1 iscsid: connection215:0 is operational after recovery (1 attempts)
> Feb 19 08:53:15 rkdvmso1 iscsid: connection208:0 is operational after recovery (1 attempts)
> Feb 19 14:09:40 rkdvmso1 iscsid: connection211:0 is operational after recovery (1 attempts)
> Feb 19 19:33:42 rkdvmso1 iscsid: connection220:0 is operational after recovery (1 attempts)
> Feb 20 01:43:56 rkdvmso1 iscsid: connection210:0 is operational after recovery (1 attempts)

It looks as if each path gets bumped. Is there anything related in the target logs around those times? Did you see a pattern, over a longer period, in which paths are bumped and at what times? Also, are there any backups being run over that time frame that would saturate the switches?

You received this message because you are subscribed to the Google Groups "open-iscsi" group.
To post to this group, send email to open-iscsi@googlegroups.com
To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/open-iscsi
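The pattern question above can be answered mechanically. A minimal sketch, assuming the standard syslog format shown in the excerpts (the helper name `flap_summary` is mine, not an existing tool): it tabulates how often each iSCSI connection was bounced and when it last happened.

```shell
# Sketch: count iscsid recovery events per connection from syslog text on
# stdin, e.g.  flap_summary < /var/log/messages
# Message format taken from the log excerpts above.
flap_summary() {
    grep 'iscsid: connection.*operational after recovery' |
    awk '{
        conn = $6                      # e.g. "connection198:0"
        count[conn]++
        last[conn] = $1 " " $2 " " $3  # month, day, time of latest bump
    }
    END {
        for (c in count)
            printf "%s bumped %d time(s), last at %s\n", c, count[c], last[c]
    }'
}
```

Cross-referencing the reported times against known backup windows would then show whether the bumps line up with switch load.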
Re: question regarding offlined device
On 20 Feb 2009 at 10:19, Konrad Rzeszutek wrote:

> [...]
> It looks as if each path gets bumped. Is there anything related in the
> target logs around those times? Did you see a pattern, over a longer
> period, in which paths are bumped and at what times? Also, are there any
> backups being run over that time frame that would saturate the switches?

Hi!

I'll have to investigate a bit deeper. If it's not the network, it could be FC or the target hardware itself. There are some backups actually running: some I'm responsible for, others I'm not responsible for but they use the same infrastructure, and still others that only use the FC ISLs (Inter-Switch Links) for data transport. However, we believed that FC guarantees bounded delays and no packet drops, so I doubt that it's the FC, but who knows? I'll have a deeper look, comparing it to the backup times I know...

Regards,
Ulrich
Re: question regarding offlined device
On Fri, Feb 20, 2009 at 05:07:49PM +0100, Ulrich Windl wrote:

> [...]
> I'll have to investigate a bit deeper. If it's not the network, it could
> be FC or the target hardware itself. There are some backups actually
> running, some of those

Those are iSCSI errors. How does FC figure in this? Is FC used for the backups? If so, that shouldn't be an issue, as the FC and network fabrics are separate.

> I'm responsible for, and others I'm not responsible for, but they use the
> same infrastructure, and even others that only use the FC ISLs
> (Inter-Switch Links) for data transport. However, we believed that FC
> guarantees bounded delays and no packet drops, so I doubt that it's the
> FC, but who knows?
> I'll have a deeper look,

You will get errors when the FC packets get dropped and can't be re-sent.

> comparing it to the backup times I know...
>
> Regards,
> Ulrich
Re: question regarding offlined device
Thanks very much, Mike. Gets me onto the right path, I think...

On Feb 19, 2009, at 12:19 PM, Mike Christie wrote:

> [...]
question regarding offlined device
I am trying to troubleshoot why a connection is popping up and down, and finally staying down, with a Linux RHEL 5.2 Open-iSCSI / iSER initiator.

I see various references to host reset, and finally one looks like the following. It says it succeeded, but this time rather than IO continuing, I see the "Device offlined - not ready after error recovery". Do we have any idea what is happening here based upon this console output? What is host reset meant to do, and can we tell how it failed?

Thanks
/jb

Feb 7 11:20:05 nws-bur-25-48 kernel: iser: iscsi_iser_ib_conn_lookup:no conn exists for eph
Feb 7 11:20:05 nws-bur-25-48 kernel: iser: iser_connect:connecting to: 192.168.1.5, port 0xbc0c
Feb 7 11:20:05 nws-bur-25-48 kernel: iser: iser_cma_handler:event 0 conn f6f5f640 id f3a81c00
Feb 7 11:20:05 nws-bur-25-48 kernel: iser: iser_cma_handler:event 2 conn f6f5f640 id f3a81c00
Feb 7 11:20:05 nws-bur-25-48 kernel: iser: iser_create_ib_conn_res:setting conn f6f5f640 cma_id f3a81c00: fmr_pool f6f5f740 qp f5d22380
Feb 7 11:20:06 nws-bur-25-48 kernel: iser: iscsi_iser_ep_poll:ib conn f6f5f640 rc = 0
Feb 7 11:20:06 nws-bur-25-48 kernel: iser: iser_cma_handler:event 9 conn f6f5f640 id f3a81c00
Feb 7 11:20:06 nws-bur-25-48 kernel: iser: iscsi_iser_ep_poll:ib conn f6f5f640 rc = 1
Feb 7 11:20:06 nws-bur-25-48 kernel: iser: iscsi_iser_conn_bind:binding iscsi conn f34a816c to iser_conn f6f5f640
Feb 7 11:20:09 nws-bur-25-48 kernel: session1: host reset succeeded
Feb 7 11:20:10 nws-bur-25-48 iscsid: connection1:0 is operational after recovery (1 attempts)
Feb 7 11:20:29 nws-bur-25-48 kernel: sd 9:0:0:0: scsi: Device offlined - not ready after error recovery
Feb 7 11:20:29 nws-bur-25-48 kernel: sd 9:0:0:0: SCSI error: return code = 0x0002
Feb 7 11:20:29 nws-bur-25-48 kernel: end_request: I/O error, dev sdc, sector 1058848
Feb 7 11:20:29 nws-bur-25-48 kernel: sd 9:0:0:0: rejecting I/O to offline device
Feb 7 11:20:29 nws-bur-25-48 last message repeated 4 times
Feb 7 11:20:29 nws-bur-25-48 kernel: sd 9:0:0:0: SCSI error: return code = 0x0001
Feb 7 11:20:29 nws-bur-25-48 kernel: end_request: I/O error, dev sdc, sector 1056800
Feb 7 11:20:29 nws-bur-25-48 kernel: sd 9:0:0:0: rejecting I/O to offline device
Feb 7 11:20:31 nws-bur-25-48 last message repeated 370 times
Feb 7 19:31:43 nws-bur-25-48 restorecond: Will not restore a file with more than one hard link (/etc/resolv.conf) Invalid argument
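Once the SCSI layer offlines a device ("rejecting I/O to offline device"), it stays offline even after the session recovers. One way to ask the kernel to try the device again is the sysfs `state` attribute. A minimal sketch, assuming root privileges and the device name from the log (`sdc`); the helper name is mine:

```shell
# Sketch: put an offlined SCSI device back into the "running" state via
# sysfs, once the underlying iSCSI session is healthy again.
revive_scsi_device() {
    dev=$1
    state_file="/sys/block/$dev/device/state"
    if [ -w "$state_file" ]; then
        echo running > "$state_file"   # ask the SCSI layer to retry the device
        cat "$state_file"              # report the new state
    else
        echo "cannot write $state_file (no such device, or not root)" >&2
        return 1
    fi
}

# Example, using the device from the log above:
# revive_scsi_device sdc
```

This only helps after the underlying problem is gone; if the transport is still unhealthy, the device will be offlined again on the next failed command.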
Re: question regarding offlined device
Jesse Butler wrote:

> I am trying to troubleshoot why a connection is popping up and down, and
> finally staying down, with a Linux RHEL 5.2 Open-iSCSI / iSER initiator.
>
> I see various references to host reset, and finally one looks like the
> following. It says it succeeded, but this time rather than IO continuing,
> I see the "Device offlined - not ready after error recovery". Do we have
> any idea what is happening here based upon this console output? What is
> host reset meant to do, and can we tell how it failed?

Each SCSI command has a timeout. You can see it in /sys/block/sdX/device/timeout. If a command does not complete within that time, the SCSI layer fires its error handler, which basically asks the driver to:

1. abort the task.
2. if 1 fails, reset the LU.
3. if 2 fails, reset the bus (iscsi does not do this).
4. if 3 fails, reset the host.

(In newer kernels there is a step 2.5 where you can reset the target.)

Software iscsi has a weird implementation where it does a host per session, and for the host reset we just log out the session and try to log in again. We should do a target reset, but we do not currently due to bugs.

If we get to #4 and that fails, then the SCSI layer will offline the devices. If any of 1-4 is successful in fixing the problem, the SCSI layer will send some commands to test it out; it will normally send a TUR (TEST UNIT READY). If we eventually get to #4 and the reset succeeds but the TUR fails, then the devices will be offlined.

So, for some reason:

1. Commands are taking a long time and are timing out. I think the default in RHEL is 60 seconds.
2. The transport seems fine; we can log in and out.
3. The TUR used to test the reset is failing.

If you do not have a SCSI disk you can enable lots of SCSI-layer debugging by doing

echo -1 > /proc/sys/dev/scsi/logging_level

If you have other SCSI or data disks in the system you probably want less debugging or it will be a mess.
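Mike's first point, the per-device command timeout, can be checked across all disks in one pass. A minimal sketch: the sysfs and procfs paths are the ones named in his message, while the helper name is mine.

```shell
# Sketch: list the per-device SCSI command timeout (in seconds) for every
# sd* block device, as described in the message above.
show_scsi_timeouts() {
    for t in /sys/block/sd*/device/timeout; do
        [ -e "$t" ] && printf '%s: %s s\n' "$t" "$(cat "$t")"
    done
    return 0    # no sd* devices present is not treated as an error
}

show_scsi_timeouts

# To crank SCSI-layer logging to maximum (root only; very noisy if other
# SCSI disks are present), then restore the default afterwards:
#   echo -1 > /proc/sys/dev/scsi/logging_level
#   echo 0  > /proc/sys/dev/scsi/logging_level
```

Raising the timeout (e.g. writing a larger value back to the same sysfs file) can mask transient transport stalls, but it does not fix whatever is making commands stall in the first place.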