Device not ready after error recovery?
Can anyone tell me why the SCSI layer says the device is not ready when iscsiadm reports it is logged in? Can I manually online the device? How should I recover from here? Is this a known problem, and has it been fixed in newer open-iscsi versions?

Mar 18 18:21:33 eq1-vz2 kernel: connection1:0: detected conn error (1011)
Mar 18 18:21:36 eq1-vz2 kernel: session1: host reset succeeded
Mar 18 18:22:16 eq1-vz2 kernel: sd 6:0:0:0: scsi: Device offlined - not ready after error recovery
Mar 18 18:22:16 eq1-vz2 kernel: sd 6:0:0:0: scsi: Device offlined - not ready after error recovery
Mar 18 18:22:16 eq1-vz2 kernel: sd 6:0:0:0: SCSI error: return code = 0x0002
Mar 18 18:22:16 eq1-vz2 kernel: end_request: I/O error, dev sdc, sector 523643177
Mar 18 18:22:16 eq1-vz2 kernel: device-mapper: multipath: Failing path 8:32.
Mar 18 18:22:16 eq1-vz2 kernel: sd 6:0:0:0: SCSI error: return code = 0x0001
Mar 18 18:22:16 eq1-vz2 kernel: end_request: I/O error, dev sdc, sector 552260889
... snip - more I/O error messages ...
$ sudo iscsiadm -m session -P3
iSCSI Transport Class version 2.0-869
iscsiadm version 2.0-869
Target: iqn.1986-03.com.sun:02:271d5722-0206-6ad0-fe1f-d44007068ec4
        Current Portal: 10.0.15.0:3260,1
        Persistent Portal: 10.0.15.0:3260,1
        Interface:
        Iface Name: iface.bond0
        Iface Transport: tcp
        Iface Initiatorname: iqn.2005-03.com.equest:eq1-vz2
        Iface IPaddress: 10.0.10.1
        Iface HWaddress: default
        Iface Netdev: bond0
        SID: 1
        iSCSI Connection State: LOGGED IN
        iSCSI Session State: LOGGED_IN
        Internal iscsid Session State: NO CHANGE
        Negotiated iSCSI params:
        HeaderDigest: None
        DataDigest: None
        MaxRecvDataSegmentLength: 131072
        MaxXmitDataSegmentLength: 131072
        FirstBurstLength: 262144
        MaxBurstLength: 16776192
        ImmediateData: Yes
        InitialR2T: Yes
        MaxOutstandingR2T: 1
        Attached SCSI devices:
        Host Number: 6 State: running
        scsi6 Channel 00 Id 0 Lun: 0
        Attached scsi disk sdc State: offline

201151723430d2a0048d003dddm-3 SUN,SOLARIS
[size=300G][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=0][enabled]
 \_ 6:0:0:0 sdc 8:32 [failed][faulty]

$ cat /etc/multipath.conf
defaults {
        default_features "1 queue_if_no_path"
}
devnode_blacklist {
        devnode "^hd[a-z]$"
        devnode "^sd[ab]$"
        devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
        devnode "^cciss!c[0-9]d[0-9]*[p[0-9]*]"
}

Kernel is custom compiled from 2.6.18 source on Debian 4.0:

$ uname -a
Linux eq1-vz2 2.6.18-prep-92.1.1.el5.028stab057.2-ovz #1 SMP Mon Aug 25 16:43:00 MDT 2008 x86_64 GNU/Linux

The open-iscsi tools and module were compiled by hand as well.

You received this message because you are subscribed to the Google Groups open-iscsi group.
To post to this group, send email to open-iscsi@googlegroups.com
To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/open-iscsi
Re: About Intel I/OAT support in open-iscsi
Mike Christie wrote:
> I am not sure what you are asking. Did you look at the patches?

I gave it a quick look before asking. I looked again now and I think I understand it better - let's see if I'm headed in the correct direction:

The I/OAT-related kernel code serves the TCP stack by copying data from the skb to the consumer buffer. For user-space consumers this is fairly simple and done behind the kernel covers. For kernel consumers such as iscsi, things get a bit more complex: there is a richer API, and some flavors of it may not be able to use the DMA functionality - or, the other way around, the DMA functionality has to be exported to/used by the network stack consumer - namely iscsi, nfs, gfs, etc. So this patch set makes changes both in the network internals/API and in the iscsi logic.

Or.
Re: RHEL 5.2 and 5.3 - ISCSI Errors impacting database performance?
Thanks Mike...

On Mar 18, 5:45 pm, Mike Christie micha...@cs.wisc.edu wrote:
> bigcatxjs wrote:
>> Hi,
>> We have encountered this error below. This is the first time I have seen this before;
>
> This is with the noop settings set to 0, right? Was this the RHEL 5.3 or 5.2 setup?

It is our RHEL 5.3 host.

> Could you do
> rpm -q iscsi-initiator-utils

Sure...

$ rpm -q iscsi-initiator-utils
iscsi-initiator-utils-6.2.0.868-0.18.el5

Mar 17 12:40:47 MYHOST53 kernel: Vendor: DataCore Model: SANmelody Rev: DCS
Mar 17 12:40:47 MYHOST53 kernel: Type: Direct-Access ANSI SCSI revision: 04
Mar 17 12:40:47 MYHOST53 kernel: SCSI device sdd: 41943040 512-byte hdwr sectors (21475 MB)
Mar 17 12:40:47 MYHOST53 kernel: sdd: Write Protect is off
Mar 17 12:40:47 MYHOST53 kernel: SCSI device sdd: drive cache: write back w/ FUA
Mar 17 12:40:47 MYHOST53 kernel: SCSI device sdd: 41943040 512-byte hdwr sectors (21475 MB)
Mar 17 12:40:47 MYHOST53 kernel: sdd: Write Protect is off
Mar 17 12:40:47 MYHOST53 kernel: SCSI device sdd: drive cache: write back w/ FUA
Mar 17 12:40:47 MYHOST53 kernel: sdd: sdd1
Mar 17 12:40:47 MYHOST53 kernel: sd 5:0:0:0: Attached scsi disk sdd
Mar 17 12:40:47 MYHOST53 kernel: sd 5:0:0:0: Attached scsi generic sg2 type 0
Mar 17 12:40:47 MYHOST53 iscsid: received iferror -38
Mar 17 18:21:39 MYHOST53 last message repeated 20 times
Mar 17 18:27:59 MYHOST53 kernel: scsi 2:0:0:0: rejecting I/O to dead device

> It looks like one of the following is happening:
>
> 1. We're using RHEL 5.2 and the target logged us out or dropped the session, and when we tried to log in we got what we thought was a fatal error (but may be a transient error) from the target, so iscsid destroyed the session. When this happens the devices will be removed and IO to the device will get failed like you see below with the "rejecting I/O to dead device". In RHEL 5.3 this should be fixed: we will retry the login error instead of giving up right away.
>
> 2. Someone ran an iscsiadm logout command.

Unlikely, I am the only person working with this host currently.

> 3. iscsid bugged out and killed the session. I do not think this happens, because I see below for session4 (connection4:0) we get an error and end up logging back in, so iscsid is up and running.

Yes - iscsiadm -m session -P3 showed iSCSI as running. BUT the device sdc1:

/dev/sdc1 /sandisk1 ext3 _netdev 0 0

It disappeared! In /dev, sdc re-appeared as sdd, so I needed to update our fstab to

/dev/sdd1 /sandisk1 ext3 _netdev 0 0

... then remount the volume as /sandisk1, then log out of and back into iSCSI. On our prod boxes (such as the RHEL 5.2 box) we use LABELs.

> But if it is #1, it makes me think maybe the target is dropping the session or logging us out. This would explain some nops timing out or failing, or the conn failures in the other logs and below. Was there anything in the target logs at this time? Maybe something about a protocol error or something about rebalancing IO, or was there anything going on on the target like a firmware upgrade?

I have checked the logs on the SM node - unfortunately the logs are circular, so the history has already been overwritten (own-goal on my part!). I have checked this morning and so far there are only informational messages (no errors reported).

> I am afraid I do not know much about these targets. I have never used one. Have you made any requests to the DataCore people? Do you have a support guy that you can send me an email address for? Even a tech sales guy there or something might be useful to try and find someone. Does anyone know anyone there?

We have support with DataCore Europe and have logged support bundles with them in the past. I am looking to raise a new one shortly.

Thanks,
Rich.
END.

Mar 17 18:28:04 MYHOST53 kernel: scsi 2:0:0:0: rejecting I/O to dead device
Mar 17 18:28:04 MYHOST53 kernel: journal_bmap: journal block not found at offset 2616 on sdc1
Mar 17 18:28:04 MYHOST53 kernel: Aborting journal on device sdc1.
Mar 17 18:28:04 MYHOST53 kernel: scsi 2:0:0:0: rejecting I/O to dead device
Mar 17 18:28:04 MYHOST53 kernel: Buffer I/O error on device sdc1, logical block 1545
Mar 17 18:28:04 MYHOST53 kernel: lost page write due to I/O error on sdc1
Mar 17 23:03:40 MYHOST53 kernel: connection4:0: iscsi: detected conn error (1011)
Mar 17 23:03:41 MYHOST53 iscsid: Kernel reported iSCSI connection 4:0 error (1011) state (3)
Mar 17 23:03:44 MYHOST53 iscsid: received iferror -38
Mar 17 23:03:44 MYHOST53 last message repeated 2 times
Mar 17 23:03:44 MYHOST53 iscsid: connection4:0 is operational after recovery (1 attempts)
Mar 17 23:46:17 MYHOST53 kernel: connection4:0: iscsi: detected conn error (1011)
Mar 17 23:46:18 MYHOST53 iscsid: Kernel reported iSCSI connection 4:0 error (1011) state (3)
Mar 17 23:46:20
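For reference, the "noop settings set to 0" Mike asks about are the initiator's NOP-Out ping timers in /etc/iscsi/iscsid.conf. A sketch of the relevant entries (the parameter names are standard open-iscsi ones; whether disabling the pings is appropriate depends on your setup):

```
# /etc/iscsi/iscsid.conf -- NOP-Out (ping) settings referenced above.
# Setting interval and timeout to 0 disables initiator NOP-Out pings,
# which is one way to rule them out as the source of conn errors.
node.conn[0].timeo.noop_out_interval = 0
node.conn[0].timeo.noop_out_timeout = 0
```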
Re: Device not ready after error recovery?
dave wrote:
> Can anyone tell me why the SCSI layer says the device is not ready when iscsiadm reports it is logged in? Can I manually online the device? How should I recover from here?

You can do

echo running > /sys/block/sdX/device/state

but you might not want to, because the device may not be back.

> Is this a known problem, and has it been fixed in newer open-iscsi versions?

Are you using an older version of the Sun target?

> Mar 18 18:21:33 eq1-vz2 kernel: connection1:0: detected conn error (1011)
> Mar 18 18:21:36 eq1-vz2 kernel: session1: host reset succeeded

When we log back in we tell scsi-ml that we are ok.

> Mar 18 18:22:16 eq1-vz2 kernel: sd 6:0:0:0: scsi: Device offlined - not ready after error recovery

scsi-ml will send a Test Unit Ready (TUR) command to check that the device is ready to go. The TUR seems to be failing, and so the scsi layer sets the device offline. I think there was some target issue that was fixed in newer ones. If you can easily replicate this then you should take a wireshark/ethereal trace and send it here, so we can see why the TUR failed and make sure it is not our fault before you go to the trouble of updating.
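A minimal sketch of the manual recovery Mike describes. The device name sdc is taken from the logs in this thread as an example; only do this once you know the disk behind the path is actually back:

```shell
#!/bin/sh
# Inspect an offlined SCSI device and show how to bring it back online.
DEV=sdc
STATE="/sys/block/${DEV}/device/state"

# Show the current state (prints "unknown" if the device is absent).
echo "current state: $(cat "$STATE" 2>/dev/null || echo unknown)"

# To online it again (as root), once the disk is really back:
#   echo running > "$STATE"
# Then check that multipathd reinstates the failed path:
#   multipath -ll
echo "$STATE"
```

After the state flips to running, multipathd's path checker should move the path from [failed][faulty] back to active on its next check.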
Re: RHEL 5.2 and 5.3 - ISCSI Errors impacting database performance?
bigcatxjs wrote:
>>> Mar 17 18:27:59 MYHOST53 kernel: scsi 2:0:0:0: rejecting I/O to dead device
>>
>> It looks like one of the following is happening:
>>
>> 1. We're using RHEL 5.2 and the target logged us out or dropped the session, and when we tried to log in we got what we thought was a fatal error (but may be a transient error) from the target, so iscsid destroyed the session. When this happens the devices will be removed and IO to the device will get failed like you see below with the "rejecting I/O to dead device". In RHEL 5.3 this should be fixed: we will retry the login error instead of giving up right away.
>>
>> 2. Someone ran an iscsiadm logout command.
>
> Unlikely, I am the only person working with this host currently.
>
>> 3. iscsid bugged out and killed the session. I do not think this happens, because I see below for session4 (connection4:0) we get an error and end up logging back in, so iscsid is up and running.
>
> Yes - iscsiadm -m session -P3 showed iSCSI as running. BUT the device sdc1:
>
> /dev/sdc1 /sandisk1 ext3 _netdev 0 0
>
> It disappeared! In /dev, sdc re-appeared as sdd, so I needed to update our fstab to
>
> /dev/sdd1 /sandisk1 ext3 _netdev 0 0

Did this happen automatically? So the reject messages appeared and then you saw sdc switch to sdd? Or did you run some iscsiadm command after you saw the reject messages? Did this happen the time you sent the log output for? In the log output you got sdd here:

Mar 17 12:40:47 MYHOST53 kernel: sdd: Write Protect is off
Mar 17 12:40:47 MYHOST53 kernel: SCSI device sdd: drive cache: write back w/ FUA

Was this where sdd got added after it was sdc? Was sdc this disk:

Mar 17 18:27:59 MYHOST53 kernel: scsi 2:0:0:0: rejecting I/O to dead device

The thing is that 2:0:0:0 is on a completely different host/session/connection than sdd. In the original mail you only had one session running on the RHEL 5.3 box (from your iscsiadm output):

iSCSI Transport Class version 2.0-724
iscsiadm version 2.0-868
Target: iqn.2000-08.com.datacore:sm2-3
        Current Portal: 172.16.200.9:3260,1
        Persistent Portal: 172.16.200.9:3260,1
        Interface:
        Iface Name: default
        Iface Transport: tcp
        Iface Initiatorname: iqn.2005-03.com.redhat:01.406e5fd710e2
        Iface IPaddress: 172.16.200.69
        Iface HWaddress: default
        Iface Netdev: default
        SID: 1
        iSCSI Connection State: LOGGED IN
        iSCSI Session State: Unknown
        Internal iscsid Session State: NO CHANGE
        Negotiated iSCSI params:
        HeaderDigest: None
        DataDigest: None
        MaxRecvDataSegmentLength: 131072
        MaxXmitDataSegmentLength: 262144
        FirstBurstLength: 0
        MaxBurstLength: 1048576
        ImmediateData: No
        InitialR2T: Yes
        MaxOutstandingR2T: 1
        Attached SCSI devices:
        Host Number: 2 State: running
        scsi2 Channel 00 Id 0 Lun: 0
        Attached scsi disk sdc State: running

Is that still the same?

In general, disks can get any name from restart to restart of the iscsi service. So during one boot you will get sda, and then when you reboot, that disk might be sdb. You should use labels or udev names. For iscsi you normally do not want to use sdX names, because we do our login and scanning asynchronously, so the sdX names are pretty random.
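Mike's advice about labels or udev names can be sketched as fstab entries like the following. The label "sandisk1" and the by-path name are illustrative, built from the mount point and target IQN quoted in this thread; check /dev/disk/ on your own host for the real names:

```
# /etc/fstab -- persistent alternatives to /dev/sdX for iSCSI disks.

# Option 1: filesystem label (set with: e2label /dev/sdd1 sandisk1)
LABEL=sandisk1  /sandisk1  ext3  _netdev  0 0

# Option 2: persistent udev name under /dev/disk/by-path/
# /dev/disk/by-path/ip-172.16.200.9:3260-iscsi-iqn.2000-08.com.datacore:sm2-3-lun-0-part1  /sandisk1  ext3  _netdev  0 0
```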
Re: Device not ready after error recovery?
On Mar 19, 10:56 am, Mike Christie micha...@cs.wisc.edu wrote:
> dave wrote:
>> Can anyone tell me why the SCSI layer says the device is not ready when iscsiadm reports it is logged in? Can I manually online the device? How should I recover from here?
>
> You can do
>
> echo running > /sys/block/sdX/device/state
>
> but you might not want to, because the device may not be back.

A disk in the Sun iscsi target server died. When a disk fails in the server, the iscsi target pauses all reads/writes for about 1-2 minutes until it marks the disk as faulted, then continues normal operation using the rest of the RAID pool. I had tested this before, and dm-multipath with iscsi seemed to work just fine when the iscsi target paused and eventually resumed, so I was just a little surprised this time. Usually I see timing closer to a minute between conn error and recovery... what are the reconnect/recovery timers of open-iscsi for this scenario?

>> Is this a known problem, and has it been fixed in newer open-iscsi versions?
>
> Are you using an older version of the Sun target?

I am. I am running OpenSolaris SXCE build 93, which is about 8 months old. I'll be upgrading this soon.

>> Mar 18 18:21:33 eq1-vz2 kernel: connection1:0: detected conn error (1011)
>> Mar 18 18:21:36 eq1-vz2 kernel: session1: host reset succeeded
>
> When we log back in we tell scsi-ml that we are ok.

At what level does the connection receive an error and reset (can't log in to target, read/write errors, etc.), and what functionality is needed to be considered ok? If the device wasn't really ready to be used again, shouldn't iscsi know this and attempt another recovery? I'm not particularly well versed in the iscsi protocol.

>> Mar 18 18:22:16 eq1-vz2 kernel: sd 6:0:0:0: scsi: Device offlined - not ready after error recovery
>
> scsi-ml will send a Test Unit Ready (TUR) command to check that the device is ready to go. The TUR seems to be failing, and so the scsi layer sets the device offline.

Is there only one TUR sent? I would have assumed a more robust recovery procedure here.

> I think there was some target issue that was fixed in newer ones. If you can easily replicate this then you should take a wireshark/ethereal trace and send it here, so we can see why the TUR failed and make sure it is not our fault before you go to the trouble of updating.

I'll see what I can do to get a wire trace next time I have an opportunity to intentionally hiccup the iscsi target.

Thanks, Mike.

--
Dave
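The reconnect/recovery timers dave asks about live in /etc/iscsi/iscsid.conf. A sketch of the ones relevant here; the values shown are typical defaults for open-iscsi of this era, so check your own iscsid.conf, since distributions ship different defaults:

```
# /etc/iscsi/iscsid.conf -- timers relevant to session recovery.

# How long (seconds) a failed session may stay in recovery before
# queued SCSI commands are failed up to the multipath layer.
node.session.timeo.replacement_timeout = 120

# How long to wait for a login to complete, and how many times to
# retry the initial login before giving up.
node.conn[0].timeo.login_timeout = 15
node.session.initial_login_retry_max = 8
```

Note that the SCSI command timer in /sys/block/sdX/device/timeout is a separate, per-device setting layered on top of these.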
Re: Device not ready after error recovery?
dave wrote:
> On Mar 19, 10:56 am, Mike Christie micha...@cs.wisc.edu wrote:
>> You can do
>>
>> echo running > /sys/block/sdX/device/state
>>
>> but you might not want to, because the device may not be back.
>
> A disk in the Sun iscsi target server died. When a disk fails in the server, the iscsi target pauses all reads/writes for about 1-2 minutes until it marks the disk as faulted, then continues normal operation using the rest of the RAID pool. I had tested this before, and dm-multipath with iscsi seemed to work just fine when the iscsi target paused and eventually resumed, so I was just a little surprised this time. Usually I see timing closer to a minute between conn error and recovery... what are the reconnect/recovery timers of open-iscsi for this scenario?

First the scsi command timer would expire. You can see/set this in /sys/block/sdX/device/timeout (there is also a udev rule). This causes the scsi eh to run. That will try to abort the tasks on the device. If that fails, we try a LU reset. If that fails, we drop the sessions on the host and relogin (that is where the host reset message comes from). So for a disk failure, we can log back in quickly because the target is fine. The scsi eh will then send a TUR to the device to verify it is back. The TUR could then fail quickly, like you saw, because the disk really is bad. For this case, when you know the disk is back online you would want to manually set the state to running. Eventually multipathd will then set the path back online in the multipath device.

>> Is this a known problem, and has it been fixed in newer open-iscsi versions?
>
> I am. I am running OpenSolaris SXCE build 93, which is about 8 months old. I'll be upgrading this soon.

> At what level does the connection receive an error and reset (can't log in to target, read/write errors, etc.), and what functionality is needed to be considered ok? If the device wasn't really ready to be used again, shouldn't iscsi know this and attempt another recovery? I'm not particularly well versed in the iscsi protocol.

iSCSI does not know this and does not really deal with the device. It deals with the connections/session to the target port/portal. Here the target seems fine, so we can relog in quickly. The connections are fine, and we can send iscsi-level IOs like logins and nops to the target and it will respond ok. The target could tell the initiator that it is temporarily unavailable when we try to log in again, but if it can allow IO to other disks while the problem on the one bad disk is going on, it probably would not want to do this. If the target is returning something in the TUR response that indicates the device is only temporarily gone, then maybe we would want to change the scsi layer so that, instead of failing and setting the device offline right away, it retries its eh a little later.

>> scsi-ml will send a Test Unit Ready (TUR) command to check that the device is ready to go. The TUR seems to be failing, and so the scsi layer sets the device offline.
>
> Is there only one TUR sent? I would have assumed a more robust recovery procedure here.

Only a TUR is sent to check if the aborts or resets worked.

> I'll see what I can do to get a wire trace next time I have an opportunity to intentionally hiccup the iscsi target.

You probably do not need to worry about this. It is working as expected. But if you could get a trace, we could see what the TUR is failed with, and maybe see if we can add some code so that if the device is telling us it is only a temporary problem, we do not fail right away.
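A trace like the one Mike asks for can be captured with tcpdump and opened in wireshark afterwards. The interface and file names below are examples; adjust them for your setup:

```shell
#!/bin/sh
# Capture iSCSI traffic (default TCP port 3260) for later inspection
# in wireshark, which can decode iSCSI PDUs including the failing TUR.
IFACE=bond0
CAPFILE=iscsi-trace.pcap

# -s 0 captures full packets so the SCSI sense data is not truncated.
CMD="tcpdump -i $IFACE -s 0 -w $CAPFILE tcp port 3260"
echo "$CMD"
# Run it with root privileges while reproducing the failure:
#   sudo $CMD
# Then open the capture file in wireshark and filter on "iscsi".
```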
Re: About Intel I/OAT support in open-iscsi
Or Gerlitz wrote:
> Mike Christie wrote:
>> I am not sure what you are asking. Did you look at the patches?
>
> I gave it a quick look before asking. I looked again now and I think I understand it better - let's see if I'm headed in the correct direction: the I/OAT-related kernel code serves the TCP stack by copying data from the skb to the consumer buffer. For user-space consumers this is fairly simple and done behind the kernel covers. For kernel consumers such as iscsi, things get a bit more complex: there is a richer API, and some flavors of it may not be able to use the DMA functionality - or, the other way around, the DMA functionality has to be exported to/used by the network stack consumer - namely iscsi, nfs, gfs, etc. So this patch set makes changes both in the network internals/API and in the iscsi logic.

Something like that. However, I do not know about the other code. It seems like some nfs code uses kernel_recvmsg, so it would be fine. Maybe the rpc code uses recvmsg and could be fine, but some data transfer code uses tcp_read_sock and skb_copy_bits, and so it would need changes like iscsi. I think gfs2 sends data over block/scsi devices, so this is only applicable if it was using tcp sockets and an API that was not already converted for locking data or metadata. Something like nbd looks fine as is.