Re: Infortrend + "iSCSI: detected conn error (1011)" + "TCP Dup ACK"
Matthew Dickinson wrote:
>
> On 11/10/09 11:39 AM, "Mike Christie" wrote:
>> What version of open-iscsi were you using and what kernel, and were you
>> using the iscsi kernel modules from the open-iscsi.org tarball or from the
>> kernel?
>
> iscsi-initiator-utils-6.2.0.871-0.10.el5
> kernel-2.6.18-164.2.1.el5
>
> RedHat RPMs
>
>> It looks like we are sending more IO than the target can handle. In one
>> of those cases it took more than 30 or 60 seconds (depending on your
>> timeout value).
>>
>> What is the value of
>>
>> cat /sys/block/sdXYZ/device/timeout
>>
>> ?
>>
>> If it is 30 or 60 could you increase it to 360? After you login to the
>> target do
>>
>> echo 360 > /sys/block/sdXYZ/device/timeout
>
> I've tried setting this, but it appears to have no effect - it was 60, and I
> increased it to 360.
>
>> And what is the value of:
>>
>> iscsiadm -m node -T your_target | grep node.session.cmds_max
>>
>> If that is 128, then could you decrease that to 32 or 16?
>>
>> Run
>>
>> iscsiadm -m node -T your_target -u
>> iscsiadm -m node -T your_target -o update -n node.session.cmds_max -v 32
>> iscsiadm -m node -T your_target -l
>
> I've tried setting it to both 16 and 32, but it behaves about the same.
>
>> And if those prevent the io errors then could you do
>>
>> echo noop > /sys/block/sdXYZ/queue/scheduler
>>
>> to see if performance increases with a different scheduler.
>
> I really think I'm back to the duplicate ACK problem - see the attached
> packet dump - at one point there's 30 duplicate ACKs... Interestingly, the

I did not get the attachment.

> storage has "worked" for the past week - I'm using it as D2D backup. This
> morning (about 7 days later), it's giving all these duplicate ACKs.
>
> I'm currently running into messages such as:
>
> Nov 19 09:46:58 backup iscsid: Kernel reported iSCSI connection 2:0 error (1011) state (3)
> Nov 19 09:47:00 backup kernel: session2: target reset succeeded

If you are using the Red Hat RPMs, open a Red Hat bugzilla at https://bugzilla.redhat.com/. CC mchri...@redhat.com on the bugzilla, or email me at that address once you have filed it. I will then add some network people to it. Attach your trace to the bugzilla.
Re: Infortrend + "iSCSI: detected conn error (1011)" + "TCP Dup ACK"
On 11/10/09 11:39 AM, "Mike Christie" wrote: > > What version of open-iscsi were you using and what kernel, and were you > using the iscsi kernel modules with open-iscsi.org tarball or from the > kernel? iscsi-initiator-utils-6.2.0.871-0.10.el5 kernel-2.6.18-164.2.1.el5 RedHat RPMs > > > It looks like we are sending more IO than the target can handle. In one > of those cases it took more than 30 or 60 seconds (depending on your > timeout value). > > What is the value of > > cat /sys/block/sdXYZ/device/timeout > > ? > > If it is 30 or 60 could you increase it to 360? After you login to the > target do > > echo 360 > /sys/block/sdXYZ/device/timeout I've tried setting this, but it appears to have no effect - it was 60, and I increased to 360. > > And what is the value of: > > iscsiadm -m node -T your_target | grep node.session.cmds_max > > If that is 128, then could you decrease that to 32 or 16? > > Run > > iscsiadm -m node -T your_target -u > iscsiadm -m node -T your_target -o update -n node.session.cmds_max -v 32 > iscsiad-m node -T your_target -l I've tried setting to both 16 and 32, but it behaves about the same. > > > And if those prevent the io errors then could you do > > echo noop > /sys/block/sdXYZ/queue/scheduler > > to see if performance increases with a difference scheduler. I really think I'm back to the duplicate ACK problem - see the attached packet dump - at one point there's 30 duplicate ACKs... Interestingly, the storage has "worked" for the past week - I'm using it as D2D backup. This morning (about 7 days later), it's giving all these duplicate ACKs. I'm currently running into messages such as: Nov 19 09:46:58 backup iscsid: Kernel reported iSCSI connection 2:0 error (1011) state (3) Nov 19 09:47:00 backup kernel: session2: target reset succeeded Nov 19 09:47:01 backup iscsid: connection2:0 is operational after recovery (1 attempts) Nov 19 09:47:10 backup kernel: sd 9:0:0:0: SCSI error: return code = 0x000e Nov 19 09:47:10 backup kernel: end_request: I/O error, dev sdf, sector 8856 Nov 19 09:47:10 backup kernel: device-mapper: multipath: Failing path 8:80. Nov 19 09:47:10 backup kernel: sd 9:0:0:0: SCSI error: return code = 0x000e Nov 19 09:47:10 backup kernel: end_request: I/O error, dev sdf, sector 74424 Nov 19 09:47:10 backup kernel: sd 9:0:0:2: SCSI error: return code = 0x000e Nov 19 09:47:10 backup kernel: end_request: I/O error, dev sdm, sector 8845240 Nov 19 09:47:10 backup kernel: device-mapper: multipath: Failing path 8:192. 
Nov 19 09:47:10 backup kernel: sd 9:0:0:2: SCSI error: return code = 0x000e Nov 19 09:47:10 backup kernel: end_request: I/O error, dev sdm, sector 62915456 Nov 19 09:47:10 backup kernel: sd 9:0:0:2: timing out command, waited 300s Nov 19 09:47:10 backup multipathd: /sbin/mpath_prio_alua exitted with 1 Nov 19 09:47:10 backup multipathd: error calling out /sbin/mpath_prio_alua /dev/sdm Nov 19 09:47:10 backup multipathd: 3600d0230061d4479bfb83902: switch to path group #2 This is also interesting: Nov 18 01:48:30 backup multipathd: 3600d0230061d4479bfb83902: remaining active paths: 8 Nov 18 20:16:29 backup multipathd: 3600d0230061d4479bfb83902: remaining active paths: 7 Nov 18 20:16:29 backup multipathd: 3600d0230061d4479bfb83902: remaining active paths: 6 Nov 18 20:16:29 backup multipathd: 3600d0230061d4479bfb83902: remaining active paths: 5 Nov 18 20:16:29 backup multipathd: 3600d0230061d4479bfb83902: remaining active paths: 4 Nov 18 20:16:34 backup multipathd: 3600d0230061d4479bfb83902: remaining active paths: 5 Nov 18 20:32:09 backup multipathd: 3600d0230061d4479bfb83902: remaining active paths: 6 Nov 18 20:43:05 backup multipathd: 3600d0230061d4479bfb83902: remaining active paths: 7 Nov 18 20:48:08 backup multipathd: 3600d0230061d4479bfb83902: remaining active paths: 8 Nov 18 20:53:36 backup multipathd: 3600d0230061d4479bfb83902: remaining active paths: 7 Nov 18 20:53:36 backup multipathd: 3600d0230061d4479bfb83902: remaining active paths: 6 Nov 18 20:53:36 backup multipathd: 3600d0230061d4479bfb83902: remaining active paths: 5 Nov 18 20:53:36 backup multipathd: 3600d0230061d4479bfb83902: remaining active paths: 4 Nov 18 20:53:36 backup multipathd: 3600d0230061d4479bfb83902: remaining active paths: 3 Nov 18 20:53:36 backup multipathd: 3600d0230061d4479bfb83902: remaining active paths: 2 Nov 18 20:53:36 backup multipathd: 3600d0230061d4479bfb83902: remaining active paths: 1 Nov 18 20:53:36 backup multipathd: 3600d0230061d4479bfb83902: remaining active paths: 0 Nov 18 20:53:41 backup multipathd: 3600d0230061d4479bfb83902: remaining active paths: 1 Nov 18 20:59:09 backup multipathd: 3600d0230061d4479bfb83902: remaining active paths: 2 Nov 18 21:04:37 backup multipathd: 3600d0230061d4479bfb83902: remaining active paths: 3 Nov 18 21:10:05 back
Re: Infortrend + "iSCSI: detected conn error (1011)" + "TCP Dup ACK"
oh... also, the "ovm1" server is the dom0. That system is connecting to IETD targets from within the dom0 using iscsi-initiator-utils-6.2.0.871-0.7. There are two ifaces setup to connect to two portal IP's on my CentOS 5.3, aka "storage", system. The result is this: [r...@ovm1 ~]# iscsiadm -m session tcp: [1] 192.168.30.1:3260,1 iqn.2009-09.com.itec.storage:tgtd tcp: [2] 192.168.30.1:3260,1 iqn.2009-09.com.itec.storage:tgtd tcp: [3] 192.168.30.2:3260,1 iqn.2009-09.com.itec.storage:tgtd tcp: [4] 192.168.30.2:3260,1 iqn.2009-09.com.itec.storage:tgtd In my REAL test and production OVM 2.2 environments, we have 2 ifaces setup to connect to one load balanced Dell EqualLogic group IP, which results in something like this: [r...@oir7102901 log]# iscsiadm -m session | grep lun0 tcp: [2] 192.168.0.19:3260,1 iqn.2001-05.com.equallogic:0-8a0906-781989902-5a900114ae5e-ovm-t-lun0 tcp: [5] 192.168.0.19:3260,1 iqn.2001-05.com.equallogic:0-8a0906-781989902-5a900114ae5e-ovm-t-lun0 [r...@oir7102901 log]# In our current TEST environment (Dell EqualLogic + Dell powerconnect switch running MTU=9126 + Dell R710 with OVM 2.2 installed) we've had some strange server issues (possible hardware problems. I'm still diagnosing this). However, with regards to iSCSI, I haven't seen any issues and can see (3) dt threads going for my ovm-t-lun0, ovm-t-lun1, and ovm-t-lun2 volumes. The overall throughput is roughly only about 70MB/s TOTAL (that includes metrics from all three of those luns) for reads and writes. The iowait% is my bottleneck at the moment though. And we have not had any "iscsi disconnect" messsages. In our PROD2 environment (Dell m610 blades + Cisco switch modules in the blade chassis-- I believe m1000e-- + Dell powerconnect switch + Dell EqualLogic members) we've had absolutely no issues thus far. The dt tests have the same throughput and have been running for about 8-9 days so far without a single disconnect. On Nov 12, 2009, at 10:15 AM, Hoot, Joseph wrote: > > sorry... wrong information. Here is the correct information. I was doing > some testing in VMWare Fusion VM's for a presentation that I'm giving. The > "storage" server is CentOS 5.3, which dishes out IETD targets for my OVM > servers. The OVM 2.2 environment is as follows: > > [r...@ovm1 ~]# uname -r > 2.6.18-128.2.1.4.9.el5xen > [r...@ovm1 ~]# rpm -qa | grep iscsi > iscsi-initiator-utils-6.2.0.871-0.7.el5 > [r...@ovm1 ~]# > > > > On Nov 10, 2009, at 2:30 PM, Mike Christie wrote: > >> >> Hoot, Joseph wrote: >>> [r...@storage ~]# uname -r >>> 2.6.18-164.el5 >>> [r...@storage ~]# rpm -qa | grep iscsi >>> iscsi-initiator-utils-6.2.0.868-0.18.el5_3.1 >>> [r...@storage ~]# >>> >> >> Weird. >> >> Is 2.6.18-164.el5 the kernel being used in the virtual machine/DonU? Is >> that where you are using iscsi? It looks like the Oracle enterprise >> linux kernel is 2.6.18-164.el5, which looks like it is based on RHEL >> 5.4. The iscsi code in there is the same as RHEL/upstream. No sendwait >> patch. >> >> However, it looks like there is a 2.6.18-128.2.1.4.9 kernel (comes with >> the Oracle VM rpms). In here we have a different iscsi version. It looks >> a little older than what is in 2.6.18-164.el5, but it has the sendwait >> patch I send to dell. Do you use this kernel in the Dom0? Are you using >> this kernel with iscsi? >> >> >> >>> On Nov 10, 2009, at 12:17 PM, Mike Christie wrote: >>> Hoot, Joseph wrote: > I've had about 3 threads of dt (kicking off a bit randomly) on (3) > separate volumes for over a week and haven't had a single disconnect yet. 
> I am currently using whatever rpm is distributed with Oracle VM v2.2. > I know for sure that they have included the 871 base, plus I believe at > least a one off patch. I can get more details if you'd like. > > But so far so good for now > I think I have the source they are using. Could you do a uname -r, so I can see what kernel they are using. >>> >>> === >>> Joseph R. Hoot >>> Lead System Programmer/Analyst >>> (w) 716-878-4832 >>> (c) 716-759-HOOT >>> joe.h...@itec.suny.edu >>> GPG KEY: 7145F633 >>> === >>> >>> >> >> >>> > > === > Joseph R. Hoot > Lead System Programmer/Analyst > (w) 716-878-4832 > (c) 716-759-HOOT > joe.h...@itec.suny.edu > GPG KEY: 7145F633 > === > > > > === Joseph R. Hoot Lead System Programmer/Analyst (w) 716-878-4832 (c) 716-759-HOOT joe.h...@itec.suny.edu GPG KEY: 7145F633 === --~--~-~--~~~---~--~~ You received this message because you are subscribed to the Google Groups "open-iscsi" group. To post to this group, send email to open-iscsi@googlegroups.com To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com For more options, visit th
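For anyone reproducing the two-iface layout described above, it is normally built with iscsiadm's iface mode before discovery. A rough sketch under assumed names: eth2/eth3 are placeholders for the NICs facing the portals (not taken from the original post), and the portal addresses are the 192.168.30.x ones from the session listing:

# one iface record per storage NIC
iscsiadm -m iface -o new -I iface0
iscsiadm -m iface -o update -I iface0 -n iface.net_ifacename -v eth2
iscsiadm -m iface -o new -I iface1
iscsiadm -m iface -o update -I iface1 -n iface.net_ifacename -v eth3

# discover through both ifaces so each portal gets a node record per iface
iscsiadm -m discovery -t sendtargets -p 192.168.30.1 -I iface0 -I iface1
iscsiadm -m discovery -t sendtargets -p 192.168.30.2 -I iface0 -I iface1

# log in to all discovered nodes
iscsiadm -m node -L all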
Re: Infortrend + "iSCSI: detected conn error (1011)" + "TCP Dup ACK"
sorry... wrong information. Here is the correct information. I was doing some testing in VMWare Fusion VM's for a presentation that I'm giving. The "storage" server is CentOS 5.3, which dishes out IETD targets for my OVM servers. The OVM 2.2 environment is as follows: [r...@ovm1 ~]# uname -r 2.6.18-128.2.1.4.9.el5xen [r...@ovm1 ~]# rpm -qa | grep iscsi iscsi-initiator-utils-6.2.0.871-0.7.el5 [r...@ovm1 ~]# On Nov 10, 2009, at 2:30 PM, Mike Christie wrote: > > Hoot, Joseph wrote: >> [r...@storage ~]# uname -r >> 2.6.18-164.el5 >> [r...@storage ~]# rpm -qa | grep iscsi >> iscsi-initiator-utils-6.2.0.868-0.18.el5_3.1 >> [r...@storage ~]# >> > > Weird. > > Is 2.6.18-164.el5 the kernel being used in the virtual machine/DonU? Is > that where you are using iscsi? It looks like the Oracle enterprise > linux kernel is 2.6.18-164.el5, which looks like it is based on RHEL > 5.4. The iscsi code in there is the same as RHEL/upstream. No sendwait > patch. > > However, it looks like there is a 2.6.18-128.2.1.4.9 kernel (comes with > the Oracle VM rpms). In here we have a different iscsi version. It looks > a little older than what is in 2.6.18-164.el5, but it has the sendwait > patch I send to dell. Do you use this kernel in the Dom0? Are you using > this kernel with iscsi? > > > >> On Nov 10, 2009, at 12:17 PM, Mike Christie wrote: >> >>> Hoot, Joseph wrote: I've had about 3 threads of dt (kicking off a bit randomly) on (3) separate volumes for over a week and haven't had a single disconnect yet. I am currently using whatever rpm is distributed with Oracle VM v2.2. I know for sure that they have included the 871 base, plus I believe at least a one off patch. I can get more details if you'd like. But so far so good for now >>> I think I have the source they are using. Could you do a uname -r, so I >>> can see what kernel they are using. >>> >> >> === >> Joseph R. Hoot >> Lead System Programmer/Analyst >> (w) 716-878-4832 >> (c) 716-759-HOOT >> joe.h...@itec.suny.edu >> GPG KEY: 7145F633 >> === >> >> >>> > > > > === Joseph R. Hoot Lead System Programmer/Analyst (w) 716-878-4832 (c) 716-759-HOOT joe.h...@itec.suny.edu GPG KEY: 7145F633 === --~--~-~--~~~---~--~~ You received this message because you are subscribed to the Google Groups "open-iscsi" group. To post to this group, send email to open-iscsi@googlegroups.com To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/open-iscsi -~--~~~~--~~--~--~---
Re: Infortrend + "iSCSI: detected conn error (1011)" + "TCP Dup ACK"
Hoot, Joseph wrote:
> [r...@storage ~]# uname -r
> 2.6.18-164.el5
> [r...@storage ~]# rpm -qa | grep iscsi
> iscsi-initiator-utils-6.2.0.868-0.18.el5_3.1
> [r...@storage ~]#

Weird.

Is 2.6.18-164.el5 the kernel being used in the virtual machine/DomU? Is that where you are using iscsi? It looks like the Oracle Enterprise Linux kernel is 2.6.18-164.el5, which looks like it is based on RHEL 5.4. The iscsi code in there is the same as RHEL/upstream. No sendwait patch.

However, it looks like there is a 2.6.18-128.2.1.4.9 kernel (it comes with the Oracle VM rpms). In there we have a different iscsi version. It looks a little older than what is in 2.6.18-164.el5, but it has the sendwait patch I sent to Dell. Do you use this kernel in the Dom0? Are you using this kernel with iscsi?

> On Nov 10, 2009, at 12:17 PM, Mike Christie wrote:
>
>> Hoot, Joseph wrote:
>>> I've had about 3 threads of dt (kicking off a bit randomly) on (3) separate
>>> volumes for over a week and haven't had a single disconnect yet. I am
>>> currently using whatever rpm is distributed with Oracle VM v2.2. I know
>>> for sure that they have included the 871 base, plus I believe at least a
>>> one off patch. I can get more details if you'd like.
>>>
>>> But so far so good for now
>>>
>> I think I have the source they are using. Could you do a uname -r, so I
>> can see what kernel they are using.
>>
>
> ===
> Joseph R. Hoot
> Lead System Programmer/Analyst
> (w) 716-878-4832
> (c) 716-759-HOOT
> joe.h...@itec.suny.edu
> GPG KEY: 7145F633
> ===
Re: Infortrend + "iSCSI: detected conn error (1011)" + "TCP Dup ACK"
Matthew Dickinson wrote:
> On 11/6/09 3:39 PM, "Matthew Dickinson" wrote:
>
>>> Try disabling nops by setting
>>>
>>> node.conn[0].timeo.noop_out_interval = 0
>>> node.conn[0].timeo.noop_out_timeout = 0
>
> I'm still getting errors:
>
> Nov 10 09:08:04 backup kernel: connection12:0: detected conn error (1011)
> Nov 10 09:08:05 backup iscsid: Kernel reported iSCSI connection 12:0 error (1011) state (3)
> Nov 10 09:08:08 backup iscsid: connection12:0 is operational after recovery (1 attempts)
> Nov 10 09:09:43 backup kernel: connection11:0: detected conn error (1011)
> Nov 10 09:09:43 backup kernel: connection12:0: detected conn error (1011)
> Nov 10 09:09:44 backup kernel: connection11:0: detected conn error (1011)
> Nov 10 09:09:44 backup iscsid: Kernel reported iSCSI connection 11:0 error (1011) state (3)
> Nov 10 09:09:44 backup iscsid: Kernel reported iSCSI connection 12:0 error (1011) state (3)
> Nov 10 09:09:44 backup iscsid: Kernel reported iSCSI connection 11:0 error (1011) state (1)
> Nov 10 09:09:46 backup kernel: session11: target reset succeeded
> Nov 10 09:09:47 backup iscsid: connection11:0 is operational after recovery (1 attempts)
> Nov 10 09:09:47 backup iscsid: connection12:0 is operational after recovery (1 attempts)
> Nov 10 09:09:56 backup kernel: sd 18:0:0:2: SCSI error: return code = 0x000e
> Nov 10 09:09:56 backup kernel: end_request: I/O error, dev sdv, sector 60721248
> Nov 10 09:09:56 backup kernel: device-mapper: multipath: Failing path 65:80.
> Nov 10 09:09:56 backup kernel: sd 18:0:0:2: SCSI error: return code = 0x000e
> Nov 10 09:09:56 backup kernel: end_request: I/O error, dev sdv, sector 60727648
> Nov 10 09:09:56 backup kernel: sd 18:0:0:2: SCSI error: return code = 0x000e
> Nov 10 09:10:31 backup kernel: device-mapper: multipath: Failing path 65:112.
>
> Interestingly, I tried a Windows 2008 Server R2 box talking over a single
> connection to the storage unit, configured to access via just one
> interface, and I was able to sustain 20MB/s, so it would "appear" to be a
> Linux-related issue - I'm only able to get 9MB/s out of Linux even when
> using 8 interfaces on both controllers.

What version of open-iscsi were you using and what kernel, and were you using the iscsi kernel modules from the open-iscsi.org tarball or from the kernel?

It looks like we are sending more IO than the target can handle. In one of those cases it took more than 30 or 60 seconds (depending on your timeout value).

What is the value of

cat /sys/block/sdXYZ/device/timeout

?

If it is 30 or 60 could you increase it to 360? After you login to the target do

echo 360 > /sys/block/sdXYZ/device/timeout

And what is the value of:

iscsiadm -m node -T your_target | grep node.session.cmds_max

If that is 128, then could you decrease it to 32 or 16?

Run

iscsiadm -m node -T your_target -u
iscsiadm -m node -T your_target -o update -n node.session.cmds_max -v 32
iscsiadm -m node -T your_target -l

And if those prevent the io errors then could you do

echo noop > /sys/block/sdXYZ/queue/scheduler

to see if performance increases with a different scheduler.
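Put together, the sequence being suggested above looks roughly like this; it is only a sketch, with sdv standing in for each iSCSI-backed sd device (taken from the log above) and your_target for the real IQN. The sysfs writes do not survive a logout/login, so they have to be re-applied after each recovery.

# raise the SCSI command timeout on each iSCSI-backed device (repeat per device)
echo 360 > /sys/block/sdv/device/timeout

# drop the per-session queue depth, then log out/in so it takes effect
iscsiadm -m node -T your_target -u
iscsiadm -m node -T your_target -o update -n node.session.cmds_max -v 32
iscsiadm -m node -T your_target -l

# switch the elevator to noop while testing; the active scheduler is shown in brackets
echo noop > /sys/block/sdv/queue/scheduler
cat /sys/block/sdv/queue/scheduler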
Re: Infortrend + "iSCSI: detected conn error (1011)" + "TCP Dup ACK"
On 11/6/09 3:39 PM, "Matthew Dickinson" wrote:
>
>> Try disabling nops by setting
>>
>> node.conn[0].timeo.noop_out_interval = 0
>> node.conn[0].timeo.noop_out_timeout = 0

I'm still getting errors:

Nov 10 09:08:04 backup kernel: connection12:0: detected conn error (1011)
Nov 10 09:08:05 backup iscsid: Kernel reported iSCSI connection 12:0 error (1011) state (3)
Nov 10 09:08:08 backup iscsid: connection12:0 is operational after recovery (1 attempts)
Nov 10 09:09:43 backup kernel: connection11:0: detected conn error (1011)
Nov 10 09:09:43 backup kernel: connection12:0: detected conn error (1011)
Nov 10 09:09:44 backup kernel: connection11:0: detected conn error (1011)
Nov 10 09:09:44 backup iscsid: Kernel reported iSCSI connection 11:0 error (1011) state (3)
Nov 10 09:09:44 backup iscsid: Kernel reported iSCSI connection 12:0 error (1011) state (3)
Nov 10 09:09:44 backup iscsid: Kernel reported iSCSI connection 11:0 error (1011) state (1)
Nov 10 09:09:46 backup kernel: session11: target reset succeeded
Nov 10 09:09:47 backup iscsid: connection11:0 is operational after recovery (1 attempts)
Nov 10 09:09:47 backup iscsid: connection12:0 is operational after recovery (1 attempts)
Nov 10 09:09:56 backup kernel: sd 18:0:0:2: SCSI error: return code = 0x000e
Nov 10 09:09:56 backup kernel: end_request: I/O error, dev sdv, sector 60721248
Nov 10 09:09:56 backup kernel: device-mapper: multipath: Failing path 65:80.
Nov 10 09:09:56 backup kernel: sd 18:0:0:2: SCSI error: return code = 0x000e
Nov 10 09:09:56 backup kernel: end_request: I/O error, dev sdv, sector 60727648
Nov 10 09:09:56 backup kernel: sd 18:0:0:2: SCSI error: return code = 0x000e
Nov 10 09:10:31 backup kernel: device-mapper: multipath: Failing path 65:112.

Interestingly, I tried a Windows 2008 Server R2 box talking over a single connection to the storage unit, configured to access via just one interface, and I was able to sustain 20MB/s, so it would "appear" to be a Linux-related issue - I'm only able to get 9MB/s out of Linux even when using 8 interfaces on both controllers.

Matthew
Re: Infortrend + "iSCSI: detected conn error (1011)" + "TCP Dup ACK"
[r...@storage ~]# uname -r 2.6.18-164.el5 [r...@storage ~]# rpm -qa | grep iscsi iscsi-initiator-utils-6.2.0.868-0.18.el5_3.1 [r...@storage ~]# On Nov 10, 2009, at 12:17 PM, Mike Christie wrote: > > Hoot, Joseph wrote: >> I've had about 3 threads of dt (kicking off a bit randomly) on (3) separate >> volumes for over a week and haven't had a single disconnect yet. I am >> currently using whatever rpm is distributed with Oracle VM v2.2. I know for >> sure that they have included the 871 base, plus I believe at least a one off >> patch. I can get more details if you'd like. >> >> But so far so good for now >> > > I think I have the source they are using. Could you do a uname -r, so I > can see what kernel they are using. > > > === Joseph R. Hoot Lead System Programmer/Analyst (w) 716-878-4832 (c) 716-759-HOOT joe.h...@itec.suny.edu GPG KEY: 7145F633 === --~--~-~--~~~---~--~~ You received this message because you are subscribed to the Google Groups "open-iscsi" group. To post to this group, send email to open-iscsi@googlegroups.com To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/open-iscsi -~--~~~~--~~--~--~---
Re: Infortrend + "iSCSI: detected conn error (1011)" + "TCP Dup ACK"
Hoot, Joseph wrote: > I've had about 3 threads of dt (kicking off a bit randomly) on (3) separate > volumes for over a week and haven't had a single disconnect yet. I am > currently using whatever rpm is distributed with Oracle VM v2.2. I know for > sure that they have included the 871 base, plus I believe at least a one off > patch. I can get more details if you'd like. > > But so far so good for now > I think I have the source they are using. Could you do a uname -r, so I can see what kernel they are using. --~--~-~--~~~---~--~~ You received this message because you are subscribed to the Google Groups "open-iscsi" group. To post to this group, send email to open-iscsi@googlegroups.com To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/open-iscsi -~--~~~~--~~--~--~---
Re: Infortrend + "iSCSI: detected conn error (1011)" + "TCP Dup ACK"
I've had about 3 threads of dt (kicking off a bit randomly) on (3) separate volumes for over a week and haven't had a single disconnect yet. I am currently using whatever rpm is distributed with Oracle VM v2.2. I know for sure that they have included the 871 base, plus I believe at least a one off patch. I can get more details if you'd like. But so far so good for now On Nov 9, 2009, at 6:18 PM, Mike Christie wrote: > > Hoot, Joseph wrote: >> it was for OiS 871 code prior to RHEL 5.4 release (not sure if the >> release include it or not). I'm not sure who came up with it. I was >> working with Don Williams from Dell EqualLogic. He got ahold of it >> somehow. I applied it and it seemed to improve things. >> > > Ah ok. I think it was the patch I sent to Don. > > If you just used 871 without the patch (or what is in the stock RHEL 5.4 > kernel) does it work ok? There were a couple changes from 868 to 871 > that I thought would also fix the problem, so I was waiting for Don and > them to retest just 871 and get back to me. > >> >> On Nov 9, 2009, at 2:31 PM, Mike Christie wrote: >> >>> Hoot, Joseph wrote: What version of OiS are you using? I had lots of weirdness and the same types of disconnects to our Dell EqualLogic when we were (actually still are in production) using 868 code. I'm now using open- iscsi-871 code plus a sendwait patch and haven' had the issue. I've >>> What is the sendwait patch? Is it a patch for open-iscsi or to the >>> kernel network code? >>> >> >> === >> Joseph R. Hoot >> Lead System Programmer/Analyst >> (w) 716-878-4832 >> (c) 716-759-HOOT >> joe.h...@itec.suny.edu >> GPG KEY: 7145F633 >> === >> >> >>> > > > > === Joseph R. Hoot Lead System Programmer/Analyst (w) 716-878-4832 (c) 716-759-HOOT joe.h...@itec.suny.edu GPG KEY: 7145F633 === --~--~-~--~~~---~--~~ You received this message because you are subscribed to the Google Groups "open-iscsi" group. To post to this group, send email to open-iscsi@googlegroups.com To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/open-iscsi -~--~~~~--~~--~--~---
Re: Infortrend + "iSCSI: detected conn error (1011)" + "TCP Dup ACK"
Hoot, Joseph wrote: > it was for OiS 871 code prior to RHEL 5.4 release (not sure if the > release include it or not). I'm not sure who came up with it. I was > working with Don Williams from Dell EqualLogic. He got ahold of it > somehow. I applied it and it seemed to improve things. > Ah ok. I think it was the patch I sent to Don. If you just used 871 without the patch (or what is in the stock RHEL 5.4 kernel) does it work ok? There were a couple changes from 868 to 871 that I thought would also fix the problem, so I was waiting for Don and them to retest just 871 and get back to me. > > On Nov 9, 2009, at 2:31 PM, Mike Christie wrote: > >> Hoot, Joseph wrote: >>> What version of OiS are you using? I had lots of weirdness and the >>> same types of disconnects to our Dell EqualLogic when we were >>> (actually still are in production) using 868 code. I'm now using >>> open- >>> iscsi-871 code plus a sendwait patch and haven' had the issue. I've >> What is the sendwait patch? Is it a patch for open-iscsi or to the >> kernel network code? >> > > === > Joseph R. Hoot > Lead System Programmer/Analyst > (w) 716-878-4832 > (c) 716-759-HOOT > joe.h...@itec.suny.edu > GPG KEY: 7145F633 > === > > > > --~--~-~--~~~---~--~~ You received this message because you are subscribed to the Google Groups "open-iscsi" group. To post to this group, send email to open-iscsi@googlegroups.com To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/open-iscsi -~--~~~~--~~--~--~---
Re: Infortrend + "iSCSI: detected conn error (1011)" + "TCP Dup ACK"
it was for OiS 871 code prior to RHEL 5.4 release (not sure if the release include it or not). I'm not sure who came up with it. I was working with Don Williams from Dell EqualLogic. He got ahold of it somehow. I applied it and it seemed to improve things. On Nov 9, 2009, at 2:31 PM, Mike Christie wrote: > > Hoot, Joseph wrote: >> What version of OiS are you using? I had lots of weirdness and the >> same types of disconnects to our Dell EqualLogic when we were >> (actually still are in production) using 868 code. I'm now using >> open- >> iscsi-871 code plus a sendwait patch and haven' had the issue. I've > > What is the sendwait patch? Is it a patch for open-iscsi or to the > kernel network code? > > > === Joseph R. Hoot Lead System Programmer/Analyst (w) 716-878-4832 (c) 716-759-HOOT joe.h...@itec.suny.edu GPG KEY: 7145F633 === --~--~-~--~~~---~--~~ You received this message because you are subscribed to the Google Groups "open-iscsi" group. To post to this group, send email to open-iscsi@googlegroups.com To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/open-iscsi -~--~~~~--~~--~--~---
Re: Infortrend + "iSCSI: detected conn error (1011)" + "TCP Dup ACK"
Hoot, Joseph wrote: > What version of OiS are you using? I had lots of weirdness and the > same types of disconnects to our Dell EqualLogic when we were > (actually still are in production) using 868 code. I'm now using open- > iscsi-871 code plus a sendwait patch and haven' had the issue. I've What is the sendwait patch? Is it a patch for open-iscsi or to the kernel network code? --~--~-~--~~~---~--~~ You received this message because you are subscribed to the Google Groups "open-iscsi" group. To post to this group, send email to open-iscsi@googlegroups.com To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/open-iscsi -~--~~~~--~~--~--~---
Re: Infortrend + "iSCSI: detected conn error (1011)" + "TCP Dup ACK"
Hi all, I am working on iSCSI En. Tar. Could you please someone explain about the performance of the IET. If so how the performance was calculated and what was the througput for the same. Thanks Gopala krishnan Varatharajan On Sat, Nov 7, 2009 at 3:09 AM, Matthew Dickinson < matt-openis...@alpha345.com> wrote: > > On 11/6/09 3:08 PM, "Mike Christie" wrote: > > > > > Could you send more of the log? Do you see a message like > > connection1:0 is > > operational after recovery (1 attempts) > > after you see the conn errors (how many attempts)? > > Here's one particular connection: > > Nov 4 05:12:14 backup kernel: connection22:0: ping timeout of 5 secs > expired, recv timeout 5, last rx 4321648393, last ping 4321653393, now > 4321658393 > Nov 4 05:12:14 backup kernel: connection22:0: detected conn error (1011) > Nov 4 05:12:21 backup iscsid: connection22:0 is operational after recovery > (1 attempts) > Nov 4 05:12:46 backup kernel: connection22:0: ping timeout of 5 secs > expired, recv timeout 5, last rx 4321680691, last ping 4321685691, now > 4321690691 > Nov 4 05:12:46 backup kernel: connection22:0: detected conn error (1011) > Nov 4 05:12:58 backup iscsid: connection22:0 is operational after recovery > (1 attempts) > Nov 4 07:46:03 backup kernel: connection22:0: ping timeout of 5 secs > expired, recv timeout 5, last rx 4330877890, last ping 4330882890, now > 4330887890 > Nov 4 07:46:03 backup kernel: connection22:0: detected conn error (1011) > Nov 4 07:46:10 backup iscsid: connection22:0 is operational after recovery > (1 attempts) > Nov 4 07:46:27 backup kernel: connection22:0: ping timeout of 5 secs > expired, recv timeout 5, last rx 4330901733, last ping 4330906733, now > 4330911733 > Nov 4 07:46:27 backup kernel: connection22:0: detected conn error (1011) > Nov 4 07:46:32 backup iscsid: connection22:0 is operational after recovery > (1 attempts) > Nov 4 07:47:21 backup kernel: connection22:0: ping timeout of 5 secs > expired, recv timeout 5, last rx 4330955414, last ping 4330960414, now > 4330965414 > Nov 4 07:47:21 backup kernel: connection22:0: detected conn error (1011) > Nov 4 07:47:28 backup iscsid: connection22:0 is operational after recovery > (1 attempts) > Nov 4 07:48:28 backup kernel: connection22:0: ping timeout of 5 secs > expired, recv timeout 5, last rx 4331023213, last ping 4331028213, now > 4331033213 > Nov 4 07:48:28 backup kernel: connection22:0: detected conn error (1011) > Nov 4 07:48:35 backup iscsid: connection22:0 is operational after recovery > (1 attempts) > > FWIW: > > [r...@backup ~]# cat /var/log/messages | grep "after recovery" | awk > '{print > $11" "$12}' | sort | uniq > (113 attempts) > (1 attempts) > (24 attempts) > (2 attempts) > (3 attempts) > (4 attempts) > (5 attempts) > (66 attempts) > (68 attempts) > (6 attempts) > (7 attempts) > (8 attempts) > (9 attempts) > > > > > Try disabling nops by setting > > > > node.conn[0].timeo.noop_out_interval = 0 > > node.conn[0].timeo.noop_out_timeout = 0 > > Ok, I'll let you know how it pans out. > > Thanks, > > Matthew > > > > > > -- --~--~-~--~~~---~--~~ You received this message because you are subscribed to the Google Groups "open-iscsi" group. To post to this group, send email to open-iscsi@googlegroups.com To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/open-iscsi -~--~~~~--~~--~--~---
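The awk pipeline quoted above keys on fixed field positions; a variant that matches the "(N attempts)" text directly gives the same per-attempt-count summary and is less sensitive to the syslog prefix (assuming the stock RHEL /var/log/messages format shown in this thread):

grep "is operational after recovery" /var/log/messages \
  | sed -n 's/.*(\([0-9][0-9]*\) attempts).*/\1/p' \
  | sort -n | uniq -c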
Re: Infortrend + "iSCSI: detected conn error (1011)" + "TCP Dup ACK"
What version of OiS are you using? I had lots of weirdness and the same types of disconnects to our Dell EqualLogic when we were (and actually still are, in production) using the 868 code. I'm now using open-iscsi-871 code plus a sendwait patch and haven't had the issue. I've now been slamming my storage for a week and a half with multiple threads of dt.

On Nov 9, 2009, at 4:33 AM, Santi Saez wrote:

>
> On 06/11/09 14:10, mdaitc wrote:
>
> Hi mdaitc,
>
>> I’m seeing similar TCP “weirdness” as the other posts mention as well
>> as the below errors.
>
> (..)
>
>> Nov 2 08:15:14 backup kernel: connection33:0: detected conn error
>> The performance isn’t what I’d expect:
>
> (..)
>
> What happens if you disable the TCP window scaling option on the RHEL servers?
>
> # echo 0 > /proc/sys/net/ipv4/tcp_window_scaling
>
> In our case, the iSCSI "conn errors" stopped after disabling it, but we still
> have a lot of TCP “weirdness” in the network, mainly duplicate ACK packets.
>
> Regards,
>
> --
> Santi Saez
> http://woop.es
>
>

===
Joseph R. Hoot
Lead System Programmer/Analyst
(w) 716-878-4832
(c) 716-759-HOOT
joe.h...@itec.suny.edu
GPG KEY: 7145F633
===
Re: Infortrend + "iSCSI: detected conn error (1011)" + "TCP Dup ACK"
On 06/11/09 14:10, mdaitc wrote:

Hi mdaitc,

> I’m seeing similar TCP “weirdness” as the other posts mention as well
> as the below errors.

(..)

> Nov 2 08:15:14 backup kernel: connection33:0: detected conn error
> The performance isn’t what I’d expect:

(..)

What happens if you disable the TCP window scaling option on the RHEL servers?

# echo 0 > /proc/sys/net/ipv4/tcp_window_scaling

In our case, the iSCSI "conn errors" stopped after disabling it, but we still have a lot of TCP “weirdness” in the network, mainly duplicate ACK packets.

Regards,

--
Santi Saez
http://woop.es
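One note on the workaround: the /proc write above only lasts until the next reboot. A minimal way to make it persistent on RHEL/CentOS (standard sysctl mechanics, nothing specific to this setup):

# runtime change, equivalent to the echo above
sysctl -w net.ipv4.tcp_window_scaling=0

# persist across reboots
echo "net.ipv4.tcp_window_scaling = 0" >> /etc/sysctl.conf
sysctl -p

# confirm the current value
sysctl net.ipv4.tcp_window_scaling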
Re: Infortrend + "iSCSI: detected conn error (1011)" + "TCP Dup ACK"
On 03/11/09 0:52, Mike Christie wrote:

Dear Mike,

> You can turn off pings/nops by setting
>
> node.conn[0].timeo.noop_out_interval = 0
> node.conn[0].timeo.noop_out_timeout = 0
>
> (set that in iscsid.conf and then rediscover the target, or run "iscsiadm -m
> node -T your_target -o update -n name_of_param_above -v 0")

Thanks!! As I said to James in the previous email, disabling TCP window scaling *partially solves* this problem, so we have kept the nop pings in the configuration. But we still have too many "TCP Dup ACKs" in the network :-S

> This might just work around the problem. What might happen is that you will not see
> the nop/ping and conn errors and instead would just see a slow down in
> the workloads being run.

I have passed your contact details to the Infortrend developers; an engineer will contact you, thanks!

Regards,

--
Santi Saez
http://woop.es
Re: Infortrend + "iSCSI: detected conn error (1011)" + "TCP Dup ACK"
On 02/11/09 19:43, James A. T. Rice wrote:

Dear James,

> That looks vaguely familiar, although I think mine was nop-out timeout
> (might be reported in another log file). Does it mostly happen when you do
> long sequential reads from the Infortrend unit? In my case it turned out
> to be a very low level of packet drops being caused by a cisco 2960G when
> 'mls qos' was enabled (which due to an IOS bug, didn't increment the drop
> counter). I'm not sure if the loss when 'mls qos' is enabled is by design
> as part of WRED, or a function of the port buffers being divided up into
> things smaller than optimal.
>
> Having TCP window scaling enabled made the problem an order of magnitude
> worse, try disabling it and seeing if you have the same problem still?
> (suggest something like dd if=/dev/sdc of=/dev/null bs=1048576 count=10 to
> see if that triggers it, assuming it was the same problem I was
> suffering).
>
> Every other iSCSI target I've tried recovered pretty gracefully from this,
> but not the Infortrend, I suspect their TCP retransmit algorithm needs a
> lot of love. I suspect it's pathologically broken when window scaling is
> enabled.

Disabling TCP window scaling [1] on Linux solves the nop-out problem: we no longer get "iscsi: detected conn error", and performance improves :)

It's very strange: we have 3 Cisco 2960G switches in the SAN and this behavior only occurs on two of them. We're looking into this problem in depth.

The nop-out issue has been solved, but we still see a lot of "duplicate ACKs" on all machines. I will update this post with more info.

James, thanks a lot for the help.

Regards,

[1] # echo 0 > /proc/sys/net/ipv4/tcp_window_scaling

--
Santi Saez
http://woop.es
Re: Infortrend + "iSCSI: detected conn error (1011)" + "TCP Dup ACK"
Hi! I would guess that your network/storage system is overloaded occasionally, and then network packets are significantly (>5s) delayed. It sounds unlikely, but that would explain a duplicate ACK IMHO. Here with SLES10 SP2 on x86_64 with a HP EVA 6000 + iSCSI connectivity option (MPX100) there is no such problem, even though there are about 20 hosts connected to the storage system, most of them directly through a fabric switch. However the iSCSI load is light. Here is what one of the two MPX boxes reports (the load will rise; it's still early morning): #> show perf byte WARNING: Valid data is only displayed for port(s) that are not associated with any configured FCIP routes. Displaying bytes/sec (total)... (Press any key to stop display) GE1 GE2 FC1 FC2 53K 606K43K 620K 774 774 1K 1K 651K868 17K 636K 2K 26K 18K 10K 49K 0 22K 26K 55K 805K799K66K 29K 17K 40K 7K 597K7K 534K72K 172 196K150K47K 49K 516 25K 28K 294K43K 29K 313K 774 774 1K 1K 228K868 65K 171K 18K 20K 528 43K 6K 6K 0 12K 145K516 101K45K 15K 29K 15K 30K (GEx are the gigabit Ethernet ports, FCx are the FibreChannel ports that are connected with the storage system via a fabric switch) Some more statistics (for those who care or like inspiration): #> show stats FC Port Statistics FC Port 1 Interrupt Count 82812963 Target Command Count 0 Initiator Command Count 83526964 Link Failure Count 0 Loss of Sync Count 0 Loss of Signal Count 0 Primitive Sequence Error Count 0 Invalid Transmission Word Count 0 Invalid CRC Error Count 0 FC Port 2 Interrupt Count 81551444 Target Command Count 0 Initiator Command Count 82221649 Link Failure Count 0 Loss of Sync Count 0 Loss of Signal Count 0 Primitive Sequence Error Count 0 Invalid Transmission Word Count 0 Invalid CRC Error Count 0 iSCSI Port Statistics --- iSCSI Port 1 Interrupt Count 190227478 Target Command Count 100974368 Initiator Command Count 0 MAC Xmit Frames 122882662 MAC Xmit Byte Count 28445475500 MAC Xmit Multicast Frames0 MAC Xmit Broadcast Frames0 MAC Xmit Pause Frames0 MAC Xmit Control Frames 0 MAC Xmit Deferrals 0 MAC Xmit Late Collisions 0 MAC Xmit Aborted 0 MAC Xmit Single Collisions 0 MAC Xmit Multiple Collisions 0 MAC Xmit Collisions 0 MAC Xmit Dropped Frames 0 MAC Xmit Jumbo Frames1702770 MAC Rcvd Frames 156585933 MAC Rcvd Byte Count 187107689124 MAC Rcvd Unknown Control Frames 0 MAC Rcvd Pause Frames0 MAC Rcvd Control Frames 0 MAC Rcvd Dribbles0 MAC Rcvd Frame Length Errors 0 MAC Rcvd Jabbers 0 MAC Rcvd Carrier Sense Errors0 MAC Rcvd Dropped Frames 0 MAC Rcvd CRC Errors 0 MAC Rcvd Encoding Errors 0 MAC Rcvd Length Errors Large 0 MAC Rcvd Length Errors Small 0 MAC Rcvd Multicast Frames178239 MAC Rcvd Broadcast Frames48 iSCSI Port 2 Interrupt Count 182729151 Target Command Count 97555857 Initiator Command Count 0 MAC Xmit Frames 120067355 MAC Xmit Byte Count 27948414986 MAC Xmit Multicast Frames0 MAC Xmit Broadcast Frames0 MAC Xmit Pause Frames0 MAC Xmit Control Frames 0 MAC Xmit Deferrals 0 MAC Xmit Late Collisions 0 MAC Xmit Aborted 0 MAC Xmit Single Collisions 0 MAC Xmit Multiple Collisions 0 MAC Xmit Collisions 0 MAC Xmit Dropped Frames 0 MAC Xmit Jumbo Frames1718670 MAC Rcvd Frames 156126093 MAC Rcvd Byte Count 196255090216 MAC Rcvd Unknown Control Frames 0 MAC Rcvd Pause Frames0 MAC Rcvd Control Frames 0 MAC Rcvd Dribbles0 MAC Rcvd Frame Length Errors 0 MAC Rcvd Jabbers 0 MAC Rcvd Carrier Sense Errors0 MAC Rcvd Dropped Frames 0 MAC Rcvd CRC Errors 0 MAC Rcvd Encoding Errors 0 MAC Rcvd Length Errors Large 0 MAC Rcvd Length Errors Small 0
Re: Infortrend + "iSCSI: detected conn error (1011)" + "TCP Dup ACK"
Santi Saez wrote:
>
> Hi,
>
> Randomly we get Open-iSCSI "conn errors" when connecting to an
> Infortrend A16E-G2130-4 storage array. We discussed this earlier on the
> list, see:
>
> http://tr.im/DVQm
> http://tr.im/DVQp
>
> Open-iSCSI logs this:
>
> ===
> Nov 2 18:34:02 vz-17 kernel: ping timeout of 5 secs expired, last rx 408250499, last ping 408249467, now 408254467
> Nov 2 18:34:02 vz-17 kernel: connection1:0: iscsi: detected conn error (1011)
> Nov 2 18:34:03 vz-17 iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)
> Nov 2 18:34:07 vz-17 iscsid: connection1:0 is operational after recovery (1 attempts)
> Nov 2 18:34:52 vz-17 kernel: ping timeout of 5 secs expired, last rx 408294833, last ping 408299833, now 408304833
> Nov 2 18:34:52 vz-17 kernel: connection1:0: iscsi: detected conn error (1011)
> Nov 2 18:34:53 vz-17 iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)
> Nov 2 18:34:57 vz-17 iscsid: connection1:0 is operational after recovery (1 attempts)
> ===
>
> Running on CentOS 5.4 with "iscsi-initiator-utils-6.2.0.871-0.10.el5";
> I think it's not an Open-iSCSI bug, as Mike suggested at:
>
> http://groups.google.com/group/open-iscsi/msg/fe37156096b2955f
>
> I only see this error when connecting to the Infortrend storage, and not
> with NetApp, Nexsan, etc. *connected in the same SAN*.
>
> Using Wireshark I see a lot of "TCP Dup ACK", "TCP ACKed lost
> segment", etc., and the iSCSI session finally ends in a timeout; see a
> screenshot here:
>
> http://tinyurl.com/ykpvckn
>
> Using the Wireshark IO graphs I get this strange report about TCP/IP errors:
>
> http://tinyurl.com/ybm4m8x
>
> And this is another report in the same SAN connecting to a NetApp:
>
> http://tinyurl.com/ycgc8ul
>
> Those TCP/IP errors only occur when connecting to the Infortrend
> storage, and not with other targets in the same SAN (using the same switch
> infrastructure); is there any way to deal with this using Open-iSCSI?
> From what I see on the Internet, there are a lot of Infortrend users
> suffering this behavior.
>

You can turn off pings/nops by setting

node.conn[0].timeo.noop_out_interval = 0
node.conn[0].timeo.noop_out_timeout = 0

(set that in iscsid.conf and then rediscover the target, or run "iscsiadm -m node -T your_target -o update -n name_of_param_above -v 0"). Or you might want to set them higher.

This might just work around the problem. What might happen is that you will not see the nop/ping and conn errors and instead would just see a slow down in the workloads being run.

If you guys can get a hold of any Infortrend people let me know, because I would be happy to work with them on this.
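Spelled out, the nop/ping change above can go either into /etc/iscsi/iscsid.conf (where it is picked up by targets discovered afterwards) or into the existing node records with iscsiadm; a sketch, with your_target standing in for the real IQN:

# update the already-discovered node records directly
iscsiadm -m node -T your_target -o update -n node.conn[0].timeo.noop_out_interval -v 0
iscsiadm -m node -T your_target -o update -n node.conn[0].timeo.noop_out_timeout -v 0

# verify, then log out and back in so the session picks up the new values
iscsiadm -m node -T your_target | grep noop_out
iscsiadm -m node -T your_target -u
iscsiadm -m node -T your_target -l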
Re: Infortrend + "iSCSI: detected conn error (1011)" + "TCP Dup ACK"
On Mon, 2 Nov 2009, Santi Saez wrote: > Randomly we get Open-iSCSI "conn errors" when connecting to an > Infortrend A16E-G2130-4 storage array. We had discussed about this > earlier in the list, see: > Nov 2 18:34:02 vz-17 kernel: ping timeout of 5 secs expired, last rx > 408250499, last ping 408249467, now 408254467 > Nov 2 18:34:02 vz-17 kernel: connection1:0: iscsi: detected conn error > (1011) > Nov 2 18:34:03 vz-17 iscsid: Kernel reported iSCSI connection 1:0 error > (1011) state (3) > Nov 2 18:34:07 vz-17 iscsid: connection1:0 is operational after recovery (1 > attempts) > Nov 2 18:34:52 vz-17 kernel: ping timeout of 5 secs expired, last rx That looks vaguely familiar, although I think mine was nop-out timeout (might be reported in another log file). Does it mostly happen when you do long sequential reads from the Infortrend unit? In my case it turned out to be a very low level of packet drops being caused by a cisco 2960G when 'mls qos' was enabled (which due to an IOS bug, didn't increment the drop counter). I'm not sure if the loss when 'mls qos' is enabled is by design as part of WRED, or a function of the port buffers being divided up into things smaller than optimal. Having TCP window scaling enabled made the problem an order of magnitude worse, try disabling it and seeing if you have the same problem still? (suggest something like dd if=/dev/sdc of=/dev/null bs=1048576 count=10 to see if that triggers it, assuming it was the same problem I was suffering). Every other iSCSI target I've tried recovered pretty gracefully from this, but not the Infortrend, I suspect their TCP retransmit algorithm needs a lot of love. I suspect it's pathologically broken when window scaling is enabled. Sadly when I opened a ticket with Infortrend, enclosing tcpdumps and analysis, they were no more useful than to let me know they don't support debian (despite having instructions for debian iscsi on their website), and they don't support Western Digital drives in redundant controller configurations (nice of them to have this somewhere public). Hope that might be somewhat useful to you.. Please to let me / the list know how you get on. There was sadly little information on this when I was tearing my hair out about it. Best wishes James -- James Rice Jump Networks Ltd --~--~-~--~~~---~--~~ You received this message because you are subscribed to the Google Groups "open-iscsi" group. To post to this group, send email to open-iscsi@googlegroups.com To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/open-iscsi -~--~~~~--~~--~--~---
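Since the 2960G's drop counter could not be trusted in this case, it can also help to watch for the loss from the initiator side while the dd test above runs. A rough sketch of the usual counters to check (eth2 is only a placeholder for whichever NIC carries the iSCSI traffic):

# NIC/driver-level drops and errors
ethtool -S eth2 | grep -iE 'drop|err|miss'
ip -s link show eth2

# TCP retransmissions seen by the host
netstat -s | grep -i retrans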