On 10/31/2011 10:26 AM, James Smith wrote:
> I think I have discovered the problem.
>
> The secondary server has a degraded array, so it's not in an optimal state.
> This certainly explains the odd IO drop-outs I was encountering, where disk
> busy would be 100% but with no read or write activity for 20 seconds.
>
> But it is a bit odd that slow IO on the secondary would cause the master, which
> has no degraded array, to also slow down, especially when running Protocol A
> in DRBD?
Since DRBD 8.3.10 there is an "on-congestion" option that allows DRBD to go
into ahead/behind mode before the TCP queues fill up ... you then get an
out-of-sync secondary, but your primary will not suffer from the performance
degradation of the secondary.

Regards,
Andreas
--
Need help with Pacemaker or DRBD? http://www.hastexo.com/now

> The master has been up without any IET errors now for 24 hours, which is an
> improvement on the six-hour maximum before dying over the weekend ...
>
> Config is below.
>
> node iscsi1cl6 \
>         attributes standby="on"
> node iscsi2cl6 \
>         attributes standby="off"
> primitive ClusterIP ocf:heartbeat:IPaddr2 \
>         params ip="10.100.0.105" cidr_netmask="255.255.255.0" nic="vlan158" \
>         op monitor interval="1s" \
>         meta migration-threshold="3"
> primitive iscsidrbd ocf:linbit:drbd \
>         params drbd_resource="iscsidisk" \
>         op monitor interval="1s" \
>         meta migration-threshold="3"
> primitive iscsilun ocf:heartbeat:iSCSILogicalUnit \
>         params implementation="iet" lun="6" target_iqn="iqn.2010-05.iscsicl6:LUN06.sanvol" \
>                path="/dev/drbd0" scsi_id="19101000101cl6iscsi" additional_parameters="type=fileio" \
>         op monitor interval="15s" timeout="10s" \
>         meta target-role="Started" migration-threshold="3"
> primitive iscsitarget ocf:heartbeat:iSCSITarget \
>         params implementation="iet" iqn="iqn.2010-05.iscsicl6:LUN06.sanvol" portals="" \
>         meta target-role="Started" migration-threshold="3" \
>         op monitor interval="15s" timeout="10s"
> primitive mail_me ocf:heartbeat:MailTo \
>         params email="a...@a.com" \
>         op start interval="0" timeout="60s" \
>         op stop interval="0" timeout="60s" \
>         op monitor interval="10" timeout="10" depth="0" \
>         meta target-role="Started"
> group coregroup ClusterIP iscsitarget iscsilun mail_me
> ms iscsidrbdclone iscsidrbd \
>         meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" \
>              notify="true" target-role="Started" migration-threshold="3"
> colocation core_group-with-iscsidrbdclone inf: coregroup iscsidrbdclone:Master
> order iscsidrbdclone-before-core_group inf: iscsidrbdclone:promote iscsitarget:start
> property $id="cib-bootstrap-options" \
>         dc-version="1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87" \
>         cluster-infrastructure="openais" \
>         expected-quorum-votes="2" \
>         stonith-enabled="false" \
>         no-quorum-policy="ignore" \
>         last-lrm-refresh="1319966098"
> rsc_defaults $id="rsc-options" \
>         resource-stickiness="100"
>
> Regards,
>
> James
>
> -----Original Message-----
> From: linux-ha-boun...@lists.linux-ha.org
> [mailto:linux-ha-boun...@lists.linux-ha.org] On Behalf Of Nick Khamis
> Sent: 30 October 2011 20:44
> To: General Linux-HA mailing list
> Subject: Re: [Linux-HA] SANs falling over, don't know why!
>
> Are you using the IPaddr2 primitive? Maybe post your configuration?
>
> Nick.
>
> On Sun, Oct 30, 2011 at 4:35 PM, James Smith <james.sm...@m247.com> wrote:
>> IPv4.
>>
>> Regards,
>>
>> James
>>
>> -----Original Message-----
>> From: linux-ha-boun...@lists.linux-ha.org
>> [mailto:linux-ha-boun...@lists.linux-ha.org] On Behalf Of Nick Khamis
>> Sent: 30 October 2011 12:28
>> To: General Linux-HA mailing list
>> Subject: Re: [Linux-HA] SANs falling over, don't know why!
>>
>> Are you using IPv4 or 6?
>>
>> Nick.
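
The ahead/behind behaviour described above is enabled per DRBD resource. A minimal sketch for the "iscsidisk" resource from the config, assuming DRBD >= 8.3.10 and Protocol A; the congestion-fill/congestion-extents values are illustrative only and would need sizing for the actual link and workload:

```
resource iscsidisk {
  protocol A;
  net {
    # Switch the primary to ahead mode (secondary marked out-of-sync)
    # instead of throttling, once the send queue is congested:
    on-congestion pull-ahead;
    # Example congestion thresholds -- tune for your environment:
    congestion-fill 1G;
    congestion-extents 500;
  }
}
```

After changing the resource file on both nodes, `drbdadm adjust iscsidisk` should apply it without a restart; DRBD resynchronizes the out-of-sync blocks once congestion clears.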
>>
>> On Sun, Oct 30, 2011 at 4:29 AM, James Smith <james.sm...@m247.com> wrote:
>>> Well, fileio hasn't solved the underlying issue; the SAN broke this morning
>>> at 6AM:
>>>
>>> Oct 30 06:01:19 iscsi1cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:4222124721766912 (Function Complete)
>>> Oct 30 06:01:20 iscsi1cl6 lrmd: [3770]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>>> Oct 30 06:01:47 iscsi1cl6 last message repeated 27 times
>>> Oct 30 06:01:48 iscsi1cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:5066549651898880 (Function Complete)
>>> Oct 30 06:01:48 iscsi1cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:5066549651898880 (Function Complete)
>>> Oct 30 06:01:48 iscsi1cl6 lrmd: [3770]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>>> Oct 30 06:02:19 iscsi1cl6 last message repeated 30 times
>>> Oct 30 06:02:48 iscsi1cl6 last message repeated 28 times
>>> Oct 30 06:02:49 iscsi1cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:5066549651898880 (Function Complete)
>>> Oct 30 06:02:49 iscsi1cl6 lrmd: [3770]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>>> Oct 30 06:02:51 iscsi1cl6 last message repeated 2 times
>>> Oct 30 06:02:52 iscsi1cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:4222124721766912 (Function Complete)
>>> Oct 30 06:02:52 iscsi1cl6 lrmd: [3770]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>>> Oct 30 06:03:17 iscsi1cl6 last message repeated 24 times
>>> Oct 30 06:03:18 iscsi1cl6 kernel: iscsi_trgt: cmnd_rx_start(1849) 1 3b000030 -7
>>> Oct 30 06:03:18 iscsi1cl6 kernel: iscsi_trgt: cmnd_skip_pdu(459) 3b000030 1 2a 4096
>>> Oct 30 06:03:18 iscsi1cl6 lrmd: [3770]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>>> Oct 30 06:03:49 iscsi1cl6 last message repeated 30 times
>>> Oct 30 06:04:32 iscsi1cl6 last message repeated 42 times
>>> Oct 30 06:04:33 iscsi1cl6 cib: [3769]: info: cib_stats: Processed 1 operations (10000.00us average, 0% utilization) in the last 10min
>>> Oct 30 06:04:33 iscsi1cl6 lrmd: [3770]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>>> Oct 30 06:05:04 iscsi1cl6 last message repeated 30 times
>>> Oct 30 06:05:41 iscsi1cl6 last message repeated 36 times
>>> Oct 30 06:05:42 iscsi1cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:5629499605320192 (Function Complete)
>>>
>>> Regards,
>>>
>>> James
>>>
>>> -----Original Message-----
>>> From: linux-ha-boun...@lists.linux-ha.org
>>> [mailto:linux-ha-boun...@lists.linux-ha.org] On Behalf Of James Smith
>>> Sent: 30 October 2011 00:25
>>> To: General Linux-HA mailing list
>>> Subject: Re: [Linux-HA] SANs falling over, don't know why!
>>>
>>> Hi,
>>>
>>> Changed nothing to my knowledge :p
>>>
>>> These boxes don't currently have fencing enabled. I imagine the reboot is
>>> caused by a kernel panic; sysctl is set to reboot on this.
>>>
>>> There is one big 4TB LUN, used by several VMs on XenServer, each with
>>> multiple disks.
>>>
>>> In my quest to resolve this, I have changed IET to use fileio instead of
>>> blockio and fiddled with some DRBD performance-related bits
>>> (http://www.drbd.org/users-guide/s-latency-tuning.html).
>>>
>>> If I'm woken up again tonight with this thing breaking, it's going in the
>>> bin. I'll probably also ditch ietd and try open-iscsi or iscsi-scst.
>>> Monday morning I'll be shifting some load off this cluster also.
>>>
>>> Regards,
>>>
>>> James
>>>
>>> -----Original Message-----
>>> From: linux-ha-boun...@lists.linux-ha.org
>>> [mailto:linux-ha-boun...@lists.linux-ha.org] On Behalf Of Andreas Kurz
>>> Sent: 29 October 2011 22:36
>>> To: linux-ha@lists.linux-ha.org
>>> Subject: Re: [Linux-HA] SANs falling over, don't know why!
>>>
>>> Hello,
>>>
>>> On 10/29/2011 08:47 PM, James Smith wrote:
>>>> Hi,
>>>>
>>>> All of a sudden, a SAN pair which was running without any problems for
>>>> six months now decides to fall over every couple of hours.
>>>
>>> So what did you change? ;-)
>>>
>>>> The logs I have to go on are below:
>>>>
>>>> Oct 29 19:09:23 iscsi2cl6 last message repeated 12 times
>>>> Oct 29 19:09:23 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:844424967684608 (Function Complete)
>>>> Oct 29 19:09:24 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>>>> Oct 29 19:09:49 iscsi2cl6 last message repeated 24 times
>>>> Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:1125899927618048 (Function Complete)
>>>> Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:1407374904328704 (Function Complete)
>>>> Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:281474997486080 (Function Complete)
>>>> Oct 29 19:09:50 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>>>> Oct 29 19:09:50 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:562949974196736 (Function Complete)
>>>> Oct 29 19:09:51 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>>>> Oct 29 19:09:53 iscsi2cl6 last message repeated 2 times
>>>> Oct 29 19:09:53 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:844424967684608 (Function Complete)
>>>> Oct 29 19:09:53 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:844424967684608 (Function Complete)
>>>> Oct 29 19:09:54 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>>>> Oct 29 19:10:05 iscsi2cl6 last message repeated 11 times
>>>> Oct 29 19:10:06 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:1407374904328704 (Function Complete)
>>>> Oct 29 19:10:06 iscsi2cl6 last message repeated 4 times
>>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 2077806177s +3584; pending: 2077806177s +3584
>>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 2077806184s +512; pending: 2077806184s +512
>>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425337s +3584; pending: 1693425337s +3584
>>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425344s +512; pending: 1693425344s +512
>>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425321s +3584; pending: 1693425321s +3584
>>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425328s +512; pending: 1693425328s +512
>>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425313s +3584; pending: 1693425313s +3584
>>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425320s +512; pending: 1693425320s +512
>>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1743088585s +3584; pending: 1743088585s +3584
>>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1743088592s +512; pending: 1743088592s +512
>>>
>>> Concurrent local writes ... Is there any kind of cluster software using a
>>> shared quorum disk or something like that on this LUN? Or is this LUN
>>> shared between several VMware ESX VMs?
>>>
>>>> Oct 29 19:10:06 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>>>>
>>>> After this event, both members of the SAN pair reboot. It is very
>>>> disruptive, as it kills the VMs using this SAN, requiring fscks after
>>>> failure. The load on the SAN doesn't need to be very high for this to
>>>> happen.
>>>
>>> Do they reboot because of a kernel panic, or because of some fencing
>>> mechanism?
>>>
>>>> Running the following:
>>>>
>>>> CentOS 5 with kernel 2.6.18-274.7.1.el5
>>>> IET 1.4.20.2
>>>> Pacemaker 1.0.11-1.2.el5
>>>> DRBD 8.3.11
>>>
>>> Would be interesting to see the Pacemaker/DRBD/IET config ...
>>>
>>> Regards,
>>> Andreas
>>> --
>>> Need help with Pacemaker?
>>> http://www.hastexo.com/now
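
As a side note: the endless "Converted dotted-quad netmask to CIDR as: 24" lines padding the logs above come from the IPaddr2 agent converting the netmask on every monitor run. Passing the prefix length directly keeps them out of the logs; a sketch of the adjusted primitive from the posted config (same values otherwise):

```
primitive ClusterIP ocf:heartbeat:IPaddr2 \
        params ip="10.100.0.105" cidr_netmask="24" nic="vlan158" \
        op monitor interval="1s" \
        meta migration-threshold="3"
```

This is cosmetic, but it makes the real iscsi_trgt/drbd messages much easier to spot.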
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems