On 10/31/2011 10:26 AM, James Smith wrote:
> I think I have discovered the problem.
>
> The secondary server has a degraded array, so it's not in an optimal state.
> This certainly explains the odd IO drop-outs I was encountering, where disk
> busy would be 100% but with no read or write activity for 20 seconds.
>
> But it is a bit odd that slow IO on the secondary would cause the master, which
> has no degraded array, to also slow down, especially when running Protocol A
> in DRBD?
Since DRBD 8.3.10 there is an "on-congestion" option that allows DRBD to go
into ahead/behind mode before the TCP queues fill up ... you then get an
out-of-sync secondary, but your primary will not suffer from the performance
degradation of the secondary.

Regards,
Andreas
--
Need help with Pacemaker or DRBD? http://www.hastexo.com/now

> The master has been up without any IET errors now for 24 hours, which is an
> improvement on the six-hour maximum before dying over the weekend ...
>
> Config is below.
>
> node iscsi1cl6 \
>         attributes standby="on"
> node iscsi2cl6 \
>         attributes standby="off"
> primitive ClusterIP ocf:heartbeat:IPaddr2 \
>         params ip="10.100.0.105" cidr_netmask="255.255.255.0" nic="vlan158" \
>         op monitor interval="1s" \
>         meta migration-threshold="3"
> primitive iscsidrbd ocf:linbit:drbd \
>         params drbd_resource="iscsidisk" \
>         op monitor interval="1s" \
>         meta migration-threshold="3"
> primitive iscsilun ocf:heartbeat:iSCSILogicalUnit \
>         params implementation="iet" lun="6" target_iqn="iqn.2010-05.iscsicl6:LUN06.sanvol" \
>                path="/dev/drbd0" scsi_id="19101000101cl6iscsi" additional_parameters="type=fileio" \
>         op monitor interval="15s" timeout="10s" \
>         meta target-role="Started" migration-threshold="3"
> primitive iscsitarget ocf:heartbeat:iSCSITarget \
>         params implementation="iet" iqn="iqn.2010-05.iscsicl6:LUN06.sanvol" portals="" \
>         meta target-role="Started" migration-threshold="3" \
>         op monitor interval="15s" timeout="10s"
> primitive mail_me ocf:heartbeat:MailTo \
>         params email="a...@a.com" \
>         op start interval="0" timeout="60s" \
>         op stop interval="0" timeout="60s" \
>         op monitor interval="10" timeout="10" depth="0" \
>         meta target-role="Started"
> group coregroup ClusterIP iscsitarget iscsilun mail_me
> ms iscsidrbdclone iscsidrbd \
>         meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" \
>              notify="true" target-role="Started" migration-threshold="3"
> colocation core_group-with-iscsidrbdclone inf: coregroup iscsidrbdclone:Master
> order iscsidrbdclone-before-core_group inf: iscsidrbdclone:promote iscsitarget:start
> property $id="cib-bootstrap-options" \
>         dc-version="1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87" \
>         cluster-infrastructure="openais" \
>         expected-quorum-votes="2" \
>         stonith-enabled="false" \
>         no-quorum-policy="ignore" \
>         last-lrm-refresh="1319966098"
> rsc_defaults $id="rsc-options" \
>         resource-stickiness="100"
>
> Regards,
>
> James
>
> -----Original Message-----
> From: linux-ha-boun...@lists.linux-ha.org
> [mailto:linux-ha-boun...@lists.linux-ha.org] On Behalf Of Nick Khamis
> Sent: 30 October 2011 20:44
> To: General Linux-HA mailing list
> Subject: Re: [Linux-HA] SANs falling over, don't know why!
>
> Are you using the IPaddr2 primitive? Maybe post your configuration?
>
> Nick.
>
> On Sun, Oct 30, 2011 at 4:35 PM, James Smith <james.sm...@m247.com> wrote:
>> IPv4.
>>
>> Regards,
>>
>> James
>>
>> -----Original Message-----
>> From: linux-ha-boun...@lists.linux-ha.org
>> [mailto:linux-ha-boun...@lists.linux-ha.org] On Behalf Of Nick Khamis
>> Sent: 30 October 2011 12:28
>> To: General Linux-HA mailing list
>> Subject: Re: [Linux-HA] SANs falling over, don't know why!
>>
>> Are you using IPv4 or 6?
>>
>> Nick.
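
The ahead/behind behaviour described above is enabled per DRBD resource. A minimal sketch for the "iscsidisk" resource from the config, assuming DRBD >= 8.3.10 and Protocol A; the congestion-fill/congestion-extents values are illustrative only and would need sizing for the actual link and workload:

```
resource iscsidisk {
  protocol A;
  net {
    # Switch the primary to ahead mode (secondary marked out-of-sync)
    # instead of throttling, once the send queue is congested:
    on-congestion pull-ahead;
    # Example congestion thresholds -- tune for your environment:
    congestion-fill 1G;
    congestion-extents 500;
  }
}
```

After changing the resource file on both nodes, `drbdadm adjust iscsidisk` should apply it without a restart; DRBD resynchronizes the out-of-sync blocks once congestion clears.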
>>
>> On Sun, Oct 30, 2011 at 4:29 AM, James Smith <james.sm...@m247.com> wrote:
>>> Well, fileio hasn't solved the underlying issue; the SAN broke this morning
>>> at 6AM:
>>>
>>> Oct 30 06:01:19 iscsi1cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:4222124721766912 (Function Complete)
>>> Oct 30 06:01:20 iscsi1cl6 lrmd: [3770]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>>> Oct 30 06:01:47 iscsi1cl6 last message repeated 27 times
>>> Oct 30 06:01:48 iscsi1cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:5066549651898880 (Function Complete)
>>> Oct 30 06:01:48 iscsi1cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:5066549651898880 (Function Complete)
>>> Oct 30 06:01:48 iscsi1cl6 lrmd: [3770]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>>> Oct 30 06:02:19 iscsi1cl6 last message repeated 30 times
>>> Oct 30 06:02:48 iscsi1cl6 last message repeated 28 times
>>> Oct 30 06:02:49 iscsi1cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:5066549651898880 (Function Complete)
>>> Oct 30 06:02:49 iscsi1cl6 lrmd: [3770]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>>> Oct 30 06:02:51 iscsi1cl6 last message repeated 2 times
>>> Oct 30 06:02:52 iscsi1cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:4222124721766912 (Function Complete)
>>> Oct 30 06:02:52 iscsi1cl6 lrmd: [3770]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>>> Oct 30 06:03:17 iscsi1cl6 last message repeated 24 times
>>> Oct 30 06:03:18 iscsi1cl6 kernel: iscsi_trgt: cmnd_rx_start(1849) 1 3b000030 -7
>>> Oct 30 06:03:18 iscsi1cl6 kernel: iscsi_trgt: cmnd_skip_pdu(459) 3b000030 1 2a 4096
>>> Oct 30 06:03:18 iscsi1cl6 lrmd: [3770]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>>> Oct 30 06:03:49 iscsi1cl6 last message repeated 30 times
>>> Oct 30 06:04:32 iscsi1cl6 last message repeated 42 times
>>> Oct 30 06:04:33 iscsi1cl6 cib: [3769]: info: cib_stats: Processed 1 operations (10000.00us average, 0% utilization) in the last 10min
>>> Oct 30 06:04:33 iscsi1cl6 lrmd: [3770]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>>> Oct 30 06:05:04 iscsi1cl6 last message repeated 30 times
>>> Oct 30 06:05:41 iscsi1cl6 last message repeated 36 times
>>> Oct 30 06:05:42 iscsi1cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:5629499605320192 (Function Complete)
>>>
>>> Regards,
>>>
>>> James
>>>
>>> -----Original Message-----
>>> From: linux-ha-boun...@lists.linux-ha.org
>>> [mailto:linux-ha-boun...@lists.linux-ha.org] On Behalf Of James Smith
>>> Sent: 30 October 2011 00:25
>>> To: General Linux-HA mailing list
>>> Subject: Re: [Linux-HA] SANs falling over, don't know why!
>>>
>>> Hi,
>>>
>>> Changed nothing to my knowledge :p
>>>
>>> These boxes don't currently have fencing enabled. I imagine the reboot is
>>> caused by a kernel panic; sysctl is set to reboot on this.
>>>
>>> There is one big 4TB LUN, used by several VMs on XenServer, each with
>>> multiple disks.
>>>
>>> In my quest to resolve this, I have changed IET to use fileio instead of
>>> blockio and fiddled with some DRBD performance-related bits
>>> (http://www.drbd.org/users-guide/s-latency-tuning.html).
>>>
>>> If I'm woken up again tonight with this thing breaking, it's going in the
>>> bin. I'll probably also ditch ietd and try open-iscsi or iscsi-scst.
>>> Monday morning I'll be shifting some load off this cluster also.
>>>
>>> Regards,
>>>
>>> James
>>>
>>> -----Original Message-----
>>> From: linux-ha-boun...@lists.linux-ha.org
>>> [mailto:linux-ha-boun...@lists.linux-ha.org] On Behalf Of Andreas Kurz
>>> Sent: 29 October 2011 22:36
>>> To: linux-ha@lists.linux-ha.org
>>> Subject: Re: [Linux-HA] SANs falling over, don't know why!
>>>
>>> Hello,
>>>
>>> On 10/29/2011 08:47 PM, James Smith wrote:
>>>> Hi,
>>>>
>>>> All of a sudden, a SAN pair which was running without any problems for
>>>> six months now decides to fall over every couple of hours.
>>>
>>> So what did you change? ;-)
>>>
>>>> The logs I have to go on are below:
>>>>
>>>> Oct 29 19:09:23 iscsi2cl6 last message repeated 12 times
>>>> Oct 29 19:09:23 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:844424967684608 (Function Complete)
>>>> Oct 29 19:09:24 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>>>> Oct 29 19:09:49 iscsi2cl6 last message repeated 24 times
>>>> Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:1125899927618048 (Function Complete)
>>>> Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:1407374904328704 (Function Complete)
>>>> Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:281474997486080 (Function Complete)
>>>> Oct 29 19:09:50 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>>>> Oct 29 19:09:50 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:562949974196736 (Function Complete)
>>>> Oct 29 19:09:51 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>>>> Oct 29 19:09:53 iscsi2cl6 last message repeated 2 times
>>>> Oct 29 19:09:53 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:844424967684608 (Function Complete)
>>>> Oct 29 19:09:53 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:844424967684608 (Function Complete)
>>>> Oct 29 19:09:54 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>>>> Oct 29 19:10:05 iscsi2cl6 last message repeated 11 times
>>>> Oct 29 19:10:06 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:1407374904328704 (Function Complete)
>>>> Oct 29 19:10:06 iscsi2cl6 last message repeated 4 times
>>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 2077806177s +3584; pending: 2077806177s +3584
>>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 2077806184s +512; pending: 2077806184s +512
>>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425337s +3584; pending: 1693425337s +3584
>>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425344s +512; pending: 1693425344s +512
>>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425321s +3584; pending: 1693425321s +3584
>>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425328s +512; pending: 1693425328s +512
>>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425313s +3584; pending: 1693425313s +3584
>>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425320s +512; pending: 1693425320s +512
>>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1743088585s +3584; pending: 1743088585s +3584
>>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1743088592s +512; pending: 1743088592s +512
>>>
>>> Concurrent local writes ... Is there any kind of cluster software using a
>>> shared quorum disk or something like that on this LUN? Or is this LUN
>>> shared between several VMware ESX VMs?
>>>
>>>> Oct 29 19:10:06 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>>>>
>>>> After this event, both members of the SAN pair reboot. It is very
>>>> disruptive, as it kills the VMs using this SAN, requiring fscks after
>>>> failure. The load on the SAN doesn't need to be very high for this to
>>>> happen.
>>>
>>> Do they reboot because of a kernel panic, or because of some fencing
>>> mechanism?
>>>
>>>> Running the following:
>>>>
>>>> CentOS 5 with kernel 2.6.18-274.7.1.el5
>>>> IET 1.4.20.2
>>>> Pacemaker 1.0.11-1.2.el5
>>>> DRBD 8.3.11
>>>
>>> Would be interesting to see the Pacemaker/DRBD/IET config ...
>>>
>>> Regards,
>>> Andreas
>>> --
>>> Need help with Pacemaker?
>>> http://www.hastexo.com/now
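
As a side note: the endless "Converted dotted-quad netmask to CIDR as: 24" lines padding the logs above come from the IPaddr2 agent converting the netmask on every monitor run. Passing the prefix length directly keeps them out of the logs; a sketch of the adjusted primitive from the posted config (same values otherwise):

```
primitive ClusterIP ocf:heartbeat:IPaddr2 \
        params ip="10.100.0.105" cidr_netmask="24" nic="vlan158" \
        op monitor interval="1s" \
        meta migration-threshold="3"
```

This is cosmetic, but it makes the real iscsi_trgt/drbd messages much easier to spot.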
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems