I think I have discovered the problem.

The secondary server has a degraded array, so it's not in an optimal state.
This certainly explains the odd I/O dropouts I was encountering, where disk
busy would sit at 100% with no read or write activity for 20 seconds.

But it is a bit odd that slow I/O on the secondary would cause the master,
which has no degraded array, to also slow down, especially when running
Protocol A in DRBD.
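For context, the replication protocol is set per resource in drbd.conf. A
minimal sketch, assuming DRBD 8.3 syntax (the resource and device names are
taken from the crm config below; everything else is an illustrative
placeholder, not the real file):

```
# drbd.conf sketch -- resource/device names match the crm config below;
# all other details are illustrative placeholders.
resource iscsidisk {
  protocol A;          # asynchronous: a write is "complete" once it has hit
                       # the local disk and the local TCP send buffer
  device /dev/drbd0;
}
```

With Protocol C the primary waits for the peer's disk as well, so a slow
secondary stalling the master would be expected; with Protocol A the master
should only stall once the TCP send buffer to the secondary fills up.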

The master has now been up without any IET errors for 24 hours, which is an
improvement on the six-hour maximum between failures over the weekend ...

Config is below.

node iscsi1cl6 \
        attributes standby="on"
node iscsi2cl6 \
        attributes standby="off"
primitive ClusterIP ocf:heartbeat:IPaddr2 \
        params ip="10.100.0.105" cidr_netmask="255.255.255.0" nic="vlan158" \
        op monitor interval="1s" \
        meta migration-threshold="3"
primitive iscsidrbd ocf:linbit:drbd \
        params drbd_resource="iscsidisk" \
        op monitor interval="1s" \
        meta migration-threshold="3"
primitive iscsilun ocf:heartbeat:iSCSILogicalUnit \
        params implementation="iet" lun="6" \
        target_iqn="iqn.2010-05.iscsicl6:LUN06.sanvol" path="/dev/drbd0" \
        scsi_id="19101000101cl6iscsi" additional_parameters="type=fileio" \
        op monitor interval="15s" timeout="10s" \
        meta target-role="Started" migration-threshold="3"
primitive iscsitarget ocf:heartbeat:iSCSITarget \
        params implementation="iet" iqn="iqn.2010-05.iscsicl6:LUN06.sanvol" \
        portals="" \
        meta target-role="Started" migration-threshold="3" \
        op monitor interval="15s" timeout="10s"
primitive mail_me ocf:heartbeat:MailTo \
        params email="a...@a.com" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s" \
        op monitor interval="10" timeout="10" depth="0" \
        meta target-role="Started"
group coregroup ClusterIP iscsitarget iscsilun mail_me
ms iscsidrbdclone iscsidrbd \
        meta master-max="1" master-node-max="1" clone-max="2" \
        clone-node-max="1" notify="true" target-role="Started" migration-threshold="3"
colocation core_group-with-iscsidrbdclone inf: coregroup iscsidrbdclone:Master
order iscsidrbdclone-before-core_group inf: iscsidrbdclone:promote iscsitarget:start
property $id="cib-bootstrap-options" \
        dc-version="1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        last-lrm-refresh="1319966098"
rsc_defaults $id="rsc-options" \
        resource-stickiness="100"

Regards,

James

-----Original Message-----
From: linux-ha-boun...@lists.linux-ha.org 
[mailto:linux-ha-boun...@lists.linux-ha.org] On Behalf Of Nick Khamis
Sent: 30 October 2011 20:44
To: General Linux-HA mailing list
Subject: Re: [Linux-HA] SANs falling over, don't know why!

Are you using the IPAddr2 primitive? Maybe post your configuration?

Nick.

On Sun, Oct 30, 2011 at 4:35 PM, James Smith <james.sm...@m247.com> wrote:
> Ipv4.
>
> Regards,
>
> James
>
>
> -----Original Message-----
> From: linux-ha-boun...@lists.linux-ha.org 
> [mailto:linux-ha-boun...@lists.linux-ha.org] On Behalf Of Nick Khamis
> Sent: 30 October 2011 12:28
> To: General Linux-HA mailing list
> Subject: Re: [Linux-HA] SANs falling over, don't know why!
>
> Are you using IPV4 or 6?
>
> Nick.
>
> On Sun, Oct 30, 2011 at 4:29 AM, James Smith <james.sm...@m247.com> wrote:
>> Well, fileio hasn't solved the underlying issue; the SAN broke this
>> morning at 6 AM:
>>
>> Oct 30 06:01:19 iscsi1cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:4222124721766912 (Function Complete)
>> Oct 30 06:01:20 iscsi1cl6 lrmd: [3770]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>> Oct 30 06:01:47 iscsi1cl6 last message repeated 27 times
>> Oct 30 06:01:48 iscsi1cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:5066549651898880 (Function Complete)
>> Oct 30 06:01:48 iscsi1cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:5066549651898880 (Function Complete)
>> Oct 30 06:01:48 iscsi1cl6 lrmd: [3770]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>> Oct 30 06:02:19 iscsi1cl6 last message repeated 30 times
>> Oct 30 06:02:48 iscsi1cl6 last message repeated 28 times
>> Oct 30 06:02:49 iscsi1cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:5066549651898880 (Function Complete)
>> Oct 30 06:02:49 iscsi1cl6 lrmd: [3770]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>> Oct 30 06:02:51 iscsi1cl6 last message repeated 2 times
>> Oct 30 06:02:52 iscsi1cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:4222124721766912 (Function Complete)
>> Oct 30 06:02:52 iscsi1cl6 lrmd: [3770]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>> Oct 30 06:03:17 iscsi1cl6 last message repeated 24 times
>> Oct 30 06:03:18 iscsi1cl6 kernel: iscsi_trgt: cmnd_rx_start(1849) 1 3b000030 -7
>> Oct 30 06:03:18 iscsi1cl6 kernel: iscsi_trgt: cmnd_skip_pdu(459) 3b000030 1 2a 4096
>> Oct 30 06:03:18 iscsi1cl6 lrmd: [3770]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>> Oct 30 06:03:49 iscsi1cl6 last message repeated 30 times
>> Oct 30 06:04:32 iscsi1cl6 last message repeated 42 times
>> Oct 30 06:04:33 iscsi1cl6 cib: [3769]: info: cib_stats: Processed 1 operations (10000.00us average, 0% utilization) in the last 10min
>> Oct 30 06:04:33 iscsi1cl6 lrmd: [3770]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>> Oct 30 06:05:04 iscsi1cl6 last message repeated 30 times
>> Oct 30 06:05:41 iscsi1cl6 last message repeated 36 times
>> Oct 30 06:05:42 iscsi1cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:5629499605320192 (Function Complete)
>>
>>
>> Regards,
>>
>> James
>>
>> -----Original Message-----
>> From: linux-ha-boun...@lists.linux-ha.org
>> [mailto:linux-ha-boun...@lists.linux-ha.org] On Behalf Of James Smith
>> Sent: 30 October 2011 00:25
>> To: General Linux-HA mailing list
>> Subject: Re: [Linux-HA] SANs falling over, don't know why!
>>
>> Hi,
>>
>> Changed nothing to my knowledge :p
>>
>> These boxes don't currently have fencing enabled.  I imagine the reboot is 
>> caused by a kernel panic; sysctl is set to reboot on panic.
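The "sysctl is set to reboot" behaviour mentioned above is normally configured
via two kernel knobs; a sketch with illustrative values (assumed, not taken
from these boxes):

```
# /etc/sysctl.conf sketch -- illustrative values only
kernel.panic = 10          # reboot 10 seconds after a kernel panic (0 = hang)
kernel.panic_on_oops = 1   # escalate an oops to a full panic, triggering the reboot
```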
>>
>> There is one big 4TB LUN, used by several VMs on XenServer, each with 
>> multiple disks.
>>
>> In my quest to resolve this, I have changed IET to use fileio instead of 
>> blockio and fiddled with some DRBD performance-related bits 
>> (http://www.drbd.org/users-guide/s-latency-tuning.html).
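The linked latency-tuning page works at the level of drbd.conf and
block-device settings; a sketch of the kind of tweak it covers, with an
illustrative value (I'm assuming which knobs were actually touched here):

```
# drbd.conf sketch (DRBD 8.3 syntax) -- illustrative value only
resource iscsidisk {
  syncer {
    cpu-mask 1;   # pin DRBD's kernel threads to CPU 0 for cache locality
  }
}
```

That page also discusses settings outside drbd.conf, such as the replication
link's MTU and the backing device's I/O scheduler.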
>>
>> If I'm woken up again tonight by this thing breaking, it's going in the 
>> bin.  I'll probably also ditch ietd and try open-iscsi or iscsi-scst.  
>> On Monday morning I'll also be shifting some load off this cluster.
>>
>> Regards,
>>
>> James
>>
>> -----Original Message-----
>> From: linux-ha-boun...@lists.linux-ha.org
>> [mailto:linux-ha-boun...@lists.linux-ha.org] On Behalf Of Andreas 
>> Kurz
>> Sent: 29 October 2011 22:36
>> To: linux-ha@lists.linux-ha.org
>> Subject: Re: [Linux-HA] SANs falling over, don't know why!
>>
>> Hello,
>>
>> On 10/29/2011 08:47 PM, James Smith wrote:
>>> Hi,
>>>
>>> All of a sudden, a SAN pair that ran without any problems for six 
>>> months has decided to fall over every couple of hours.
>>
>> So what did you change? ;-)
>>
>>>
>>> The logs I have to go on are below:
>>>
>>> Oct 29 19:09:23 iscsi2cl6 last message repeated 12 times
>>> Oct 29 19:09:23 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:844424967684608 (Function Complete)
>>> Oct 29 19:09:24 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>>> Oct 29 19:09:49 iscsi2cl6 last message repeated 24 times
>>> Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:1125899927618048 (Function Complete)
>>> Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:1407374904328704 (Function Complete)
>>> Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:281474997486080 (Function Complete)
>>> Oct 29 19:09:50 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>>> Oct 29 19:09:50 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:562949974196736 (Function Complete)
>>> Oct 29 19:09:51 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>>> Oct 29 19:09:53 iscsi2cl6 last message repeated 2 times
>>> Oct 29 19:09:53 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:844424967684608 (Function Complete)
>>> Oct 29 19:09:53 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:844424967684608 (Function Complete)
>>> Oct 29 19:09:54 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>>> Oct 29 19:10:05 iscsi2cl6 last message repeated 11 times
>>> Oct 29 19:10:06 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:1407374904328704 (Function Complete)
>>> Oct 29 19:10:06 iscsi2cl6 last message repeated 4 times
>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 2077806177s +3584; pending: 2077806177s +3584
>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 2077806184s +512; pending: 2077806184s +512
>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425337s +3584; pending: 1693425337s +3584
>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425344s +512; pending: 1693425344s +512
>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425321s +3584; pending: 1693425321s +3584
>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425328s +512; pending: 1693425328s +512
>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425313s +3584; pending: 1693425313s +3584
>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425320s +512; pending: 1693425320s +512
>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1743088585s +3584; pending: 1743088585s +3584
>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1743088592s +512; pending: 1743088592s +512
>>
>> Concurrent local writes ... Is there any kind of cluster software using a 
>> shared quorum disk or something like that on this LUN? Or is this LUN 
>> shared between several VMware ESX VMs?
>>
>>> Oct 29 19:10:06 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>>>
>>> After this event, both members of the SAN pair reboot.  It is very 
>>> disruptive, as it kills the VMs using this SAN and requires fscks after 
>>> each failure.  The load on the SAN doesn't need to be very high for this 
>>> to happen.
>>>
>>
>> They reboot because of a kernel panic, or because of some fencing mechanism?
>>
>>> Running the following:
>>>
>>> CentOS 5 with kernel 2.6.18-274.7.1.el5
>>> IET 1.4.20.2
>>> Pacemaker 1.0.11-1.2.el5
>>> DRBD 8.3.11
>>
>> Would be interesting to see Pacemaker/DRBD/IET config ....
>>
>> Regards,
>> Andreas
>> --
>> Need help with Pacemaker?
>> http://www.hastexo.com/now
>>
>>
>> _______________________________________________
>> Linux-HA mailing list
>> Linux-HA@lists.linux-ha.org
>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>> See also: http://linux-ha.org/ReportingProblems
>>
>
