Re: [Linux-HA] SANs falling over, don't know why!

Andreas Kurz Sat, 29 Oct 2011 14:36:28 -0700

Hello,

On 10/29/2011 08:47 PM, James Smith wrote:
> Hi,
> 
> All of a sudden, a SAN pair which was running without any problems for six 
> months, now decides to fall over every couple of hours.


So what did you change? ;-)

> 
> The logs I have to go on are below:
> 
> Oct 29 19:09:23 iscsi2cl6 last message repeated 12 times
> Oct 29 19:09:23 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 
> lun:6 by sid:844424967684608 (Function Complete)
> Oct 29 19:09:24 iscsi2cl6 lrmd: [4677]: info: RA output: 
> (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
> Oct 29 19:09:49 iscsi2cl6 last message repeated 24 times
> Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 
> lun:6 by sid:1125899927618048 (Function Complete)
> Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 
> lun:6 by sid:1407374904328704 (Function Complete)
> Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 
> lun:6 by sid:281474997486080 (Function Complete)
> Oct 29 19:09:50 iscsi2cl6 lrmd: [4677]: info: RA output: 
> (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
> Oct 29 19:09:50 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 
> lun:6 by sid:562949974196736 (Function Complete)
> Oct 29 19:09:51 iscsi2cl6 lrmd: [4677]: info: RA output: 
> (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
> Oct 29 19:09:53 iscsi2cl6 last message repeated 2 times
> Oct 29 19:09:53 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 
> lun:6 by sid:844424967684608 (Function Complete)
> Oct 29 19:09:53 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 
> lun:6 by sid:844424967684608 (Function Complete)
> Oct 29 19:09:54 iscsi2cl6 lrmd: [4677]: info: RA output: 
> (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
> Oct 29 19:10:05 iscsi2cl6 last message repeated 11 times
> Oct 29 19:10:06 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 
> lun:6 by sid:1407374904328704 (Function Complete)
> Oct 29 19:10:06 iscsi2cl6 last message repeated 4 times
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local 
> write detected! [DISCARD L] new: 2077806177s +3584; pending: 2077806177s +3584
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local 
> write detected! [DISCARD L] new: 2077806184s +512; pending: 2077806184s +512
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local 
> write detected! [DISCARD L] new: 1693425337s +3584; pending: 1693425337s +3584
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local 
> write detected! [DISCARD L] new: 1693425344s +512; pending: 1693425344s +512
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local 
> write detected! [DISCARD L] new: 1693425321s +3584; pending: 1693425321s +3584
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local 
> write detected! [DISCARD L] new: 1693425328s +512; pending: 1693425328s +512
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local 
> write detected! [DISCARD L] new: 1693425313s +3584; pending: 1693425313s +3584
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local 
> write detected! [DISCARD L] new: 1693425320s +512; pending: 1693425320s +512
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local 
> write detected! [DISCARD L] new: 1743088585s +3584; pending: 1743088585s +3584
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local 
> write detected! [DISCARD L] new: 1743088592s +512; pending: 1743088592s +512

Concurrent local writes ... Is there any kind of cluster software using
a shared quorum disk or something like that on this LUN? Or is this LUN
shared between several VMware ESX VMs?
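
To help answer that, a rough sketch like the following could tally how many
distinct iSCSI session IDs issue Abort Task against each LUN, and how often
each DRBD device logs a concurrent local write. It assumes unwrapped syslog
lines (one entry per line, e.g. from /var/log/messages) in the format quoted
above. The function names are made up for illustration. Several session IDs on
one LUN only hint at several initiators; they do not prove it.

```python
import re
from collections import Counter

# Patterns derived from the log excerpts quoted above. They assume each
# syslog entry sits on a single line (the mail client wrapped them).
ABORT_RE = re.compile(
    r"iscsi_trgt: Abort Task \(\d+\) issued on tid:(\d+) lun:(\d+) by sid:(\d+)"
)
CONCURRENT_RE = re.compile(
    r"block (drbd\d+): .*Concurrent local write detected!"
)


def summarize(lines):
    """Count Abort Task events per (tid, lun, sid) and concurrent-write
    warnings per DRBD device.

    Note: syslog's "last message repeated N times" lines are ignored, so
    the counts are lower bounds.
    """
    aborts = Counter()
    concurrent = Counter()
    for line in lines:
        m = ABORT_RE.search(line)
        if m:
            tid, lun, sid = (int(g) for g in m.groups())
            aborts[(tid, lun, sid)] += 1
            continue
        m = CONCURRENT_RE.search(line)
        if m:
            concurrent[m.group(1)] += 1
    return aborts, concurrent


def distinct_sessions(aborts):
    """Map each (tid, lun) to the set of session IDs that issued aborts."""
    sessions = {}
    for tid, lun, sid in aborts:
        sessions.setdefault((tid, lun), set()).add(sid)
    return sessions
```

Calling summarize() on an open log file and passing the first result to
distinct_sessions() shows at a glance whether one or many sessions are hitting
the same LUN.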

> Oct 29 19:10:06 iscsi2cl6 lrmd: [4677]: info: RA output: 
> (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
> 
> After this event, both members of the SAN pair reboot.  It is very 
> disruptive, as it's killing the VMs using this SAN, requiring fscks after 
> failure.  The load on the SAN doesn't need to be very high for this to happen.
> 

Do they reboot because of a kernel panic, or because of some fencing mechanism?

> Running the following:
> 
> CentOS 5 with kernel 2.6.18-274.7.1.el5
> IET 1.4.20.2
> Pacemaker 1.0.11-1.2.el5
> DRBD 8.3.11

It would be interesting to see the Pacemaker, DRBD and IET configuration.

Regards,
Andreas
-- 
Need help with Pacemaker?
http://www.hastexo.com/now


_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
