Hello,

On 10/29/2011 08:47 PM, James Smith wrote:
> Hi,
>
> All of a sudden, a SAN pair which was running without any problems for six
> months, now decides to fall over every couple of hours.

So what did you change? ;-)

>
> The logs I have to go on are below:
>
> Oct 29 19:09:23 iscsi2cl6 last message repeated 12 times
> Oct 29 19:09:23 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1
> lun:6 by sid:844424967684608 (Function Complete)
> Oct 29 19:09:24 iscsi2cl6 lrmd: [4677]: info: RA output:
> (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
> Oct 29 19:09:49 iscsi2cl6 last message repeated 24 times
> Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1
> lun:6 by sid:1125899927618048 (Function Complete)
> Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1
> lun:6 by sid:1407374904328704 (Function Complete)
> Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1
> lun:6 by sid:281474997486080 (Function Complete)
> Oct 29 19:09:50 iscsi2cl6 lrmd: [4677]: info: RA output:
> (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
> Oct 29 19:09:50 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1
> lun:6 by sid:562949974196736 (Function Complete)
> Oct 29 19:09:51 iscsi2cl6 lrmd: [4677]: info: RA output:
> (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
> Oct 29 19:09:53 iscsi2cl6 last message repeated 2 times
> Oct 29 19:09:53 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1
> lun:6 by sid:844424967684608 (Function Complete)
> Oct 29 19:09:53 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1
> lun:6 by sid:844424967684608 (Function Complete)
> Oct 29 19:09:54 iscsi2cl6 lrmd: [4677]: info: RA output:
> (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
> Oct 29 19:10:05 iscsi2cl6 last message repeated 11 times
> Oct 29 19:10:06 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1
> lun:6 by sid:1407374904328704 (Function Complete)
> Oct 29 19:10:06 iscsi2cl6 last message repeated 4 times
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local
> write detected! [DISCARD L] new: 2077806177s +3584; pending: 2077806177s +3584
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local
> write detected! [DISCARD L] new: 2077806184s +512; pending: 2077806184s +512
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local
> write detected! [DISCARD L] new: 1693425337s +3584; pending: 1693425337s +3584
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local
> write detected! [DISCARD L] new: 1693425344s +512; pending: 1693425344s +512
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local
> write detected! [DISCARD L] new: 1693425321s +3584; pending: 1693425321s +3584
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local
> write detected! [DISCARD L] new: 1693425328s +512; pending: 1693425328s +512
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local
> write detected! [DISCARD L] new: 1693425313s +3584; pending: 1693425313s +3584
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local
> write detected! [DISCARD L] new: 1693425320s +512; pending: 1693425320s +512
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local
> write detected! [DISCARD L] new: 1743088585s +3584; pending: 1743088585s +3584
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local
> write detected! [DISCARD L] new: 1743088592s +512; pending: 1743088592s +512

Concurrent local writes ... Is there any kind of cluster software using a
shared quorum disk or something like that on this LUN? Or is this LUN shared
between several VMware ESX VMs?

> Oct 29 19:10:06 iscsi2cl6 lrmd: [4677]: info: RA output:
> (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>
> After this event, both members of the SAN pair reboot. It is very
> disruptive, as it's killing the VMs using this SAN, requiring fsck's after
> failure. The load on the SAN doesn't need to be very high for this happen.
>

Do they reboot because of a kernel panic, or because of some fencing
mechanism?

> Running the following:
>
> CentOS 5 with kernel 2.6.18-274.7.1.el5
> IET 1.4.20.2
> Pacemaker 1.0.11-1.2.el5
> DRBD 8.3.11

It would be interesting to see your Pacemaker/DRBD/IET configuration.

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now
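P.S. To see at a glance how many initiator sessions are sending aborts to the LUN, and which sectors DRBD reports as written concurrently, a small script along these lines may help. It is only a rough sketch: the function name `summarize` and the `SAMPLE` variable are made up for illustration, the sample lines are copied from the log quoted above, and the two regular expressions assume the message wording stays exactly as it appears in that log. Point it at your own syslog to get the full picture.

```python
import re
from collections import Counter

# Patterns follow the wording of the log lines quoted in this thread
# (assumption: other IET/DRBD versions may phrase these messages differently).
ABORT_RE = re.compile(
    r"Abort Task \(\d+\) issued on tid:(\d+) lun:(\d+) by sid:(\d+)"
)
WRITE_RE = re.compile(
    r"Concurrent local write detected! \[DISCARD L\] new: (\d+)s \+(\d+)"
)

# Three entries taken from the quoted log, including the mail quoting and wrapping.
SAMPLE = """\
> Oct 29 19:09:53 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1
> lun:6 by sid:844424967684608 (Function Complete)
> Oct 29 19:09:53 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1
> lun:6 by sid:844424967684608 (Function Complete)
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local
> write detected! [DISCARD L] new: 2077806177s +3584; pending: 2077806177s +3584
"""


def summarize(log_text):
    """Return (abort counts per (tid, lun, sid), list of (sector, size))."""
    # Drop mail quote markers and undo line wrapping so that each message
    # can be matched even when it was split across two lines.
    flat = " ".join(log_text.replace(">", " ").split())
    aborts = Counter(ABORT_RE.findall(flat))
    writes = [(int(sector), int(size)) for sector, size in WRITE_RE.findall(flat)]
    return aborts, writes


if __name__ == "__main__":
    aborts, writes = summarize(SAMPLE)
    for (tid, lun, sid), count in aborts.most_common():
        print(f"tid {tid} lun {lun} sid {sid}: {count} abort(s)")
    for sector, size in writes:
        print(f"concurrent write at sector {sector}, +{size}")
```

If the aborts on one LUN come from several different sid values, more than one initiator session is talking to that LUN, which is worth keeping in mind when answering the shared-LUN question above.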
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems