[tickets] [opensaf:tickets] #1112 mds: immnd crashes and massive fevs duplicate messages seen

A V Mahesh (AVM) Tue, 23 Sep 2014 22:51:23 -0700

- **status**: unassigned --> fixed
- **assigned_to**: A V Mahesh (AVM)
- **Comment**:


1) I am not able to reproduce the problem with TIPC 2.0 on SUSE.

2) The fevs duplicates symptom appears to be due to a bug in TIPC, which is fixed by this TIPC patch:

https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/net/tipc?id=999417549c16dd0e3a382aa9f6ae61688db03181
The problem has been reported on both TIPC 1.7 and TIPC 2.0, so this patch has
to be applied to both. However, the official TIPC patch is currently available
only for TIPC 1.7, and we are waiting for an official patch for TIPC 2.0 from
the TIPC forum.

3) The following is a workaround patch for TIPC 2.0 users until an official
patch for TIPC 2.0 is available from the TIPC forum:

------------------------------------------------
diff --git a/bcast.c b/bcast.c
--- a/bcast.c
+++ b/bcast.c
@@ -496,6 +496,7 @@ receive:
buf = deferred;
msg = buf_msg(buf);
node->bclink.deferred_head = deferred->next;
+ deferred->next = NULL;
goto receive;
}
return;
----------------------------------------------
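
To illustrate why the one-line change above matters, here is a minimal user-space sketch. It is not TIPC code: `struct buf`, `struct deferred_queue`, `dequeue_stale`, `dequeue_clean` and `chain_length` are hypothetical names invented for this illustration. It assumes a consumer that follows the `next` pointer of whatever buffer it is handed; that this matches the exact failure mechanism inside TIPC is an assumption, not something stated in this ticket.

```c
#include <stddef.h>

/* Hypothetical stand-in for a packet buffer (not the kernel's sk_buff). */
struct buf {
    int seqno;
    struct buf *next;
};

/* Hypothetical deferred queue: a singly linked list of buffers. */
struct deferred_queue {
    struct buf *head;
};

/* Unlink the head but leave its next pointer untouched.
 * The returned buffer still points into the deferred queue. */
static struct buf *dequeue_stale(struct deferred_queue *q)
{
    struct buf *b = q->head;

    if (b != NULL)
        q->head = b->next;
    return b;
}

/* Unlink the head and clear its next pointer, mirroring the
 * "deferred->next = NULL" line added by the workaround patch. */
static struct buf *dequeue_clean(struct deferred_queue *q)
{
    struct buf *b = q->head;

    if (b != NULL) {
        q->head = b->next;
        b->next = NULL;
    }
    return b;
}

/* Assumed consumer: treats the buffer as the head of a chain and
 * counts every buffer reachable through next. */
static int chain_length(const struct buf *b)
{
    int n = 0;

    for (; b != NULL; b = b->next)
        n++;
    return n;
}
```

With three buffers queued, `chain_length(dequeue_stale(&q))` sees all three buffers even though only one was dequeued, while `chain_length(dequeue_clean(&q))` sees exactly one. In this sketch, buffers that remain on the queue would later be delivered a second time, which is one way duplicates could arise.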

4) I built a new TIPC 2.0 `tipc.ko` with the above patch, which is the TIPC 2.0
equivalent of the fix provided for TIPC 1.7
(https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/net/tipc?id=999417549c16dd0e3a382aa9f6ae61688db03181).
It is working fine and no issues were found. The following tests were performed
with the new `tipc.ko`:

        - Multiple fail-overs (issue #1112 is not reproducible)
        - Loaded/synced 70k objects with a 4-node setup
        - Multiple loading and syncing runs
        - Multiple fail-overs while payload is syncing (this covers fail-overs
          while a flood of Mcast traffic is occurring in the cluster)

5) I suggest trying to reproduce the problem with the TIPC patch applied.
6) I am closing the issue for now. If the problem is still reproducible with
the TIPC fix provided at this link
(https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/net/tipc?id=999417549c16dd0e3a382aa9f6ae61688db03181),
the issue can be reopened.




---

**[tickets:#1112] mds: immnd crashes and massive fevs duplicate messages seen**

**Status:** fixed
**Milestone:** 4.5.0
**Created:** Thu Sep 18, 2014 11:07 AM UTC by surender khetavath
**Last Updated:** Mon Sep 22, 2014 06:48 AM UTC
**Owner:** A V Mahesh (AVM)

changeset : 5697

As part of failovers the crash was observed

gdb on sc-1
(gdb) dir /home/staging/osaf/services/saf/immsv/immnd
Source directories searched: /home/staging/osaf/services/saf/immsv/immnd:$cdir:$cwd
(gdb) bt
#0  0x00007f91649a0b55 in raise () from /lib64/libc.so.6
#1  0x00007f91649a2131 in abort () from /lib64/libc.so.6
#2  0x0000000000426a43 in ImmModel::prepareForSync(bool) () at ImmModel.cc:2184
#3  0x0000000000425d69 in immModel_prepareForSync () at ImmModel.cc:1805
#4  0x0000000000418686 in immnd_process_evt () at immnd_evt.c:8152
#5  0x000000000040b83b in main () at immnd_main.c:343
(gdb) fr 2
#2  0x0000000000426a43 in ImmModel::prepareForSync(bool) () at ImmModel.cc:2184
2184                abort();
(gdb) fr 3
#3  0x0000000000425d69 in immModel_prepareForSync () at ImmModel.cc:1805
1805        ImmModel::instance(&cb->immModel)->prepareForSync(isJoining);

gdb on sc-2,pl3&4
#0  0x00007f753fae3b55 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f753fae3b55 in raise () from /lib64/libc.so.6
#1  0x00007f753fae5131 in abort () from /lib64/libc.so.6
#2  0x0000000000418e40 in immnd_process_evt () at immnd_evt.c:8167
#3  0x000000000040b83b in main () at immnd_main.c:343


syslog on sc-1
Sep 18 13:54:53 SC-1 osafimmnd[2298]: ER Node is in a state that cannot accept start of sync, will terminate
Sep 18 13:54:53 SC-1 osafimmd[2288]: WA IMMND DOWN on active controller f2 detected at standby immd!! f1. Possible failover
Sep 18 13:54:53 SC-1 osafimmd[2288]: ER Standby IMMD recieved reset message. All IMMNDs will restart.
Sep 18 13:54:53 SC-1 osafimmd[2288]: ER IMM RELOAD  => ensure cluster restart by IMMD exit at both SCs, exiting
Sep 18 13:54:59 SC-1 kernel: [   54.360115] eth3: no IPv6 routers present
Sep 18 13:54:59 SC-1 osaffmd[2278]: NO Node Down event for node id 2020f:
Sep 18 13:54:59 SC-1 osaffmd[2278]: NO Current role: STANDBY
Sep 18 13:54:59 SC-1 osaffmd[2278]: Rebooting OpenSAF NodeId = 0 EE Name = No EE Mapped, Reason: Failover occurred, but this node is not yet ready, OwnNodeId = 131343, SupervisionTime = 60
Sep 18 13:54:59 SC-1 kernel: [   54.680115] TIPC: Resetting link <1.1.1:eth3-1.1.2:eth2>, peer not responding
Sep 18 13:54:59 SC-1 kernel: [   54.680128] TIPC: Lost link <1.1.1:eth3-1.1.2:eth2> on network plane A
Sep 18 13:54:59 SC-1 kernel: [   54.680137] TIPC: Lost contact with <1.1.2>



---

Sent from sourceforge.net because opensaf-tickets@lists.sourceforge.net is 
subscribed to https://sourceforge.net/p/opensaf/tickets/

To unsubscribe from further messages, a project admin can change settings at 
https://sourceforge.net/p/opensaf/admin/tickets/options.  Or, if this is a 
mailing list, you can unsubscribe from the mailing list.
