Regarding the rcu stall backtrace, I don'think the deadlock was caused by TIPC module. Instead I think the reason why the link between active machine and standby machine was lost is that standby machine was dead due to rcu stall. Therefore, it's key to find out its root cause by looking into rcu stall reason.
However, according to the following rcu stall backtrace, it doesn't provide very meaningful information for us. So I don't understand why the rcu stall occurred. Regards, Ying From: Sumit Gemini [mailto:sumitgem...@users.sf.net] Sent: Friday, February 17, 2017 4:51 PM To: Ticket 122 Subject: [tipc:bugs] #122 TIPC link down ________________________________ [bugs:#122]<https://sourceforge.net/p/tipc/bugs/122/> TIPC link down Status: open Group: Labels: tipc rcu_bh_state Created: Fri Feb 17, 2017 08:50 AM UTC by Sumit Gemini Last Updated: Fri Feb 17, 2017 08:50 AM UTC Owner: Erik Hugne Hi All, I have HA pair, and i observed tipc link lost event was not received by standby machine. I got this problem on ACTIVE machine : Jan 6 16:45:00 ffm-sbc-2b kernel: [3341017.308014] TIPC: Resetting link <1.1.2:bond0-1.1.1:bond0>, peer not responding Jan 6 16:45:00 ffm-sbc-2b kernel: [3341017.308021] TIPC: Lost link <1.1.2:bond0-1.1.1:bond0> on network plane A Jan 6 16:45:00 ffm-sbc-2b kernel: [3341017.308026] TIPC: Lost contact with <1.1.1> Jan 6 16:45:01 ffm-sbc-2b osaffmd[4898]: NO Node Down event for node id 2010f: Jan 6 16:45:01 ffm-sbc-2b osaffmd[4898]: NO Done Locking applications on node id:2010f ret val:0 Jan 6 16:45:01 ffm-sbc-2b osafclmd[4963]: NO Node 131343 went down. Not sending track callback for agents on that node Jan 6 16:45:01 ffm-sbc-2b osafclmd: Last message 'NO Node 131343 went ' repeated 5 times, suppressed by syslog-ng on ffm-sbc-2b.mydomain.com Jan 6 16:45:01 ffm-sbc-2b osaffmd[4898]: NO Current role: ACTIVE Jan 6 16:45:01 ffm-sbc-2b osaffmd[4898]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId = 131599, SupervisionTime = 60 Jan 6 16:45:01 ffm-sbc-2b osafamfd[4986]: NO Node 'SC-1' left the cluster Jan 6 16:45:01 ffm-sbc-2b osafimmd[4910]: WA IMMD lost contact with peer IMMD (NCSMDS_RED_DOWN) Jan 6 16:45:01 ffm-sbc-2b osafimmnd[4922]: NO Global discard node received for nodeId:2010f pid:5047 Jan 6 16:45:01 ffm-sbc-2b osafimmnd[4922]: NO Implementer disconnected 121 <0, 2010f(down)> (MsgQueueService131343) Jan 6 16:45:01 ffm-sbc-2b osafimmnd[4922]: NO Implementer disconnected 120 <0, 2010f(down)> (@safAmfService2010f) Jan 6 16:45:01 ffm-sbc-2b osafimmnd[4922]: NO Implementer connected: 122 (MsgQueueService131343) <592, 2020f> Jan 6 16:45:01 ffm-sbc-2b osafimmnd[4922]: NO Implementer locally disconnected. Marking it as doomed 122 <592, 2020f> (MsgQueueService131343) Jan 6 16:45:01 ffm-sbc-2b osafimmnd[4922]: NO Implementer disconnected 122 <592, 2020f> (MsgQueueService131343) Jan 6 16:45:01 ffm-sbc-2b opensaf_reboot: No lock is in progress going to process further... On standby machine : I observed rcu_bh_state, and kernel stack dumo when TIPC lost link was occured on ACTIVE machine and after 6 sec we got link lost message on standby machine. Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.520060] INFO: rcu_bh_state detected stall on CPU 0 (t=0 jiffies) Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] sending NMI to all CPUs: Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] NMI backtrace for cpu 0 Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] CPU 0 Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] Modules linked in: nf_conntrack_netlink af_packet xt_sharedlimit xt_hashlimit ip_set_hash_ipport ip_set_hash_ipportip xt_NOTRACK ip_set_bitmap_port xt_sctp nf_conntrack_ipv6 nf_defrag_ipv6 xt_CT arpt_mangle ip_set_hash_ipnet xt_NFLOG nfnetlink_log ipt_ULOG xt_limit xt_hashcounter ip_set_hash_ipip xt_set ip_set_hash_ip deflate zlib_deflate ctr twofish_x86_64 twofish_common camellia serpent blowfish cast5 des_generic cbc xcbc rmd160 sha512_generic sha256_generic sha1_generic md5 crypto_null af_key iptable_mangle ip_set nfnetlink arptable_filter arp_tables iptable_raw iptable_nat tipc xt_tcpudp xt_state xt_pkttype bonding binfmt_misc iptable_filter ip6table_filter ip6_tables nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables x_tables mperf edd ipmi_devintf ipmi_si ipmi_msghandler nf_conntrack_proto_sctp nf_conntrack sctp 8021q garp stp llc gb_sys usb_storage ioatdma ixgbe uas sg igb iTCO_wdt wmi i2c_i801 pcspk r mdio iTCO_vendor_support button container dca ipv6 autofs4 usbhid megasr(P) ehci_hcd usbcore processor thermal_sys [last unloaded: ipt_PORTMAP] Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] Pid: 0, comm: swapper Tainted: P 3.1.10-gb17-default #1 Intel Corporation S2600CO/S2600CO Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] RIP: 0010:[<ffffffff81007f51>] [<ffffffff81007f51>] native_read_tsc+0x2/0xf Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] RSP: 0018:ffff88043ee03db0 EFLAGS: 00000803 Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] RAX: 0000000037185395 RBX: 00000000000003e9 RCX: 0000000000000001 Jan 6 16:45:07 ffm-sbc-2a osafimmd[5035]: WA IMMND DOWN on active controller f2 detected at standby immd!! f1. Possible failover Jan 6 16:45:07 ffm-sbc-2a osaffmd[5023]: NO Done Locking applications on node id:2020f ret val:0 Jan 6 16:45:07 ffm-sbc-2a opensaf_recovery: Control interface status:0 Role:STANDBY Jan 6 16:45:07 ffm-sbc-2a osaffmd[5023]: NO Current role: STANDBY Jan 6 16:45:07 ffm-sbc-2a osaffmd[5023]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId = 131343, SupervisionTime = 60 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] RDX: 0000000000bf0977 RSI: 0000000000000002 RDI: 0000000000032bd4 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] RBP: 0000000000032bd4 R08: 0000000000000000 R09: ffffffff819232b0 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] R10: 7fffffffffffffff R11: 7fffffffffffffff R12: 0000000000000000 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] R13: ffffffff819232b0 R14: 0000000000000001 R15: ffffffff81065c28 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] FS: 0000000000000000(0000) GS:ffff88043ee00000(0000) knlGS:0000000000000000 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] CR2: 000000000069e034 CR3: 0000000001805000 CR4: 00000000000406f0 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] Process swapper (pid: 0, threadinfo ffffffff81800000, task ffffffff8180d020) Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] Stack: Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] ffffffff81200eb5 ffffffff81200f44 00000000000003e9 0000000000001000 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] 0000000000000002 ffffffff819232b0 ffffffff81017698 7fffffffffffffff Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] 0000000000000002 0000000000000002 ffffffff81017fdf 0000000000000001 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] Call Trace: Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81200eb5>] paravirt_read_tsc+0x5/0x8 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81200f44>] delay_tsc+0x1d/0x68 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81017698>] native_safe_apic_wait_icr_idle+0x27/0x32 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81017fdf>] default_send_IPI_dest_field.constprop.0+0x19/0x4d Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8101804b>] default_send_IPI_mask_sequence_phys+0x38/0x6a Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8101815e>] arch_trigger_all_cpu_backtrace+0x4d/0x7b Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8109801b>] check_cpu_stall+0x66/0xdb Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff810980aa>] rcu_pending+0x1a/0x10a Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8109852c>] rcu_check_callbacks+0x9d/0xae Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8104c56c>] update_process_times+0x31/0x63 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81065c92>] tick_sched_timer+0x6a/0x90 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8105b872>] __run_hrtimer+0xa4/0x148 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8105c08e>] hrtimer_interrupt+0xdb/0x19a Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81017768>] smp_apic_timer_interrupt+0x6e/0x80 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff813efcde>] apic_timer_interrupt+0x6e/0x80 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81239655>] intel_idle+0xdd/0x104 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81304773>] cpuidle_idle_call+0xda/0x169 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81001200>] cpu_idle+0x51/0x95 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8193db0f>] start_kernel+0x388/0x393 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8193d3af>] x86_64_start_kernel+0xcf/0xdc Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] Code: 74 03 e6 80 c3 e6 ed c3 bf 8e 21 00 00 e9 ba 8f 1f 00 c3 90 90 90 40 88 f8 e6 70 e4 71 c3 40 88 f0 e6 70 40 88 f8 e6 71 c3 0f 31 Jan 6 16:45:07 ffm-sbc-2a kernel[3167216.524042]: c1 48 89 d0 48 c1 e0 20 48 09 c8 c3 41 56 41 55 41 54 55 53 . . . . . . Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] NMI backtrace for cpu 31 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] CPU 31 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] Modules linked in: nf_conntrack_netlink af_packet xt_sharedlimit xt_hashlimit ip_set_hash_ipport ip_set_hash_ipportip xt_NOTRACK ip_set_bitmap_port xt_sctp nf_conntrack_ipv6 nf_defrag_ipv6 xt_CT arpt_mangle ip_set_hash_ipnet xt_NFLOG nfnetlink_log ipt_ULOG xt_limit xt_hashcounter ip_set_hash_ipip xt_set ip_set_hash_ip deflate zlib_deflate ctr twofish_x86_64 twofish_common camellia serpent blowfish cast5 des_generic cbc xcbc rmd160 sha512_generic sha256_generic sha1_generic md5 crypto_null af_key iptable_mangle ip_set nfnetlink arptable_filter arp_tables iptable_raw iptable_nat tipc xt_tcpudp xt_state xt_pkttype bonding binfmt_misc iptable_filter ip6table_filter ip6_tables nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables x_tables mperf edd ipmi_devintf ipmi_si ipmi_msghandler nf_conntrack_proto_sctp nf_conntrack sctp 8021q garp stp llc gb_sys usb_storage ioatdma ixgbe uas sg igb iTCO_wdt wmi i2c_i801 pcspk r mdio iTCO_vendor_support button container dca ipv6 autofs4 usbhid megasr(P) ehci_hcd usbcore processor thermal_sys [last unloaded: ipt_PORTMAP] Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] Pid: 0, comm: kworker/0:1 Tainted: P 3.1.10-gb17-default #1 Intel Corporation S2600CO/S2600CO Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] RIP: 0010:[<ffffffff81239624>] [<ffffffff81239624>] intel_idle+0xac/0x104 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] RSP: 0018:ffff880425e33eb8 EFLAGS: 00000046 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] RAX: 0000000000000030 RBX: 0000000000000010 RCX: 0000000000000001 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] RDX: 0000000000000000 RSI: ffff880425e33fd8 RDI: ffffffff81810500 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] RBP: 0000000000000030 R08: 000000000000006d R09: 0000000000034b10 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] R10: ffff88083eded830 R11: ffff88083eded830 R12: 149739342cb2ca49 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] R13: 0000000000000004 R14: 000000000000001f R15: 0000000000000000 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] FS: 0000000000000000(0000) GS:ffff88083ede0000(0000) knlGS:0000000000000000 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] CR2: 00007fe89df67120 CR3: 0000000001805000 CR4: 00000000000406e0 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] Process kworker/0:1 (pid: 0, threadinfo ffff880425e32000, task ffff880425e30580) Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] Stack: Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.685484] TIPC: Resetting link <1.1.1:bond0-1.1.2:bond0>, requested by peer Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.685487] TIPC: Lost link <1.1.1:bond0-1.1.2:bond0> on network plane A Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.685491] TIPC: Lost contact with <1.1.2> Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] 0000000000000000 000000000cdd3a47 0000000000000000 000000000cdd3a47 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] ffff880425e33fd8 0000001f3edf8970 ffff88083edf8970 ffff88083edf8b00 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] 0000000000000000 ffffffff81304773 ffffffff819232b0 ffff880425e33fd8 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] Call Trace: Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] [<ffffffff81304773>] cpuidle_idle_call+0xda/0x169 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] [<ffffffff81001200>] cpu_idle+0x51/0x95 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] Code: 28 e0 ff ff 80 e2 08 75 22 31 d2 48 83 c0 10 48 89 d1 0f 01 c8 0f ae f0 48 8b 86 38 e0 ff ff a8 08 75 08 b1 01 48 89 e8 0f 01 c9 <e8> 3f 6e e2 ff 4c 29 e0 48 89 c7 e8 10 ae e0 ff 48 69 e8 40 42 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] Call Trace: Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] [<ffffffff81304773>] cpuidle_idle_call+0xda/0x169 Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] [<ffffffff81001200>] cpu_idle+0x51/0x95 Jan 6 16:45:08 ffm-sbc-2a opensaf_reboot: Rebooting remote node in the absence of PLM with custom handling through 62.115.30.49 Jan 6 16:45:08 ffm-sbc-2a opensaf_reboot: Rebooting peer node... Jan 6 16:45:08 ffm-sbc-2a opensaf_reboot: Rebooted peer node! can someone help me on it. Please tell me why this issue occured. Thanks ~Sumit Gemini ________________________________ Sent from sourceforge.net because you indicated interest in https://sourceforge.net/p/tipc/bugs/122/ To unsubscribe from further messages, please visit https://sourceforge.net/auth/subscriptions/ ------------------------------------------------------------------------------ Check out the vibrant tech community on one of the world's most engaging tech sites, SlashDot.org! http://sdm.link/slashdot _______________________________________________ tipc-discussion mailing list tipc-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/tipc-discussion