Regarding the rcu stall backtrace, I don't think the deadlock was caused by the
TIPC module. Instead, I think the link between the active machine and the
standby machine was lost because the standby machine was dead due to the rcu
stall. Therefore, the key is to find the root cause of the rcu stall itself.

However, the following rcu stall backtrace doesn't provide very meaningful
information: the CPUs shown are either idle or sending the NMI itself. So I
don't understand why the rcu stall occurred.
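
One way to narrow this down might be to check where the kernel timestamps in
the standby log jump. Here is a minimal Python sketch, assuming syslog lines in
the format quoted below; the function names are hypothetical and only for
illustration:

```python
import re

# Matches the kernel monotonic timestamp, e.g. "kernel: [3167216.520060]".
KERNEL_TS = re.compile(r"kernel: \[\s*(\d+\.\d+)\]")


def kernel_timestamps(lines):
    """Yield the kernel timestamp (seconds) of each line that carries one."""
    for line in lines:
        match = KERNEL_TS.search(line)
        if match:
            yield float(match.group(1))


def find_gaps(lines, threshold=1.0):
    """Return (previous, current, gap) wherever consecutive kernel
    timestamps differ by more than `threshold` seconds."""
    gaps = []
    previous = None
    for ts in kernel_timestamps(lines):
        if previous is not None and ts - previous > threshold:
            gaps.append((previous, ts, ts - previous))
        previous = ts
    return gaps


# Three lines taken from the standby log quoted below (first physical line only).
sample = [
    "Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.520060] INFO: rcu_bh_state detected",
    "Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] Stack:",
    "Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.685484] TIPC: Resetting link",
]

for previous, current, gap in find_gaps(sample):
    print(f"{previous:.6f} -> {current:.6f}: {gap:.3f} s")
```

On these three lines it reports one jump of roughly 8.85 seconds, between the
last NMI backtrace line for CPU 31 and the TIPC "Resetting link" message. Lines
without a kernel timestamp (the wrapped continuation lines) are simply skipped.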

Regards,
Ying

From: Sumit Gemini [mailto:sumitgem...@users.sf.net]
Sent: Friday, February 17, 2017 4:51 PM
To: Ticket 122
Subject: [tipc:bugs] #122 TIPC link down

________________________________

[bugs:#122]<https://sourceforge.net/p/tipc/bugs/122/> TIPC link down

Status: open
Group:
Labels: tipc rcu_bh_state
Created: Fri Feb 17, 2017 08:50 AM UTC by Sumit Gemini
Last Updated: Fri Feb 17, 2017 08:50 AM UTC
Owner: Erik Hugne

Hi All,

I have an HA pair, and I observed that the TIPC link lost event was not
received by the standby machine. I got the following problem.

On ACTIVE machine:

Jan 6 16:45:00 ffm-sbc-2b kernel: [3341017.308014] TIPC: Resetting link 
<1.1.2:bond0-1.1.1:bond0>, peer not responding
Jan 6 16:45:00 ffm-sbc-2b kernel: [3341017.308021] TIPC: Lost link 
<1.1.2:bond0-1.1.1:bond0> on network plane A
Jan 6 16:45:00 ffm-sbc-2b kernel: [3341017.308026] TIPC: Lost contact with 
<1.1.1>
Jan 6 16:45:01 ffm-sbc-2b osaffmd[4898]: NO Node Down event for node id 2010f:
Jan 6 16:45:01 ffm-sbc-2b osaffmd[4898]: NO Done Locking applications on node 
id:2010f ret val:0
Jan 6 16:45:01 ffm-sbc-2b osafclmd[4963]: NO Node 131343 went down. Not sending 
track callback for agents on that node
Jan 6 16:45:01 ffm-sbc-2b osafclmd: Last message 'NO Node 131343 went ' 
repeated 5 times, suppressed by syslog-ng on ffm-sbc-2b.mydomain.com
Jan 6 16:45:01 ffm-sbc-2b osaffmd[4898]: NO Current role: ACTIVE
Jan 6 16:45:01 ffm-sbc-2b osaffmd[4898]: Rebooting OpenSAF NodeId = 131343 EE 
Name = , Reason: Received Node Down for peer controller, OwnNodeId = 131599, 
SupervisionTime = 60
Jan 6 16:45:01 ffm-sbc-2b osafamfd[4986]: NO Node 'SC-1' left the cluster
Jan 6 16:45:01 ffm-sbc-2b osafimmd[4910]: WA IMMD lost contact with peer IMMD 
(NCSMDS_RED_DOWN)
Jan 6 16:45:01 ffm-sbc-2b osafimmnd[4922]: NO Global discard node received for 
nodeId:2010f pid:5047
Jan 6 16:45:01 ffm-sbc-2b osafimmnd[4922]: NO Implementer disconnected 121 <0, 
2010f(down)> (MsgQueueService131343)
Jan 6 16:45:01 ffm-sbc-2b osafimmnd[4922]: NO Implementer disconnected 120 <0, 
2010f(down)> (@safAmfService2010f)
Jan 6 16:45:01 ffm-sbc-2b osafimmnd[4922]: NO Implementer connected: 122 
(MsgQueueService131343) <592, 2020f>
Jan 6 16:45:01 ffm-sbc-2b osafimmnd[4922]: NO Implementer locally disconnected. 
Marking it as doomed 122 <592, 2020f> (MsgQueueService131343)
Jan 6 16:45:01 ffm-sbc-2b osafimmnd[4922]: NO Implementer disconnected 122 
<592, 2020f> (MsgQueueService131343)
Jan 6 16:45:01 ffm-sbc-2b opensaf_reboot: No lock is in progress going to 
process further...
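
The node ids in these messages appear both in hex (2010f, 2020f) and in decimal
(131343, 131599). A small Python sketch, using a hypothetical helper name, to
confirm that both forms refer to the same two nodes:

```python
def node_id_forms(node_id_hex):
    """Return (hex string, decimal value) for a node id logged in hex."""
    value = int(node_id_hex, 16)
    return f"{value:x}", value


for logged in ("2010f", "2020f"):
    hex_form, dec_form = node_id_forms(logged)
    print(hex_form, dec_form)
```

This prints 2010f 131343 and 2020f 131599, matching the NodeId and OwnNodeId
values in the osaffmd messages above.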

On standby machine :

I observed an rcu_bh_state stall and a kernel stack dump when the TIPC link
loss occurred on the ACTIVE machine, and after 6 sec we got the link lost
message on the standby machine.

Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.520060] INFO: rcu_bh_state detected 
stall on CPU 0 (t=0 jiffies)
Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] sending NMI to all CPUs:
Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] NMI backtrace for cpu 0
Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] CPU 0
Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] Modules linked in: 
nf_conntrack_netlink af_packet xt_sharedlimit xt_hashlimit ip_set_hash_ipport 
ip_set_hash_ipportip xt_NOTRACK ip_set_bitmap_port xt_sctp nf_conntrack_ipv6 
nf_defrag_ipv6 xt_CT arpt_mangle ip_set_hash_ipnet xt_NFLOG nfnetlink_log 
ipt_ULOG xt_limit xt_hashcounter ip_set_hash_ipip xt_set ip_set_hash_ip deflate 
zlib_deflate ctr twofish_x86_64 twofish_common camellia serpent blowfish cast5 
des_generic cbc xcbc rmd160 sha512_generic sha256_generic sha1_generic md5 
crypto_null af_key iptable_mangle ip_set nfnetlink arptable_filter arp_tables 
iptable_raw iptable_nat tipc xt_tcpudp xt_state xt_pkttype bonding binfmt_misc 
iptable_filter ip6table_filter ip6_tables nf_nat_ftp nf_nat nf_conntrack_ftp 
nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables x_tables mperf edd ipmi_devintf 
ipmi_si ipmi_msghandler nf_conntrack_proto_sctp nf_conntrack sctp 8021q garp 
stp llc gb_sys usb_storage ioatdma ixgbe uas sg igb iTCO_wdt wmi i2c_i801 
pcspkr mdio iTCO_vendor_support button container dca ipv6 autofs4 usbhid megasr(P) 
ehci_hcd usbcore processor thermal_sys [last unloaded: ipt_PORTMAP]
Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042]
Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] Pid: 0, comm: swapper 
Tainted: P 3.1.10-gb17-default #1 Intel Corporation S2600CO/S2600CO
Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] RIP: 
0010:[<ffffffff81007f51>] [<ffffffff81007f51>] native_read_tsc+0x2/0xf
Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] RSP: 0018:ffff88043ee03db0 
EFLAGS: 00000803
Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] RAX: 0000000037185395 RBX: 
00000000000003e9 RCX: 0000000000000001
Jan 6 16:45:07 ffm-sbc-2a osafimmd[5035]: WA IMMND DOWN on active controller f2 
detected at standby immd!! f1. Possible failover
Jan 6 16:45:07 ffm-sbc-2a osaffmd[5023]: NO Done Locking applications on node 
id:2020f ret val:0
Jan 6 16:45:07 ffm-sbc-2a opensaf_recovery: Control interface status:0 
Role:STANDBY
Jan 6 16:45:07 ffm-sbc-2a osaffmd[5023]: NO Current role: STANDBY
Jan 6 16:45:07 ffm-sbc-2a osaffmd[5023]: Rebooting OpenSAF NodeId = 131599 EE 
Name = , Reason: Received Node Down for peer controller, OwnNodeId = 131343, 
SupervisionTime = 60
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] RDX: 0000000000bf0977 RSI: 
0000000000000002 RDI: 0000000000032bd4
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] RBP: 0000000000032bd4 R08: 
0000000000000000 R09: ffffffff819232b0
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] R10: 7fffffffffffffff R11: 
7fffffffffffffff R12: 0000000000000000
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] R13: ffffffff819232b0 R14: 
0000000000000001 R15: ffffffff81065c28
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] FS: 0000000000000000(0000) 
GS:ffff88043ee00000(0000) knlGS:0000000000000000
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] CS: 0010 DS: 0000 ES: 0000 
CR0: 000000008005003b
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] CR2: 000000000069e034 CR3: 
0000000001805000 CR4: 00000000000406f0
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] DR0: 0000000000000000 DR1: 
0000000000000000 DR2: 0000000000000000
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] DR3: 0000000000000000 DR6: 
00000000ffff0ff0 DR7: 0000000000000400
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] Process swapper (pid: 0, 
threadinfo ffffffff81800000, task ffffffff8180d020)
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] Stack:
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] ffffffff81200eb5 
ffffffff81200f44 00000000000003e9 0000000000001000
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] 0000000000000002 
ffffffff819232b0 ffffffff81017698 7fffffffffffffff
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] 0000000000000002 
0000000000000002 ffffffff81017fdf 0000000000000001
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] Call Trace:
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81200eb5>] 
paravirt_read_tsc+0x5/0x8
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81200f44>] 
delay_tsc+0x1d/0x68
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81017698>] 
native_safe_apic_wait_icr_idle+0x27/0x32
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81017fdf>] 
default_send_IPI_dest_field.constprop.0+0x19/0x4d
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042]
[<ffffffff8101804b>] default_send_IPI_mask_sequence_phys+0x38/0x6a
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042]
[<ffffffff8101815e>] arch_trigger_all_cpu_backtrace+0x4d/0x7b
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042]
[<ffffffff8109801b>] check_cpu_stall+0x66/0xdb
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042]
[<ffffffff810980aa>] rcu_pending+0x1a/0x10a
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8109852c>] 
rcu_check_callbacks+0x9d/0xae
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8104c56c>] 
update_process_times+0x31/0x63
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81065c92>] 
tick_sched_timer+0x6a/0x90
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8105b872>] 
__run_hrtimer+0xa4/0x148
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8105c08e>] 
hrtimer_interrupt+0xdb/0x19a
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81017768>] 
smp_apic_timer_interrupt+0x6e/0x80
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff813efcde>] 
apic_timer_interrupt+0x6e/0x80
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81239655>] 
intel_idle+0xdd/0x104
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81304773>] 
cpuidle_idle_call+0xda/0x169
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81001200>] 
cpu_idle+0x51/0x95
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8193db0f>] 
start_kernel+0x388/0x393
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8193d3af>] 
x86_64_start_kernel+0xcf/0xdc
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] Code: 74 03 e6 80 c3 e6 ed 
c3 bf 8e 21 00 00 e9 ba 8f 1f 00 c3 90 90 90 40 88 f8 e6 70 e4 71 c3 40 88 f0 
e6 70 40 88 f8 e6 71 c3 0f 31
Jan 6 16:45:07 ffm-sbc-2a kernel[3167216.524042]: c1 48 89 d0 48 c1 e0 20 48 09 
c8 c3 41 56 41 55 41 54 55 53

.
.
.
.
.
.

Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] NMI backtrace for cpu 31
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] CPU 31
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] Modules linked in: 
nf_conntrack_netlink af_packet xt_sharedlimit xt_hashlimit ip_set_hash_ipport 
ip_set_hash_ipportip xt_NOTRACK ip_set_bitmap_port xt_sctp nf_conntrack_ipv6 
nf_defrag_ipv6 xt_CT arpt_mangle ip_set_hash_ipnet xt_NFLOG nfnetlink_log 
ipt_ULOG xt_limit xt_hashcounter ip_set_hash_ipip xt_set ip_set_hash_ip deflate 
zlib_deflate ctr twofish_x86_64 twofish_common camellia serpent blowfish cast5 
des_generic cbc xcbc rmd160 sha512_generic sha256_generic sha1_generic md5 
crypto_null af_key iptable_mangle ip_set nfnetlink arptable_filter arp_tables 
iptable_raw iptable_nat tipc xt_tcpudp xt_state xt_pkttype bonding binfmt_misc 
iptable_filter ip6table_filter ip6_tables nf_nat_ftp nf_nat nf_conntrack_ftp 
nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables x_tables mperf edd ipmi_devintf 
ipmi_si ipmi_msghandler nf_conntrack_proto_sctp nf_conntrack sctp 8021q garp 
stp llc gb_sys usb_storage ioatdma ixgbe uas sg igb iTCO_wdt wmi i2c_i801 
pcspkr mdio iTCO_vendor_support button container dca ipv6 autofs4 usbhid megasr(P) 
ehci_hcd usbcore processor thermal_sys [last unloaded: ipt_PORTMAP]
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734]
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] Pid: 0, comm: kworker/0:1 
Tainted: P 3.1.10-gb17-default #1 Intel Corporation S2600CO/S2600CO
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] RIP: 
0010:[<ffffffff81239624>] [<ffffffff81239624>] intel_idle+0xac/0x104
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] RSP: 0018:ffff880425e33eb8 
EFLAGS: 00000046
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] RAX: 0000000000000030 RBX: 
0000000000000010 RCX: 0000000000000001
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] RDX: 0000000000000000 RSI: 
ffff880425e33fd8 RDI: ffffffff81810500
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] RBP: 0000000000000030 R08: 
000000000000006d R09: 0000000000034b10
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] R10: ffff88083eded830 R11: 
ffff88083eded830 R12: 149739342cb2ca49
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] R13: 0000000000000004 R14: 
000000000000001f R15: 0000000000000000
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] FS: 0000000000000000(0000) 
GS:ffff88083ede0000(0000) knlGS:0000000000000000
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] CS: 0010 DS: 0000 ES: 0000 
CR0: 000000008005003b
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] CR2: 00007fe89df67120 CR3: 
0000000001805000 CR4: 00000000000406e0
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] DR0: 0000000000000000 DR1: 
0000000000000000 DR2: 0000000000000000
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] DR3: 0000000000000000 DR6: 
00000000ffff0ff0 DR7: 0000000000000400
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] Process kworker/0:1 (pid: 0, 
threadinfo ffff880425e32000, task ffff880425e30580)
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] Stack:
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.685484] TIPC: Resetting link 
<1.1.1:bond0-1.1.2:bond0>, requested by peer
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.685487] TIPC: Lost link 
<1.1.1:bond0-1.1.2:bond0> on network plane A
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.685491] TIPC: Lost contact with 
<1.1.2>
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] 0000000000000000 
000000000cdd3a47 0000000000000000 000000000cdd3a47
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] ffff880425e33fd8 
0000001f3edf8970 ffff88083edf8970 ffff88083edf8b00
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] 0000000000000000 
ffffffff81304773 ffffffff819232b0 ffff880425e33fd8
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] Call Trace:
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] [<ffffffff81304773>] 
cpuidle_idle_call+0xda/0x169
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] [<ffffffff81001200>] 
cpu_idle+0x51/0x95
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] Code: 28 e0 ff ff 80 e2 08 
75 22 31 d2 48 83 c0 10 48 89 d1 0f 01 c8 0f ae f0 48 8b 86 38 e0 ff ff a8 08 
75 08 b1 01 48 89 e8 0f 01 c9 <e8> 3f 6e e2 ff 4c 29 e0 48 89 c7 e8 10 ae e0 ff 
48 69 e8 40 42
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] Call Trace:
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] [<ffffffff81304773>] 
cpuidle_idle_call+0xda/0x169
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] [<ffffffff81001200>] 
cpu_idle+0x51/0x95
Jan 6 16:45:08 ffm-sbc-2a opensaf_reboot: Rebooting remote node in the absence 
of PLM with custom handling through 62.115.30.49
Jan 6 16:45:08 ffm-sbc-2a opensaf_reboot: Rebooting peer node...
Jan 6 16:45:08 ffm-sbc-2a opensaf_reboot: Rebooted peer node!

Can someone help me with this? Please tell me why this issue occurred.

Thanks
~Sumit Gemini
