Re: [tipc-discussion] tipc_sk_rcv: Kernel panic on one of the card on 4.4.0
Hi Guna,

Please see my comments below.

Regards,
Ying

-----Original Message-----
From: GUNA [mailto:gbala...@gmail.com]
Sent: June 1, 2016 23:26
To: Xue, Ying
Cc: Jon Maloy; Jon Maloy; tipc-discussion@lists.sourceforge.net; Erik Hugne; Xue Ying (ying.x...@gmail.com)
Subject: Re: [tipc-discussion] tipc_sk_rcv: Kernel panic on one of the card on 4.4.0

If the issue is reproducible then I could try Erik's patch even though the root cause is unknown. Currently we are still not clear on the root cause of the issue. I can add the patch to a production release only if the patch actually fixes the issue.

[Ying] Understood. I think Erik's patch should be merged upstream whether or not it fixes your issue. But in my opinion, it should be able to fix it.

Otherwise, I may need to find a test stream.

As per Erik's proposal:
==
if (sock_owned_by_user(sk)) we can reschedule the timer for a retry in a few jiffies
==

I tried to call

    sk_reset_timer(sk, &sk->sk_timer, (HZ / 20));

but that code, or something similar, is already in place in tipc_sk_timeout(), as marked by "<<==" below:

    if (tsk->probing_state == TIPC_CONN_PROBING) {
        if (!sock_owned_by_user(sk))
            ...
        else
            sk_reset_timer(sk, &sk->sk_timer, (HZ / 20));                    <<==
    } else {
        sk_reset_timer(sk, &sk->sk_timer, jiffies + tsk->probing_intv);      <<==
    }

Please let me know if I need to add/modify anything.

[Ying] Your change above is right and it should be workable. But I still suggest you adopt Erik's patch ("tipc: fix timer handling when socket is owned"), as it's a much better solution than the above.

thanks,
Guna

On Wed, Jun 1, 2016 at 7:31 AM, Xue, Ying wrote:
> Hi GUNA,
>
> Thanks for your confirmation, which is very important for us to look into
> what happened in the 4.4.0 version.
> Yes, the Erik patch I mentioned is just as Erik said: "tipc: fix timer
> handling when socket is owned".
>
> I also agree with Erik's solution, as its change is the more common method of
> dealing with the case where the owner flag is not set in BH.
> > But now we still need to know what root cause is the issue. > > If possible, please apply Erik's patch on your side to check whether the > issue occurs or not. > > Regards, > Ying > > -Original Message- > From: GUNA [mailto:gbala...@gmail.com] > Sent: 2016年5月31日 23:34 > To: Xue, Ying > Cc: Jon Maloy; Jon Maloy; tipc-discussion@lists.sourceforge.net; Erik Hugne; > Xue Ying (ying.x...@gmail.com) > Subject: Re: [tipc-discussion] tipc_sk_rcv: Kernel panic on one of the card > on 4.4.0 > > Just want to clarify, system was upgraded only the kernel from 3.4.2 > to 4.4.0 + some tipc patches on Fedora distribution. That said, the > patch, "net: do not block BH while processing socket backlog" is not > part of the 4.4.0. So, the issue is not due to this commit. > > If the patch, "tipc: block BH in TCP callbacks" could resolve the > issue then I could try applying the patch. However the issue is not > reproducible. So, we may not get the result right away. > > Which Erik's patch you are talking about? > Is this one, "tipc: fix timer handling when socket is owned" ? > > > /// Guna > > On Tue, May 31, 2016 at 3:49 AM, Xue, Ying wrote: >> Hi Jon, >> >> Today, I spent time further analyzing its potential root cause why the soft >> lockup occurred regarding log provided by GUNA. But I don't find some >> valuable hints. >> >> To be honest, even if CONN_MANAGER/CONN_PROBE message is sent through >> tipc_node_xmit_skb() without holding "owner" flag in tipc_sk_timeout(), >> deadlock should not happen in theory. Before the tipc_sk_rcv() is secondly >> called, destination port and source port of CONN_MANAGER/CONN_PROBE message >> created in tipc_sk_timeout() have been reversed. As a result, the tsk found >> at (*1) is different with another tsk found at (*2) because we use different >> destination number to look up tsk instances. 
>>
>> tipc_sk_timeout()
>>   create: CONN_MANAGER/CONN_PROBE msg (src port = tsk->portid, dst port = peer_port)
>>   tipc_node_xmit_skb()
>>     tipc_node_xmit()
>>       tipc_sk_rcv()
>>         tsk = tipc_sk_lookup(net, dport); // use dst port (peer_port) to look up tsk; call it tsk1 --(*1)
>>         if (likely(spin_trylock_bh(&sk->sk_lock.slock)))
>>           tipc_sk_enqueue()
>>             filter_rcv()
>>               tipc_sk_proto_rcv()
>>                 tipc_sk_respond()
>>                   reverse ports: dport = tsk->portid; src port = peer_port
>>                   tipc_node_xmit_skb()
>>                     tipc_node_xmit()
>>                       tipc_sk_rcv()
>>                         tsk = tipc_sk_lookup(net, dport); // use dst port (portid) to look up tsk; this one is tsk2 --(*2)
>>                         if (likely(spin_trylock_bh(&sk->sk_lock.slock)))
>>
>> So even if the "owner" flag of tsk1 is not set, it's safe for us to operate
>> tsk2 in the same BH context.
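The resched-if-owned fallback discussed in this exchange can be modeled outside the kernel. The sketch below is an illustrative user-space simulation, not the real tipc_sk_timeout(): `sim_sock`, `on_timer`, and the `HZ` value are invented stand-ins, and only the decision logic mirrors the `sk_reset_timer(sk, &sk->sk_timer, (HZ / 20))` snippet quoted in the thread.

```c
#include <assert.h>
#include <stdbool.h>

#define HZ 1000  /* illustrative tick rate, not the target kernel's */

/* Invented stand-in for the relevant struct sock state. */
struct sim_sock {
    bool owned_by_user;   /* set by a lock_sock() holder in process context */
};

enum timer_action { SEND_PROBE, RESCHED };

/* Decision the timer callback makes when it fires in softirq: if user
 * context owns the socket, do not touch socket state from BH; just
 * re-arm the timer for a retry a few jiffies later. */
static enum timer_action on_timer(const struct sim_sock *sk,
                                  unsigned long *retry_delay)
{
    if (sk->owned_by_user) {
        *retry_delay = HZ / 20;   /* "retry in a few jiffies" */
        return RESCHED;
    }
    return SEND_PROBE;
}
```

When RESCHED is returned, the real code would re-arm the timer with sk_reset_timer(); the point is only that the softirq side backs off instead of spinning on slock.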
Re: [tipc-discussion] tipc_sk_rcv: Kernel panic on one of the card on 4.4.0
On May 31, 2016 6:12 PM, "GUNA" wrote:
>
> Could you provide me the exact code change for rescheduling, so I
> don't want to make any mistake.
>

Nope, I'm travelling now. But if you want to try the resched-timer-if-owned hack, use:

    sk_reset_timer(sk, &sk->sk_timer, (HZ / 20));

at the appropriate place.

> Also, could I still apply the patch, "tipc: block BH in TCP callbacks" ?
>

No idea what this patch does.

//E

--
What NetFlow Analyzer can do for you? Monitors network bandwidth and traffic patterns at an interface-level. Reveals which users, apps, and protocols are consuming the most bandwidth. Provides multi-vendor support for NetFlow, J-Flow, sFlow and other flows. Make informed decisions using capacity planning reports. https://ad.doubleclick.net/ddm/clk/305295220;132659582;e
___
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion
Re: [tipc-discussion] tipc_sk_rcv: Kernel panic on one of the card on 4.4.0
On May 31, 2016 17:34, "GUNA" wrote:
>
> Which Erik's patch you are talking about?
> Is this one, "tipc: fix timer handling when socket is owned" ?

I think he was referring to my earlier suggestion to reschedule the timer if the socket is owned by user when it fires. The patch I sent yesterday tries to solve it slightly differently: instead of rescheduling the timer, we set a flag and act on that when the sock is released by user.

//E
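The flag-and-act-on-release approach Erik describes can be contrasted with the timer-reschedule hack in a small user-space model. This is an illustrative sketch only: `probe_sock`, `timer_cb`, and `release_model` are invented names, not the identifiers in the actual patch; only the ordering (defer while owned, act on release) follows his description.

```c
#include <assert.h>
#include <stdbool.h>

/* Invented stand-in for the socket state the patch description implies. */
struct probe_sock {
    bool owned_by_user;   /* user context holds the socket */
    bool probe_deferred;  /* timer fired while the socket was owned */
    int  probes_sent;     /* count of probe messages "sent" */
};

static void send_probe(struct probe_sock *sk)
{
    sk->probes_sent++;
}

/* Timer callback: runs in "softirq" and must not act on socket state
 * while a user-context caller owns it, so it only records a flag. */
static void timer_cb(struct probe_sock *sk)
{
    if (sk->owned_by_user)
        sk->probe_deferred = true;   /* act later, on release */
    else
        send_probe(sk);
}

/* release_sock() analogue: the owner drops the socket and runs any
 * work that was deferred while it held it. */
static void release_model(struct probe_sock *sk)
{
    sk->owned_by_user = false;
    if (sk->probe_deferred) {
        sk->probe_deferred = false;
        send_probe(sk);
    }
}
```

Compared with rescheduling, this variant never lets the timer spin or re-arm just to wait out the owner; the deferred work piggybacks on the release path that already exists.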
Re: [tipc-discussion] tipc_sk_rcv: Kernel panic on one of the card on 4.4.0
Just want to clarify: the system was upgraded only in the kernel, from 3.4.2 to 4.4.0 plus some tipc patches, on a Fedora distribution. That said, the patch "net: do not block BH while processing socket backlog" is not part of 4.4.0, so the issue is not due to that commit.

If the patch "tipc: block BH in TCP callbacks" could resolve the issue then I could try applying it. However, the issue is not reproducible, so we may not get the result right away.

Which Erik's patch are you talking about?
Is this one, "tipc: fix timer handling when socket is owned" ?

/// Guna

On Tue, May 31, 2016 at 3:49 AM, Xue, Ying wrote:
> Hi Jon,
>
> Today I spent time further analyzing the potential root cause of the soft
> lockup, based on the log provided by GUNA, but I didn't find any valuable
> hints.
>
> To be honest, even if the CONN_MANAGER/CONN_PROBE message is sent through
> tipc_node_xmit_skb() without holding the "owner" flag in tipc_sk_timeout(),
> deadlock should not happen in theory. Before tipc_sk_rcv() is called the
> second time, the destination and source ports of the CONN_MANAGER/CONN_PROBE
> message created in tipc_sk_timeout() have been reversed. As a result, the
> tsk found at (*1) is different from the tsk found at (*2), because we use
> different destination numbers to look up the tsk instances.
>
> tipc_sk_timeout()
>   create: CONN_MANAGER/CONN_PROBE msg (src port = tsk->portid, dst port = peer_port)
>   tipc_node_xmit_skb()
>     tipc_node_xmit()
>       tipc_sk_rcv()
>         tsk = tipc_sk_lookup(net, dport); // use dst port (peer_port) to look up tsk; call it tsk1 --(*1)
>         if (likely(spin_trylock_bh(&sk->sk_lock.slock)))
>           tipc_sk_enqueue()
>             filter_rcv()
>               tipc_sk_proto_rcv()
>                 tipc_sk_respond()
>                   reverse ports: dport = tsk->portid; src port = peer_port
>                   tipc_node_xmit_skb()
>                     tipc_node_xmit()
>                       tipc_sk_rcv()
>                         tsk = tipc_sk_lookup(net, dport); // use dst port (portid) to look up tsk; this one is tsk2 --(*2)
>                         if (likely(spin_trylock_bh(&sk->sk_lock.slock)))
>
> So even if the "owner" flag of tsk1 is not set, it's safe for us to operate
> tsk2 in the same BH context.
>
> I also agree with you. Although Erik's patch might solve the issue, we still
> need to further find its root cause.
>
> Additionally, I suspect there is a certain relationship between the issue and
> 5413d1babe8f10de13d72496c12b862eef8ba613 (net: do not block BH while
> processing socket backlog).
> I don't know whether it's easy to reproduce the issue, but I suggest we
> revert the above commit or apply the following patch to verify whether the
> issue is related to it.
> > http://www.spinics.net/lists/netdev/msg378109.html > > Regards, > Ying > > -Original Message- > From: Jon Maloy [mailto:jon.ma...@ericsson.com] > Sent: 2016年5月30日 22:43 > To: Xue, Ying; GUNA; Jon Maloy; tipc-discussion@lists.sourceforge.net; Erik > Hugne; Xue Ying (ying.x...@gmail.com) > Subject: RE: [tipc-discussion] tipc_sk_rcv: Kernel panic on one of the card > on 4.4.0 > > > >> -Original Message- >> From: Xue, Ying [mailto:ying@windriver.com] >> Sent: Monday, 30 May, 2016 14:15 >> To: Jon Maloy; GUNA; Jon Maloy; tipc-discussion@lists.sourceforge.net; Erik >> Hugne; Xue Ying (ying.x...@gmail.com) >> Subject: RE: [tipc-discussion] tipc_sk_rcv: Kernel panic on one of the card >> on 4.4.0 >> >> Hi Jon, >> > ... > >> In fact we ever proposed several possible solutions of how to well deal with >> the >> concurrent scenario in skb receive path. Especially, we need to stay BH mode >> to >> directly forward back a skb received in BH, which very easily causes the >> problem >> that a skb will be routed back and forth, leading deadlock occurs. > > We stay in BH, but we do *not* keep slock during the send/receive sequence, > so this cannot be the cause of the problem. On the receive path, this should > work just and any other message received in softirq. > > Anyway, I am not sure at all this is a deadlock concerning slock (see earlier > mail), but rather a message arriving to a socket that has been corrupted by > interference between the timer in softirq and tipc_recv_xxx() in user > context. Maybe a state change? > > I think a slightly improved version of what Erik suggested might solve the > issue, but I won't feel comfortable until we understand completely what is > happening. > > ///jon > >> Unfortunately, >> so far it seems that we are still unable to perfectly solve the issue. >> Therefore, >> maybe we have to radically change the current mechanism of receiving skb and >> forward skb on BH mode. 
>> >> Regards, >> Ying >> >> -Original Message- >> From: Jon Maloy [mailto:jon.ma...@ericsson.com] >> Sent: 2016年5月30日 5:32 >> To:
Re: [tipc-discussion] tipc_sk_rcv: Kernel panic on one of the card on 4.4.0
> -----Original Message-----
> From: Xue, Ying [mailto:ying@windriver.com]
> Sent: Monday, 30 May, 2016 14:15
> To: Jon Maloy; GUNA; Jon Maloy; tipc-discussion@lists.sourceforge.net; Erik
> Hugne; Xue Ying (ying.x...@gmail.com)
> Subject: RE: [tipc-discussion] tipc_sk_rcv: Kernel panic on one of the card
> on 4.4.0
>
> Hi Jon,
> ...
> In fact we previously proposed several possible solutions for how to deal
> with the concurrent scenarios in the skb receive path. In particular, we
> need to stay in BH mode to directly forward back an skb received in BH,
> which very easily causes the problem that an skb is routed back and forth,
> leading to deadlock.

We stay in BH, but we do *not* keep slock during the send/receive sequence, so this cannot be the cause of the problem. On the receive path, this should work just as with any other message received in softirq.

Anyway, I am not sure at all that this is a deadlock concerning slock (see earlier mail), but rather a message arriving at a socket that has been corrupted by interference between the timer in softirq and tipc_recv_xxx() in user context. Maybe a state change?

I think a slightly improved version of what Erik suggested might solve the issue, but I won't feel comfortable until we understand completely what is happening.

///jon

> Unfortunately, so far it seems we are still unable to solve the issue
> perfectly. Therefore we may have to radically change the current mechanism
> for receiving and forwarding skbs in BH mode.
>
> Regards,
> Ying
>
> -----Original Message-----
> From: Jon Maloy [mailto:jon.ma...@ericsson.com]
> Sent: May 30, 2016 5:32
> To: GUNA; Jon Maloy; tipc-discussion@lists.sourceforge.net; Erik Hugne; Xue,
> Ying; Xue Ying (ying.x...@gmail.com)
> Subject: RE: [tipc-discussion] tipc_sk_rcv: Kernel panic on one of the card
> on 4.4.0
>
> Hi Guna,
> I am looking at it, but don't have much time to spend on it right now.
> A further study of your dump makes me believe this is a case of race between > tipc_recv_stream(), which in user context is setting the "owned" flag but not > grabbing slock, and tipc_sk_timeout(), which in interrupt mode is grabbing > slock > without checking the "owned" flag. This creates an obvious risk of a race, > but I > don't yet understand the exact implications or solution to it. It may be that > it is > sufficient to just replace the grabbing of slock with a call to lock_sock() > in the > timer function as I first suggested, but I rather believe we have to > explicitly grab > both slock and the flag and keep them for the duration of the timeout, before > tipc_node_xmit_skb(). This will protect the timeout against interference from > both user context (tipc_recv_stream() and other sw interrups (tipc_sk_rcv()) > and > vice versa. Or maybe not even this is enough? > > I will continue my analysis, but input from others would be appreciated. > > ///jon > > > > -Original Message- > > From: GUNA [mailto:gbala...@gmail.com] > > Sent: Saturday, 28 May, 2016 06:00 > > To: Jon Maloy; tipc-discussion@lists.sourceforge.net; Erik Hugne > > Subject: Re: [tipc-discussion] tipc_sk_rcv: Kernel panic on one of the card > > on > 4.4.0 > > > > Any update on the issue? Any other thoughts or possible fix ? > > > > The issue was seen on slot12 (1.1.12) node only. The other slots were up. > > > > I got the full logs as listed here: > > > > > > May 19 05:03:01 [SEQ 248049] dcsx5testslot13 /USR/SBIN/CROND[11359]: > > (root) CMD (/opt/cpu_ss7gw/current/scripts/mgmt_apache_watchdog) > > May 19 05:03:21 [SEQ 249182] dcsx5testslot12 kernel: [673637.606852] > > NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [swapper/2:0] > > May 19 05:03:21 [SEQ 249183] dcsx5testslot12 kernel: [673637.607791] > > NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! 
[swapper/3:0] > > May 19 05:03:21 [SEQ 249184] dcsx5testslot12 kernel: [673637.607817] > > Modules linked in: nf_log_ipv4 nf_log_common xt_LOG sctp libcrc32c > > e1000e tip > > c udp_tunnel ip6_udp_tunnel 8021q garp iTCO_wdt xt_physdev > > br_netfilter bridge stp llc nf_conntrack_ipv4 ipmiq_drv(O) sio_mmc(O) > > nf_defrag_ipv4 ip6 > > t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state > > nf_conntrack event_drv(O) ip6table_filter ip6_tables ddi(O) lockd > > pt_timer_info(O > > ) grace ixgbe usb_storage igb pcspkr iTCO_vendor_support i2c_algo_bit > > i2c_i801 ptp i2c_core ioatdma lpc_ich mfd_core tpm_tis dca pps_core > > intel_ips > > tpm mdio sunrpc [last unloaded: iTCO_wdt] > > May 19 05:03:21 [SEQ 249185] dcsx5testslot12 kernel: [673637.607819] > > CPU: 3 PID: 0 Comm: swapper/3 Tainted: G O4.4.0 #17 > > May 19 05:03:21 [SEQ 249186] dcsx5testslot12 kernel: [673637.607820] > > Hardware name: PT AMC124/Base Board Product Name, BIOS > > LGNAJFIP.PTI.0012.P15 > > 01/15/2014 > > May 19 05:03:21 [SEQ 249187] dcsx5testslot12 kernel: [673637.607821] > > task: 880351a88000 ti: 880351a84000 task.ti:
Re: [tipc-discussion] tipc_sk_rcv: Kernel panic on one of the card on 4.4.0
Hi Jon,

First of all, the slock lock is designed very specially and wisely. In process context it is similar to a mutex; by contrast, in interrupt context it acts like a spin lock. Moreover, it can safely protect members of the sock struct in both contexts. When we are in interrupt/softirq mode and the "owned" flag is set, we know that sk_receive_queue is being operated on by a process. At that moment we have to put the skb buffer into the backlog queue. When release_sock() is called by the process that set the "owner" flag, all skb buffers are moved from the backlog queue to sk_receive_queue.

Regarding the scenario of slock protection, there are three different race conditions:

1. Process vs process context, that is, two processes running on different CPUs concurrently access a sock instance. In this case slock is used as a mutex to protect the sock struct. For example, if one process checks the "owned" flag and finds it set, it sleeps until the flag is cleared by the other process.

2. Process vs softirq context. In process context, once the "owned" flag is set, the slock spin lock is released immediately. In softirq context, we first grab slock and then check whether the flag is set. If it is set, we either try again later or put the skb into the backlog queue and exit.

3. Softirq vs softirq context. Slock plays the role of spin lock semantics in this case. Moreover, it's not necessary to explicitly disable BH in this mode, as we already know we are in BH mode, and BH is disabled while a softirq handler runs. This also means that all functions called in BH are non-reentrant.

Given the principles above, firstly we cannot use lock_sock() in tipc_sk_timeout(), as it is called in BH mode. If we did, deadlock would definitely happen, as might_sleep() would be called in BH. Secondly, it's unnecessary to set the "owned" flag in tipc_sk_timeout(), as it runs in BH mode.
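The process-vs-softirq contract described above can be modeled in user space. The sketch below is illustrative only: `model_sock`, `softirq_rcv`, and `release_sock_model` are invented names, and the fixed-size queues are a simplification of the real sk_backlog machinery; it shows only the backlog behaviour being described, namely that a softirq-side receiver which finds the "owned" flag set parks the buffer rather than waiting.

```c
#include <assert.h>
#include <stdbool.h>

#define QLEN 8  /* small fixed capacity; a real backlog is bounded by bytes */

struct queue { int buf[QLEN]; int n; };

/* Invented stand-in for the sock fields the contract involves. */
struct model_sock {
    bool owned;                  /* set by lock_sock() in process context */
    struct queue backlog;        /* filled from softirq while owned */
    struct queue receive_queue;  /* what the reader consumes from */
};

static void push(struct queue *q, int skb)
{
    if (q->n < QLEN)             /* sketch: silently drop on overflow */
        q->buf[q->n++] = skb;
}

/* Softirq-side delivery: it cannot sleep waiting for the owner, so it
 * parks the buffer on the backlog instead of touching the receive queue. */
static void softirq_rcv(struct model_sock *sk, int skb)
{
    if (sk->owned)
        push(&sk->backlog, skb);
    else
        push(&sk->receive_queue, skb);
}

/* release_sock() analogue: drain the backlog into the receive queue
 * and clear the owner flag. */
static void release_sock_model(struct model_sock *sk)
{
    for (int i = 0; i < sk->backlog.n; i++)
        push(&sk->receive_queue, sk->backlog.buf[i]);
    sk->backlog.n = 0;
    sk->owned = false;
}
```

The design point this captures is why lock_sock() is forbidden in BH: the softirq side never blocks on the owner, it only defers delivery.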
In my opinion, the fundamental way to avoid the issue is to ensure that functions called in BH cannot be invoked recursively; otherwise it is hard to definitively prevent the issue from happening again in other, similar scenarios.

In fact we previously proposed several possible solutions for how to deal with the concurrent scenarios in the skb receive path. In particular, we need to stay in BH mode to directly forward back an skb received in BH, which very easily causes the problem that an skb is routed back and forth, leading to deadlock. Unfortunately, so far it seems we are still unable to solve the issue perfectly. Therefore we may have to radically change the current mechanism for receiving and forwarding skbs in BH mode.

Regards,
Ying

-----Original Message-----
From: Jon Maloy [mailto:jon.ma...@ericsson.com]
Sent: May 30, 2016 5:32
To: GUNA; Jon Maloy; tipc-discussion@lists.sourceforge.net; Erik Hugne; Xue, Ying; Xue Ying (ying.x...@gmail.com)
Subject: RE: [tipc-discussion] tipc_sk_rcv: Kernel panic on one of the card on 4.4.0

Hi Guna,

I am looking at it, but don't have much time to spend on it right now. A further study of your dump makes me believe this is a case of a race between tipc_recv_stream(), which in user context sets the "owned" flag without grabbing slock, and tipc_sk_timeout(), which in interrupt mode grabs slock without checking the "owned" flag. This creates an obvious risk of a race, but I don't yet understand its exact implications or its solution. It may be sufficient to just replace the grabbing of slock with a call to lock_sock() in the timer function, as I first suggested, but I rather believe we have to explicitly grab both slock and the flag and keep them for the duration of the timeout, before tipc_node_xmit_skb(). This would protect the timeout against interference from both user context (tipc_recv_stream()) and other software interrupts (tipc_sk_rcv()), and vice versa. Or maybe not even this is enough?
I will continue my analysis, but input from others would be appreciated. ///jon > -Original Message- > From: GUNA [mailto:gbala...@gmail.com] > Sent: Saturday, 28 May, 2016 06:00 > To: Jon Maloy; tipc-discussion@lists.sourceforge.net; Erik Hugne > Subject: Re: [tipc-discussion] tipc_sk_rcv: Kernel panic on one of the card > on 4.4.0 > > Any update on the issue? Any other thoughts or possible fix ? > > The issue was seen on slot12 (1.1.12) node only. The other slots were up. > > I got the full logs as listed here: > > > May 19 05:03:01 [SEQ 248049] dcsx5testslot13 /USR/SBIN/CROND[11359]: > (root) CMD (/opt/cpu_ss7gw/current/scripts/mgmt_apache_watchdog) > May 19 05:03:21 [SEQ 249182] dcsx5testslot12 kernel: [673637.606852] > NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [swapper/2:0] > May 19 05:03:21 [SEQ 249183] dcsx5testslot12 kernel: [673637.607791] > NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [swapper/3:0] > May 19 05:03:21 [SEQ 249184] dcsx5testslot12 kernel: [673637.607817] > Modules linked in: nf_log_ipv4 nf_log_common xt_LOG
Re: [tipc-discussion] tipc_sk_rcv: Kernel panic on one of the card on 4.4.0
On May 29, 2016 23:32, "Jon Maloy"wrote: > > Hi Guna, > I am looking at it, but don't have much time to spend at it right now. > A further study of your dump makes me believe this is a case of race between tipc_recv_stream(), which in user context is setting the "owned" flag but not grabbing slock, and tipc_sk_timeout(), which in interrupt mode is grabbing slock without checking the "owned" flag. we shouldn't spin in the timer cb, but if (sock_owned_by_user(sk)) we can reschedule timer for a retry in a few jiffies > This creates an obvious risk of a race, but I don't yet understand the exact implications or solution to it. It may be that it is sufficient to just replace the grabbing of slock with a call to lock_sock() in the timer function as I first suggested, but I rather believe we have to explicitly grab both slock and the flag and keep them for the duration of the timeout, before tipc_node_xmit_skb(). This will protect the timeout against interference from both user context (tipc_recv_stream() and other sw interrups (tipc_sk_rcv()) and vice versa. Or maybe not even this is enough? > > I will continue my analysis, but input from others would be appreciated. > > ///jon > > > > -Original Message- > > From: GUNA [mailto:gbala...@gmail.com] > > Sent: Saturday, 28 May, 2016 06:00 > > To: Jon Maloy; tipc-discussion@lists.sourceforge.net; Erik Hugne > > Subject: Re: [tipc-discussion] tipc_sk_rcv: Kernel panic on one of the card on 4.4.0 > > > > Any update on the issue? Any other thoughts or possible fix ? > > > > The issue was seen on slot12 (1.1.12) node only. The other slots were up. > > > > I got the full logs as listed here: > > > > > > May 19 05:03:01 [SEQ 248049] dcsx5testslot13 /USR/SBIN/CROND[11359]: > > (root) CMD (/opt/cpu_ss7gw/current/scripts/mgmt_apache_watchdog) > > May 19 05:03:21 [SEQ 249182] dcsx5testslot12 kernel: [673637.606852] > > NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! 
[swapper/2:0] > > May 19 05:03:21 [SEQ 249183] dcsx5testslot12 kernel: [673637.607791] > > NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [swapper/3:0] > > May 19 05:03:21 [SEQ 249184] dcsx5testslot12 kernel: [673637.607817] > > Modules linked in: nf_log_ipv4 nf_log_common xt_LOG sctp libcrc32c > > e1000e tip > > c udp_tunnel ip6_udp_tunnel 8021q garp iTCO_wdt xt_physdev > > br_netfilter bridge stp llc nf_conntrack_ipv4 ipmiq_drv(O) sio_mmc(O) > > nf_defrag_ipv4 ip6 > > t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state > > nf_conntrack event_drv(O) ip6table_filter ip6_tables ddi(O) lockd > > pt_timer_info(O > > ) grace ixgbe usb_storage igb pcspkr iTCO_vendor_support i2c_algo_bit > > i2c_i801 ptp i2c_core ioatdma lpc_ich mfd_core tpm_tis dca pps_core > > intel_ips > > tpm mdio sunrpc [last unloaded: iTCO_wdt] > > May 19 05:03:21 [SEQ 249185] dcsx5testslot12 kernel: [673637.607819] > > CPU: 3 PID: 0 Comm: swapper/3 Tainted: G O4.4.0 #17 > > May 19 05:03:21 [SEQ 249186] dcsx5testslot12 kernel: [673637.607820] > > Hardware name: PT AMC124/Base Board Product Name, BIOS > > LGNAJFIP.PTI.0012.P15 > > 01/15/2014 > > May 19 05:03:21 [SEQ 249187] dcsx5testslot12 kernel: [673637.607821] > > task: 880351a88000 ti: 880351a84000 task.ti: 880351a84000 > > May 19 05:03:21 [SEQ 249188] dcsx5testslot12 kernel: [673637.607827] > > RIP: 0010:[] [] > > rhashtable_jhash2+0x0/0x > > f0 > > May 19 05:03:21 [SEQ 249189] dcsx5testslot12 kernel: [673637.607828] > > RSP: 0018:88035fc63a60 EFLAGS: 0206 > > May 19 05:03:21 [SEQ 249190] dcsx5testslot12 kernel: [673637.607829] > > RAX: 880351613a80 RBX: 880347948000 RCX: 880347949060 > > May 19 05:03:21 [SEQ 249191] dcsx5testslot12 kernel: [673637.607830] > > RDX: 03ff6972 RSI: 0001 RDI: 88035fc63a84 > > May 19 05:03:21 [SEQ 249192] dcsx5testslot12 kernel: [673637.607830] > > RBP: 88035fc63ab8 R08: 0001 R09: 0004 > > May 19 05:03:21 [SEQ 249193] dcsx5testslot12 kernel: [673637.607831] > > R10: R11: R12: 88035fc63bd0 > > 
May 19 05:03:21 [SEQ 249194] dcsx5testslot12 kernel: [673637.607832] > > R13: 88009a4dc000 R14: 8803502f5640 R15: 88035fc63be4 > > May 19 05:03:21 [SEQ 249195] dcsx5testslot12 kernel: [673637.607833] > > FS: () GS:88035fc6() > > knlGS: > > May 19 05:03:21 [SEQ 249196] dcsx5testslot12 kernel: [673637.607834] > > CS: 0010 DS: ES: CR0: 8005003b > > May 19 05:03:21 [SEQ 249197] dcsx5testslot12 kernel: [673637.607835] > > CR2: 7f6e9c244000 CR3: 01c0a000 CR4: 06e0 > > May 19 05:03:21 [SEQ 249198] dcsx5testslot12 kernel: [673637.607835] Stack: > > :6600 > > May 19 05:04:45 [SEQ 252476] dcsx5testslot12 kernel: [673721.608818] > > NMI watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [swapper/3:0] > > May 19 05:04:45 [SEQ 252477] dcsx5testslot12 kernel:
Re: [tipc-discussion] tipc_sk_rcv: Kernel panic on one of the card on 4.4.0
On 05/24/2016 12:16 PM, GUNA wrote:
> I suspect a glitch on the switch may have caused the loss of the probe or
> abort message. However, even if the messages are lost for whatever reason,
> shouldn't the TIPC stack handle a graceful shutdown of the TIPC connection
> by releasing all the resources, instead of panicking or hanging?

I don't think this is about lost packets or shutdown at all, as far as I can see from the dump. This is a node-local socket-to-socket call that bypasses any interfaces and switches.

> Does lock_sock/release_sock use in tipc_sk_timeout() fix the issue ?

No, it won't. At first I thought the cause was obvious, as can be inferred from my first comment: tipc_sk_timeout() grabs sk.slock and sends a message to the receiver socket, which for some reason chooses to return it to the sender. In tipc_sk_rcv() we try to grab sk.slock again, and hence we end up in a deadlock. Unfortunately this theory is not correct: tipc_sk_timeout() actually releases sk.slock before it sends out the buffer, so there is no way a deadlock can happen this way. Besides, given that we don't actually spin on a spin_trylock_bh() call, I believe we are on the wrong track when suspecting a slock deadlock.

A message produced by tipc_sk_timeout() can only be a CONN_MANAGER/CONN_PROBE message, but I don't see filter_rcv() calling tipc_sk_proto_rcv() anywhere. The explanation seems to be that tipc_sk_proto_rcv() is inlined. So we can probably assume that the message causing the deadlock is a regular CONN_MANAGER/CONN_PROBE_REPLY.

So my theory right now is that the reply message for some reason causes tipc_sk_rcv() to enter a loop where 'inputq' is never emptied. This is as far as I get in my analysis right now. Does anybody else have any good ideas?

///jon

> Thanks,
> Guna

--
Mobile security can be enabling, not merely restricting. Employees who bring their own devices (BYOD) to work are irked by the imposition of MDM restrictions. Mobile Device Manager Plus allows you to control only the apps on BYO-devices by containerizing them, leaving personal data untouched! https://ad.doubleclick.net/ddm/clk/304595813;131938128;j
___
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion
[tipc-discussion] tipc_sk_rcv: Kernel panic on one of the card on 4.4.0
I suspect a glitch on the switch may have caused the loss of the probe or abort message. However, even if the messages are lost for whatever reason, shouldn't the TIPC stack handle a graceful shutdown of the TIPC connection by releasing all the resources, instead of panicking or hanging?

Does lock_sock/release_sock use in tipc_sk_timeout() fix the issue ?

Thanks,
Guna
Re: [tipc-discussion] tipc_sk_rcv: Kernel panic on one of the card on 4.4.0
> -----Original Message-----
> From: GUNA [mailto:gbala...@gmail.com]
> Sent: Friday, 20 May, 2016 11:04
> To: Erik Hugne
> Cc: Richard Alpe; Ying Xue; Parthasarathy Bhuvaragan; Jon Maloy; tipc-discuss...@lists.sourceforge.net
> Subject: Re: [tipc-discussion] tipc_sk_rcv: Kernel panic on one of the card on 4.4.0
>
> Thanks Erik for your quick analysis.
> If it is not a known issue, is there any expert available to investigate
> further why this lockup happens? Otherwise, let me know the patch or fix
> information.
>
> // Guna
>
> On Fri, May 20, 2016 at 1:19 AM, Erik Hugne wrote:
> > A little more awake now. Didn't see this yesterday.
> > Look at the trace from CPU2 in Guna's initial mail.
> >
> > TIPC is recursing into the receive loop a second time, and freezes when
> > it tries to take slock a second time. This is done in a timer callback,
> > and the softirq lockup detector kicks in after ~20s.

This happens because the socket timer does not set the "owned_by_user" flag the way ordinary lock_sock() users do; it just grabs slock as a plain spinlock. I wonder if we could let tipc_sk_timeout() use lock_sock()/release_sock() here, although it is not in user context? This is most certainly one of the background probe messages that are sent once an hour between connected sockets, to verify that the peer socket is still there. In this case the peer socket has disappeared, but then it remains to understand why the sender has not already received an abort message when that happened. Was the connection never set up completely? Is it just bad timing, so the abort message is still waiting in the sender's backlog queue? Or is the abort mechanism broken somehow?

///jon

> >
> > //E
> >
> > [686797.257426] [] _raw_spin_trylock_bh+0x40/0x50
> > [686797.257430] [] tipc_sk_rcv+0xbc/0x490 [tipc]
> > [686797.257432] [] ? tcp_rcv_established+0x40e/0x760
> > [686797.257435] [] tipc_node_xmit+0x11f/0x150 [tipc]
> > [686797.257437] [] ? find_busiest_group+0x153/0x980
> > [686797.257441] [] tipc_node_xmit_skb+0x37/0x60 [tipc]
> > [686797.257444] [] tipc_sk_respond+0x99/0xc0 [tipc]
> > [686797.257447] [] filter_rcv+0x4cd/0x550 [tipc]
> > [686797.257451] [] tipc_sk_rcv+0x2dd/0x490 [tipc]
> > [686797.257454] [] tipc_node_xmit+0x11f/0x150 [tipc]
> > [686797.257458] [] ? tipc_recv_stream+0x370/0x370 [tipc]
> > [686797.257461] [] tipc_node_xmit_skb+0x37/0x60 [tipc]
> > [686797.257464] [] tipc_sk_timeout+0xe0/0x180 [tipc]
Re: [tipc-discussion] tipc_sk_rcv: Kernel panic on one of the card on 4.4.0
All the CPU cards on the system are running the same load. We saw a similar issue about 6 weeks back, but this time it is on one card only, compared to all cards last time. At this time there was very light traffic (handshake).

I saw the following as part of the log; I am not sure whether it contributes to the issue or not:

[686808.930065] ixgbe :01:00.0 p19p2: initiating reset due to tx timeout
[686810.062026] INFO: rcu_sched detected stalls on CPUs/tasks:
[686813.257936] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s!

// details
...
[686797.890210] [] ret_from_fork+0x3f/0x70
[686797.895692] [] ? flush_kthread_worker+0x90/0x90
[686797.901951] Code: 00 eb 02 89 c6 f7 c6 00 ff ff ff 75 41 83 fe 01 89 ca 89 f0 41 0f 44 d0 f0 0f b1 17 39 f0 75 e3 83 fa 01 75 04 eb 0d f3 90 8b 07 <84> c0 75 f8 66 c7 07 01 00 5d c3 8b 37 81 fe 00 01 00 00 75 b6
[686798.930348] ixgbe :01:00.0 p19p2: initiating reset due to tx timeout
[686803.938207] ixgbe :01:00.0 p19p2: initiating reset due to tx timeout
[686808.930065] ixgbe :01:00.0 p19p2: initiating reset due to tx timeout
[686810.062026] INFO: rcu_sched detected stalls on CPUs/tasks:
[686810.067613] 2-...: (2 GPs behind) idle=a27/1/0 softirq=70531357/70531357 fqs=4305315
[686810.075606] (detected by 1, t=13200382 jiffies, g=173829751, c=173829750, q=25641590)
[686810.083624] 880351a83e68 0018 81591bf1 880351a83ec8
[686810.091163] 0002005932b8 00010006 880351a84000 81d1ce20
[686810.098697] 880351a84000 88035fc5d300 81cb2c00 880351a83eb8
[686810.106233] Call Trace:
[686810.108767] [] ? cpuidle_enter_state+0x91/0x200
[686810.115026] [] ? cpuidle_enter+0x17/0x20
[686810.120673] [] ? call_cpuidle+0x37/0x60
[686810.126234] [] ? cpuidle_select+0x13/0x20
[686810.131978] [] ? cpu_startup_entry+0x211/0x2d0
[686810.138156] [] ? start_secondary+0x103/0x130
[686813.257936] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [swapper/3:0]

On Thu, May 19, 2016 at 3:06 PM, Erik Hugne wrote:
> On Thu, May 19, 2016 at 10:34:05AM -0400, GUNA wrote:
>> One of the card in my system is dead and rebooted to recover it.
>> The system is running on Kernel 4.4.0 + some latest TIPC patches.
>> Your earliest feedback of the issue is recommended.
>>
> At first i thought this might be a spinlock contention problem.

_______________________________________________
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion
Re: [tipc-discussion] tipc_sk_rcv: Kernel panic on one of the card on 4.4.0
On Thu, May 19, 2016 at 10:34:05AM -0400, GUNA wrote:
> One of the card in my system is dead and rebooted to recover it.
> The system is running on Kernel 4.4.0 + some latest TIPC patches.
> Your earliest feedback of the issue is recommended.
>

At first I thought this might be a spinlock contention problem.

CPU2 is receiving TIPC traffic on a socket, and is trying to grab a spinlock in tipc_sk_rcv() context (probably sk->sk_lock.slock). The first argument to spin_trylock_bh() is passed in RDI: a01546cc.

CPU3 is sending TIPC data; tipc_node_xmit()->tipc_sk_rcv() indicates that it is traffic between sockets on the same machine. And I think this is the same socket as on CPU2, because we see the same address in RDI: a01546cc.

But this made me unsure:
[686798.930348] ixgbe :01:00.0 p19p2: initiating reset due to tx timeout
Is it contributing to the problem, or is it a side effect of the spinlock contention?

Driver (or HW) bugs _are_ fatal for a network stack, but why would lock contention in a network stack cause NIC TX timeouts?

Do all cards in your system have similar workloads?
Do you see this on multiple cards?

//E
[tipc-discussion] tipc_sk_rcv: Kernel panic on one of the card on 4.4.0
One of the cards in my system died and was rebooted to recover it. The system is running on kernel 4.4.0 + some of the latest TIPC patches. Your earliest feedback on the issue is appreciated. The cascaded failure logs follow: [686797.257405] Modules linked in: nf_log_ipv4 nf_log_common xt_LOG sctp libcrc32c e1000e tipc udp_tunnel ip6_udp_tunnel 8021q garp iTCO_wdt xt_physdev br_netfilter bridge stp llc nf_conntrack_ipv4 ipmiq_drv(O) sio_mmc(O) nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack event_drv(O) ip6table_filter ip6_tables ddi(O) lockd pt_timer_info(O) grace ixgbe usb_storage igb pcspkr iTCO_vendor_support i2c_algo_bit i2c_i801 ptp i2c_core ioatdma lpc_ich mfd_core tpm_tis dca pps_core intel_ips tpm mdio sunrpc [last unloaded: iTCO_wdt] [686797.257407] CPU: 2 PID: 0 Comm: swapper/2 Tainted: GW O L 4.4.0 #17 [686797.257407] Hardware name: PT AMC124/Base Board Product Name, BIOS LGNAJFIP.PTI.0012.P15 01/15/2014 [686797.257408] task: 880351a751c0 ti: 880351a8 task.ti: 880351a8 [686797.257411] RIP: 0010:[] [] __local_bh_enable_ip+0x4c/0x90 [686797.257412] RSP: 0018:88035fc43aa0 EFLAGS: 0296 [686797.257412] RAX: 0301 RBX: fe01 RCX: 0001 [686797.257413] RDX: 0003 RSI: 0200 RDI: a01546cc [686797.257414] RBP: 88035fc43aa8 R08: 02d4 R09: 0004 [686797.257415] R10: R11: R12: 88035fc43bd0 [686797.257415] R13: ab7500c6 R14: 880350124ec0 R15: 88035fc43be4 [686797.257416] FS: () GS:88035fc4() knlGS: [686797.257417] CS: 0010 DS: ES: CR0: 8005003b [686797.257418] CR2: 7f6e9c244000 CR3: 01c0a000 CR4: 06e0 [686797.257418] Stack: [686797.257420] 88035fc43b70 88035fc43ab8 816de340 88035fc43b38 [686797.257421] a01546cc 88034a8fad10 0020 88035fc43af8 [686797.257423] 81ce6240 880350124f48 00012821c002 88035fc43b58 [686797.257423] Call Trace: [686797.257426] [686797.257426] [] _raw_spin_trylock_bh+0x40/0x50 [686797.257430] [] tipc_sk_rcv+0xbc/0x490 [tipc] [686797.257432] [] ? 
tcp_rcv_established+0x40e/0x760 [686797.257435] [] tipc_node_xmit+0x11f/0x150 [tipc] [686797.257437] [] ? find_busiest_group+0x153/0x980 [686797.257441] [] tipc_node_xmit_skb+0x37/0x60 [tipc] [686797.257444] [] tipc_sk_respond+0x99/0xc0 [tipc] [686797.257447] [] filter_rcv+0x4cd/0x550 [tipc] [686797.257451] [] tipc_sk_rcv+0x2dd/0x490 [tipc] [686797.257454] [] tipc_node_xmit+0x11f/0x150 [tipc] [686797.257458] [] ? tipc_recv_stream+0x370/0x370 [tipc] [686797.257461] [] tipc_node_xmit_skb+0x37/0x60 [tipc] [686797.257464] [] tipc_sk_timeout+0xe0/0x180 [tipc] [686797.257468] [] ? tipc_recv_stream+0x370/0x370 [tipc] [686797.257469] [] call_timer_fn+0x44/0x110 [686797.257470] [] ? cascade+0x4a/0x80 [686797.257474] [] ? tipc_recv_stream+0x370/0x370 [tipc] [686797.257475] [] run_timer_softirq+0x22c/0x280 [686797.257477] [] __do_softirq+0xc8/0x260 [686797.257478] [] irq_exit+0x83/0xb0 [686797.257480] [] do_IRQ+0x65/0xf0 [686797.257481] [] common_interrupt+0x7f/0x7f [686797.257484] [686797.257484] [] ? cpuidle_enter_state+0xad/0x200 [686797.257485] [] ? cpuidle_enter_state+0x91/0x200 [686797.257486] [] cpuidle_enter+0x17/0x20 [686797.257488] [] call_cpuidle+0x37/0x60 [686797.257489] [] ? 
cpuidle_select+0x13/0x20 [686797.257490] [] cpu_startup_entry+0x211/0x2d0 [686797.257491] [] start_secondary+0x103/0x130 [686797.257506] Code: 80 c7 00 01 75 31 83 eb 01 f7 db 65 01 1d 05 6e f8 7e 65 8b 05 fe 6d f8 7e a9 00 ff 1f 00 74 31 65 ff 0d f0 6d f8 7e 48 83 c4 08 <5b> 5d c3 9c 58 f6 c4 02 75 d1 eb c6 be 96 00 00 00 48 c7 c7 82 [686797.640076] Modules linked in: nf_log_ipv4 nf_log_common xt_LOG sctp libcrc32c e1000e tipc udp_tunnel ip6_udp_tunnel 8021q garp iTCO_wdt xt_physdev br_netfilter bridge stp llc nf_conntrack_ipv4 ipmiq_drv(O) sio_mmc(O) nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack event_drv(O) ip6table_filter ip6_tables ddi(O) lockd pt_timer_info(O) grace ixgbe usb_storage igb pcspkr iTCO_vendor_support i2c_algo_bit i2c_i801 ptp i2c_core ioatdma lpc_ich mfd_core tpm_tis dca pps_core intel_ips tpm mdio sunrpc [last unloaded: iTCO_wdt] [686797.691099] CPU: 0 PID: 28367 Comm: kworker/u32:2 Tainted: G W O L 4.4.0 #17 [686797.699005] Hardware name: PT AMC124/Base Board Product Name, BIOS LGNAJFIP.PTI.0012.P15 01/15/2014 [686797.708119] Workqueue: tipc_send tipc_send_work [tipc] [686797.713356] task: 88034e8ad1c0 ti: 8800998f task.ti: 8800998f [686797.720906] RIP: 0010:[] [] queued_spin_lock_slowpath+0x46/0x160 [686797.730220] RSP: 0018:8800998f3c78 EFLAGS: 0202 [686797.735606] RAX: 0101