I just tried to recreate the issue with official kernel 4.12.3 (unpatched). Instead of the behaviour I described before, now the kernel crashes:
[ 2385.096807] general protection fault: 0000 [#1] SMP
[ 2385.101720] Modules linked in: sctp e1000e tipc udp_tunnel ip6_udp_tunnel iTCO_wdt 8021q garp stp llc nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack libcrc32c ip6table_filter lockd ip6_tables grace igb usb_storage ixgbe iTCO_vendor_support i2c_i801 i2c_algo_bit pcspkr intel_ips ioatdma ptp i2c_core tpm_tis lpc_ich pps_core tpm_tis_core dca mfd_core mdio tpm sunrpc [last unloaded: iTCO_wdt]
[ 2385.142882] CPU: 1 PID: 10980 Comm: yj4flx Not tainted 4.12.3 #1
[ 2385.148996] Hardware name: PT AMC124/Base Board Product Name, BIOS LGNAJFIP.PTI.0012.P15 01/15/2014
[ 2385.158240] task: ffff88034e125280 task.stack: ffffc90005380000
[ 2385.164299] RIP: 0010:kfree_skb_list+0x18/0x30
[ 2385.168923] RSP: 0018:ffffc90005383b18 EFLAGS: 00010202
[ 2385.174319] RAX: 0000000000000004 RBX: ffff88034da506c0 RCX: ffff88034da52600
[ 2385.181777] RDX: ffffc90005383ce0 RSI: ffffffffffffffb8 RDI: 0510000109100001
[ 2385.189253] RBP: ffffc90005383b28 R08: 00000000ffffffb8 R09: 0000000000000300
[ 2385.196531] R10: 0000000000000050 R11: 0000000000000000 R12: ffff88034da52600
[ 2385.204171] R13: 0000000000000000 R14: 00000000fffffff2 R15: 0000000000000000
[ 2385.211838] FS: 0000000000000000(0000) GS:ffff88035fc40000(0063) knlGS:00000000d3e4db40
[ 2385.220426] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
[ 2385.226358] CR2: 00000000f7420fd8 CR3: 000000035032c000 CR4: 00000000000006e0
[ 2385.233748] Call Trace:
[ 2385.236360] skb_release_data+0xbf/0xf0
[ 2385.240314] ? tipc_msg_build+0x100/0x450 [tipc]
[ 2385.244927] skb_release_all+0x28/0x30
[ 2385.248746] __kfree_skb+0x16/0x80
[ 2385.252236] kfree_skb+0x41/0xb0
[ 2385.255633] tipc_msg_build+0x100/0x450 [tipc]
[ 2385.260278] ? tipc_node_put+0x1a/0x50 [tipc]
[ 2385.264749] __tipc_sendmsg+0x1e7/0x430 [tipc]
[ 2385.269375] ? wake_up_process+0x15/0x20
[ 2385.273445] ? wake_up_q+0x4c/0x80
[ 2385.277066] tipc_sendmsg+0x42/0x70 [tipc]
[ 2385.281353] sock_sendmsg+0x47/0x50
[ 2385.284975] SYSC_sendto+0xd9/0x110
[ 2385.288667] ? move_addr_to_user+0xab/0xe0
[ 2385.293014] ? SYSC_getsockname+0x65/0xa0
[ 2385.297182] SyS_sendto+0xe/0x10
[ 2385.300640] compat_SyS_socketcall+0x14f/0x1e0
[ 2385.305284] do_fast_syscall_32+0x8a/0x140
[ 2385.309564] entry_SYSENTER_compat+0x4c/0x5b
[ 2385.314074] RIP: 0023:0xf7715bf9
[ 2385.317451] RSP: 002b:00000000d3e4d068 EFLAGS: 00000296 ORIG_RAX: 0000000000000066
[ 2385.325362] RAX: ffffffffffffffda RBX: 000000000000000b RCX: 00000000d3e4d080
[ 2385.332967] RDX: 00000000d35107f0 RSI: 0000000000000000 RDI: 00000000d3e4d158
[ 2385.340306] RBP: 00000000d3e4d1d8 R08: 0000000000000000 R09: 0000000000000000
[ 2385.347764] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 2385.355190] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 2385.362667] Code: ff 8f e4 00 00 00 74 8b eb 9a 66 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 53 48 83 ec 08 48 85 ff 75 05 eb 10 48 89 df <48> 8b 1f e8 30 ff ff ff 48 85 db 75 f0 48 83 c4 08 5b 5d c3 0f
[ 2385.382464] RIP: kfree_skb_list+0x18/0x30 RSP: ffffc90005383b18
[ 2385.388611] ---[ end trace 125f5b3fcb6ee71d ]---

-----Original Message-----
From: Butler, Peter
Sent: July-24-17 11:21 AM
To: Parthasarathy Bhuvaragan <[email protected]>; [email protected]
Cc: Jon Maloy <[email protected]>; Ying Xue <[email protected]>; LUU Duc Canh <[email protected]>
Subject: RE: TIPC connection stalling due to invalid congestion status when bearer 0 recovers

Hi Partha

I can try your patch. Is it just the one change to socket.c shown on that link? I ask because that patch is named [PATCH net v1 1/6] and not sure if the other parts (2/6, 3/6, 4/6, 5/6, 6/6) are also required for this particular issue.

Peter

-----Original Message-----
From: Parthasarathy Bhuvaragan [mailto:[email protected]]
Sent: July-24-17 8:58 AM
To: Butler, Peter <[email protected]>; [email protected]
Cc: Jon Maloy <[email protected]>; Ying Xue <[email protected]>; LUU Duc Canh <[email protected]>
Subject: Re: TIPC connection stalling due to invalid congestion status when bearer 0 recovers

Hi Peter,

Have you looked through this?
https://sourceforge.net/p/tipc/mailman/message/35809792/

The symptoms you describe are identical to mine; it is worth trying my patch on your system. I need to address comments from Jon.M before pushing it to net-next.

regards
Partha

On 07/21/2017 10:20 PM, Butler, Peter wrote:
> Hello,
>
> I am using a 19-node TIPC configuration, whereby each card (node) in
> the mesh has two Ethernet interfaces connected to two disjoint subnets
> served by switch0 and switch1, respectively. TIPC is set to use two
> bearers on each card. 16 of these cards are using TIPC 4.4.0 (with a
> few patches backported from later releases as per Jon Maloy,
> Parthasarathy Bhuvaragan, and Ying Xue). (The other 3 cards are using
> a much older 1.7-based TIPC, but are not actually involved in the
> testing pertaining to the issue detailed below.)
>
> There are applications on several of the (4.4.0-based) cards which are
> collectively sending/receiving about 500 TIPC msg/s (i.e. in total, not each).
>
> When I reboot switch0, I often get strange behaviour soon after the switch
> comes back into service. To be clear, there are no issues that appear to
> stem from the loss of connectivity on the switch0 Ethernet fabric: while that
> switch is rebooting (or powered off, or otherwise unavailable) the
> applications behave fine by using the Ethernet fabric associated with
> switch1. However, shortly after switch0 returns to service, one or more of
> the cards in the TIPC mesh will often then experience problems on the switch0
> fabric.
>
> Specifically, the sendto() calls (on the cards in question) will fail. By
> default, we are using a blocking sendto() call, and the associated process is
> being put to sleep by the kernel at this line in socket.c:
>
> static int tipc_wait_for_sndmsg(struct socket *sock, long *timeo_p) {
>     struct sock *sk = sock->sk;
>     struct tipc_sock *tsk = tipc_sk(sk);
>     DEFINE_WAIT(wait);
>     int done;
>
>     do {
>         int err = sock_error(sk);
>         if (err)
>             return err;
>         if (sock->state == SS_DISCONNECTING)
>             return -EPIPE;
>         if (!*timeo_p)
>             return -EAGAIN;
>         if (signal_pending(current))
>             return sock_intr_errno(*timeo_p);
>
>         prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
>         done = sk_wait_event(sk, timeo_p, !tsk->link_cong);   <---------------------
>         finish_wait(sk_sleep(sk), &wait);
>     } while (!done);
>     return 0;
> }
>
> Once in this state the process never recovers, and at the very least needs to
> be killed off and restarted, or the card rebooted.
>
> When changing this to a non-blocking sendto() call, the process is no longer
> put to sleep, but will forever fail the sendto() calls with -EAGAIN, and once
> again at the very least needs to be killed off and restarted, or the card
> rebooted.
>
> The TIPC traffic in question is connectionless, on a SOCK_RDM socket, and
> with destination-droppable set to false.
>
> Note that the hardware setup I am using is essentially identical to that used
> by Andrew Booth in his recent post "TIPC issue: connection stalls when switch
> for bearer 0 recovers" - both issues are almost certainly related, if not
> identical. Although in each of our cases the problem was detected using
> different application-level software.
>
> Could it be that TIPC is erroneously flagging the link as being
> congested and thus preventing any further traffic on it? (Just
> speculating, based on the line of code shown above.)
>
> Peter Butler
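
For the non-blocking case in the original report (sendto() failing forever with -EAGAIN), an application can at least bound its retries so that a stalled socket is reported instead of being polled indefinitely. The following is only a sketch under stated assumptions: send_bounded(), max_tries and wait_ms are hypothetical names that do not come from TIPC or from the reporter's application, and the helper is socket-type agnostic (for a TIPC SOCK_RDM socket the caller would pass a struct sockaddr_tipc as dst). It does not address the kernel-side congestion flag itself.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper, for illustration only.
 * Tries a non-blocking sendto(). On EAGAIN/EWOULDBLOCK it waits up to
 * wait_ms for the socket to become writable and tries again, for at
 * most max_tries attempts in total.
 * Returns the number of bytes sent, or -1 with errno set. errno is
 * ETIMEDOUT when every attempt was refused, i.e. a suspected stall.
 */
ssize_t send_bounded(int fd, const void *buf, size_t len,
                     const struct sockaddr *dst, socklen_t dst_len,
                     int max_tries, int wait_ms)
{
    for (int attempt = 0; attempt < max_tries; attempt++) {
        ssize_t n = sendto(fd, buf, len, MSG_DONTWAIT, dst, dst_len);
        if (n >= 0)
            return n;
        if (errno == EINTR)
            continue;                   /* interrupted: counts as one attempt */
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;                  /* hard error: report it unchanged */

        /* Sleep until the socket reports writable or wait_ms elapses. */
        struct pollfd pfd = { .fd = fd, .events = POLLOUT };
        (void)poll(&pfd, 1, wait_ms);
    }
    errno = ETIMEDOUT;
    return -1;
}
```

A caller that sees -1 with errno == ETIMEDOUT could log the event and close and reopen the socket, which roughly corresponds to the "kill and restart" workaround described above but without restarting the whole process. Whether reopening the socket is sufficient in the stalled state has not been verified here.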

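
A note on reading the 4.12.3 trace at the top of this thread: the faulting bytes marked in the Code: line, <48> 8b 1f, decode to mov (%rdi),%rbx, and RDI holds 0510000109100001. On x86-64 a load through a non-canonical address raises a general protection fault rather than a page fault, which would match the oops type. The sketch below checks canonicality under the assumption of 48-bit virtual addresses (4-level paging); is_canonical_48() is a hypothetical name used only for illustration. This only suggests that the list pointer handed to kfree_skb_list was not a valid kernel pointer; it does not show where the value came from.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumption: 48-bit virtual addresses (4-level paging).
 * An address is canonical when bits 63..47 are all zeros or all ones.
 */
bool is_canonical_48(uint64_t addr)
{
    uint64_t top = addr >> 47;          /* the 17 bits 63..47 */
    return top == 0 || top == 0x1ffff;
}
```

Applied to the registers in the trace, RDI (0x0510000109100001) fails the check while RBX (0xffff88034da506c0) passes it.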