Re: [tipc-discussion] TIPC connection stalling due to invalid congestion status when bearer 0 recovers

Butler, Peter Mon, 24 Jul 2017 12:58:54 -0700

I just tried to recreate the issue with official kernel 4.12.3 (unpatched).  
Instead of the behaviour I described before, now the kernel crashes:


[ 2385.096807] general protection fault: 0000 [#1] SMP
[ 2385.101720] Modules linked in: sctp e1000e tipc udp_tunnel ip6_udp_tunnel iTCO_wdt 8021q garp stp llc nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack libcrc32c ip6table_filter lockd ip6_tables grace igb usb_storage ixgbe iTCO_vendor_support i2c_i801 i2c_algo_bit pcspkr intel_ips ioatdma ptp i2c_core tpm_tis lpc_ich pps_core tpm_tis_core dca mfd_core mdio tpm sunrpc [last unloaded: iTCO_wdt]
[ 2385.142882] CPU: 1 PID: 10980 Comm: yj4flx Not tainted 4.12.3 #1
[ 2385.148996] Hardware name: PT AMC124/Base Board Product Name, BIOS LGNAJFIP.PTI.0012.P15 01/15/2014
[ 2385.158240] task: ffff88034e125280 task.stack: ffffc90005380000
[ 2385.164299] RIP: 0010:kfree_skb_list+0x18/0x30
[ 2385.168923] RSP: 0018:ffffc90005383b18 EFLAGS: 00010202
[ 2385.174319] RAX: 0000000000000004 RBX: ffff88034da506c0 RCX: ffff88034da52600
[ 2385.181777] RDX: ffffc90005383ce0 RSI: ffffffffffffffb8 RDI: 0510000109100001
[ 2385.189253] RBP: ffffc90005383b28 R08: 00000000ffffffb8 R09: 0000000000000300
[ 2385.196531] R10: 0000000000000050 R11: 0000000000000000 R12: ffff88034da52600
[ 2385.204171] R13: 0000000000000000 R14: 00000000fffffff2 R15: 0000000000000000
[ 2385.211838] FS:  0000000000000000(0000) GS:ffff88035fc40000(0063) knlGS:00000000d3e4db40
[ 2385.220426] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
[ 2385.226358] CR2: 00000000f7420fd8 CR3: 000000035032c000 CR4: 00000000000006e0
[ 2385.233748] Call Trace:
[ 2385.236360]  skb_release_data+0xbf/0xf0
[ 2385.240314]  ? tipc_msg_build+0x100/0x450 [tipc]
[ 2385.244927]  skb_release_all+0x28/0x30
[ 2385.248746]  __kfree_skb+0x16/0x80
[ 2385.252236]  kfree_skb+0x41/0xb0
[ 2385.255633]  tipc_msg_build+0x100/0x450 [tipc]
[ 2385.260278]  ? tipc_node_put+0x1a/0x50 [tipc]
[ 2385.264749]  __tipc_sendmsg+0x1e7/0x430 [tipc]
[ 2385.269375]  ? wake_up_process+0x15/0x20
[ 2385.273445]  ? wake_up_q+0x4c/0x80
[ 2385.277066]  tipc_sendmsg+0x42/0x70 [tipc]
[ 2385.281353]  sock_sendmsg+0x47/0x50
[ 2385.284975]  SYSC_sendto+0xd9/0x110
[ 2385.288667]  ? move_addr_to_user+0xab/0xe0
[ 2385.293014]  ? SYSC_getsockname+0x65/0xa0
[ 2385.297182]  SyS_sendto+0xe/0x10
[ 2385.300640]  compat_SyS_socketcall+0x14f/0x1e0
[ 2385.305284]  do_fast_syscall_32+0x8a/0x140
[ 2385.309564]  entry_SYSENTER_compat+0x4c/0x5b
[ 2385.314074] RIP: 0023:0xf7715bf9
[ 2385.317451] RSP: 002b:00000000d3e4d068 EFLAGS: 00000296 ORIG_RAX: 0000000000000066
[ 2385.325362] RAX: ffffffffffffffda RBX: 000000000000000b RCX: 00000000d3e4d080
[ 2385.332967] RDX: 00000000d35107f0 RSI: 0000000000000000 RDI: 00000000d3e4d158
[ 2385.340306] RBP: 00000000d3e4d1d8 R08: 0000000000000000 R09: 0000000000000000
[ 2385.347764] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 2385.355190] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 2385.362667] Code: ff 8f e4 00 00 00 74 8b eb 9a 66 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 53 48 83 ec 08 48 85 ff 75 05 eb 10 48 89 df <48> 8b 1f e8 30 ff ff ff 48 85 db 75 f0 48 83 c4 08 5b 5d c3 0f
[ 2385.382464] RIP: kfree_skb_list+0x18/0x30 RSP: ffffc90005383b18
[ 2385.388611] ---[ end trace 125f5b3fcb6ee71d ]---

-----Original Message-----
From: Butler, Peter 
Sent: July-24-17 11:21 AM
To: Parthasarathy Bhuvaragan <[email protected]>; 
[email protected]
Cc: Jon Maloy <[email protected]>; Ying Xue <[email protected]>; LUU 
Duc Canh <[email protected]>
Subject: RE: TIPC connection stalling due to invalid congestion status when 
bearer 0 recovers

Hi Partha

I can try your patch.  Is it just the one change to socket.c shown on that 
link?  I ask because that patch is named [PATCH net v1 1/6], and I am not sure 
whether the other parts (2/6, 3/6, 4/6, 5/6, 6/6) are also required for this 
particular issue.

Peter

-----Original Message-----
From: Parthasarathy Bhuvaragan [mailto:[email protected]]
Sent: July-24-17 8:58 AM
To: Butler, Peter <[email protected]>; [email protected]
Cc: Jon Maloy <[email protected]>; Ying Xue <[email protected]>; LUU 
Duc Canh <[email protected]>
Subject: Re: TIPC connection stalling due to invalid congestion status when 
bearer 0 recovers

Hi Peter,

Have you looked through this?
https://sourceforge.net/p/tipc/mailman/message/35809792/

The symptoms you describe are identical to mine; it's worth trying my patch on 
your system.

I need to address comments from Jon.M before pushing it to net-next.

regards
Partha

On 07/21/2017 10:20 PM, Butler, Peter wrote:
> Hello,
> 
> I am using a 19-node TIPC configuration, whereby each card (node) in 
> the mesh has two Ethernet interfaces connected to two disjoint subnets 
> served by switch0 and switch1, respectively. TIPC is set to use two 
> bearers on each card.  16 of these cards are using TIPC 4.4.0 (with a 
> few patches backported from later releases as per Jon Maloy, 
> Parthasarathy Bhuvaragan, and Ying Xue).  (The other 3 cards are using 
> a much older 1.7-based TIPC, but are not actually involved in the 
> testing pertaining to the issue detailed below.)
> 
> There are applications on several of the (4.4.0-based) cards which are 
> collectively sending/receiving about 500 TIPC msg/s (i.e. in total, not each).
> 
> When I reboot switch0, I often get strange behaviour soon after the switch 
> comes back into service.  To be clear, there are no issues that appear to 
> stem from the loss of connectivity on the switch0 Ethernet fabric: while that 
> switch is rebooting (or powered off, or otherwise unavailable) the 
> applications behave fine by using the Ethernet fabric associated with 
> switch1.  However, shortly after switch0 returns to service, one or more of 
> the cards in the TIPC mesh will often then experience problems on the switch0 
> fabric.
> 
> Specifically, the sendto() calls (on the cards in question) will fail.  By 
> default, we are using a blocking sendto() call, and the associated process is 
> being put to sleep by the kernel at this line in socket.c:
> 
> static int tipc_wait_for_sndmsg(struct socket *sock, long *timeo_p) {
>     struct sock *sk = sock->sk;
>     struct tipc_sock *tsk = tipc_sk(sk);
>     DEFINE_WAIT(wait);
>     int done;
> 
>     do {
>        int err = sock_error(sk);
>        if (err)
>           return err;
>        if (sock->state == SS_DISCONNECTING)
>           return -EPIPE;
>        if (!*timeo_p)
>           return -EAGAIN;
>        if (signal_pending(current))
>           return sock_intr_errno(*timeo_p);
> 
>        prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
>        done = sk_wait_event(sk, timeo_p, !tsk->link_cong);  <---------------------
>        finish_wait(sk_sleep(sk), &wait);
>     } while (!done);
>     return 0;
> }
> 
> Once in this state the process never recovers, and at the very least needs to 
> be killed off and restarted, or the card rebooted.
> 
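For the blocking case, the quoted loop returns -EAGAIN once *timeo_p reaches zero, so one way to keep a sender from sleeping forever is to give the socket a finite send timeout. The following is only a minimal sketch, assuming that the timeout handed to tipc_wait_for_sndmsg() is derived from SO_SNDTIMEO, as is usual for Linux sockets. The helper name set_send_timeout_ms() is hypothetical and not part of TIPC.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/time.h>

/*
 * Hypothetical helper (not a TIPC API): bound every blocking send on fd
 * to at most timeout_ms milliseconds. A timeout of 0 restores the default
 * behaviour of blocking indefinitely.
 * Returns 0 on success, -1 with errno set on failure.
 */
int set_send_timeout_ms(int fd, long timeout_ms)
{
    assert(timeout_ms >= 0);

    struct timeval tv = {
        .tv_sec  = timeout_ms / 1000,
        .tv_usec = (timeout_ms % 1000) * 1000,
    };

    /* SO_SNDTIMEO is a standard SOL_SOCKET option taking a struct timeval. */
    return setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof tv);
}
```

With such a timeout in place, a blocked sendto() should come back with -1 and errno set to EAGAIN (or EWOULDBLOCK) once the timeout expires, which at least lets the application log the stall and decide what to do. It does not address the stale congestion state itself.
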
> When changing this to a non-blocking sendto() call, the process is no longer 
> put to sleep, but will forever fail the sendto() calls with -EAGAIN, and once 
> again at the very least needs to be killed off and restarted, or the card 
> rebooted.
> 
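The non-blocking symptom (sendto() failing with -EAGAIN indefinitely) can at least be detected in the application by putting an overall deadline on the retries. Below is a rough sketch under stated assumptions: send_with_deadline() and elapsed_ms() are hypothetical helpers, and the sketch assumes that poll() with POLLOUT is a reasonable way to wait for the socket to become writable again. If the congestion flag really is stuck, the call ends with ETIMEDOUT instead of spinning forever.

```c
#define _POSIX_C_SOURCE 200809L /* for clock_gettime() under strict -std= */

#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>

/* Milliseconds elapsed since *start on the monotonic clock. */
static long elapsed_ms(const struct timespec *start)
{
    struct timespec now;

    clock_gettime(CLOCK_MONOTONIC, &now);
    return (long)(now.tv_sec - start->tv_sec) * 1000L +
           (now.tv_nsec - start->tv_nsec) / 1000000L;
}

/*
 * Hypothetical helper (not a TIPC API): send one message without blocking,
 * retrying on EAGAIN until timeout_ms has elapsed in total.
 * Returns the number of bytes sent, or -1 with errno set. errno == ETIMEDOUT
 * means the socket stayed unwritable for the whole window.
 */
ssize_t send_with_deadline(int fd, const void *buf, size_t len,
                           const struct sockaddr *dst, socklen_t dst_len,
                           long timeout_ms)
{
    struct timespec start;

    assert(buf != NULL && timeout_ms >= 0);
    clock_gettime(CLOCK_MONOTONIC, &start);

    for (;;) {
        ssize_t n = sendto(fd, buf, len, MSG_DONTWAIT, dst, dst_len);
        if (n >= 0)
            return n;
        if (errno == EINTR)
            continue;
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;              /* a real error, not back-pressure */

        long left = timeout_ms - elapsed_ms(&start);
        if (left <= 0) {
            errno = ETIMEDOUT;      /* unwritable for the whole window */
            return -1;
        }

        /* Sleep until the socket reports writable or the window closes. */
        struct pollfd pfd = { .fd = fd, .events = POLLOUT };
        if (poll(&pfd, 1, (int)left) < 0 && errno != EINTR)
            return -1;
    }
}
```

An application could treat ETIMEDOUT from this helper as the trigger for closing and reopening the socket, rather than needing the whole process to be killed and restarted.
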
> The TIPC traffic in question is connectionless, on a SOCK_RDM socket, and 
> with destination-droppable set to false.
> 
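For anyone trying to reproduce this, the socket setup described above (connectionless SOCK_RDM with destination-droppable disabled) might look roughly like the sketch below. It assumes the standard <linux/tipc.h> definitions of SOL_TIPC and TIPC_DEST_DROPPABLE. The two helper names are hypothetical, and the actual application code may well differ.

```c
#include <errno.h>
#include <linux/tipc.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: set or clear the TIPC destination-droppable option.
 * Returns 0 on success, -1 with errno set on failure.
 */
int tipc_set_dest_droppable(int fd, int droppable)
{
    unsigned int val = droppable ? 1u : 0u;

    return setsockopt(fd, SOL_TIPC, TIPC_DEST_DROPPABLE, &val, sizeof val);
}

/*
 * Hypothetical helper: open a connectionless SOCK_RDM TIPC socket with
 * destination-droppable set to false.
 * Returns the descriptor, or -1 with errno set (for example when the
 * tipc module is not loaded).
 */
int tipc_open_rdm_undroppable(void)
{
    int fd = socket(AF_TIPC, SOCK_RDM, 0);

    if (fd < 0)
        return -1;
    if (tipc_set_dest_droppable(fd, 0) < 0) {
        int saved = errno;          /* keep the setsockopt() error */

        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}
```
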
> Note that the hardware setup I am using is essentially identical to that used 
> by Andrew Booth in his recent post "TIPC issue: connection stalls when switch 
> for bearer 0 recovers" - both issues are almost certainly related, if not 
> identical, although in each case the problem was detected using different 
> application-level software.
> 
> Could it be that TIPC is erroneously flagging the link as being 
> congested and thus preventing any further traffic on it?  (Just 
> speculating, based on the line of code shown above.)
> 
> Peter Butler
> 
> 


