Hi again Partha

Actually, cancel that last message: it looks like I cannot simply try your patch as the code it is based on is too different than what I am running. Recall that my source base is kernel 4.4.0, and I assume this patch was based on much newer source.

I will probably have to look into upgrading the entire kernel...

Peter

-----Original Message-----
From: Butler, Peter
Sent: July-24-17 11:21 AM
To: Parthasarathy Bhuvaragan <[email protected]>; [email protected]
Cc: Jon Maloy <[email protected]>; Ying Xue <[email protected]>; LUU Duc Canh <[email protected]>
Subject: RE: TIPC connection stalling due to invalid congestion status when bearer 0 recovers

Hi Partha

I can try your patch. Is it just the one change to socket.c shown on that link? I ask because that patch is named [PATCH net v1 1/6] and I am not sure if the other parts (2/6, 3/6, 4/6, 5/6, 6/6) are also required for this particular issue.

Peter

-----Original Message-----
From: Parthasarathy Bhuvaragan [mailto:[email protected]]
Sent: July-24-17 8:58 AM
To: Butler, Peter <[email protected]>; [email protected]
Cc: Jon Maloy <[email protected]>; Ying Xue <[email protected]>; LUU Duc Canh <[email protected]>
Subject: Re: TIPC connection stalling due to invalid congestion status when bearer 0 recovers

Hi Peter,

Have you looked through this?
https://sourceforge.net/p/tipc/mailman/message/35809792/

The symptoms you describe are identical to mine; it's worth trying my patch on your system. I need to address comments from Jon.M before pushing it to net-next.

regards
Partha

On 07/21/2017 10:20 PM, Butler, Peter wrote:
> Hello,
>
> I am using a 19-node TIPC configuration, whereby each card (node) in
> the mesh has two Ethernet interfaces connected to two disjoint subnets
> served by switch0 and switch1, respectively. TIPC is set to use two
> bearers on each card. 16 of these cards are using TIPC 4.4.0 (with a
> few patches backported from later releases as per Jon Maloy,
> Parthasarathy Bhuvaragan, and Ying Xue). (The other 3 cards are using
> a much older 1.7-based TIPC, but are not actually involved in the
> testing pertaining to the issue detailed below.)
>
> There are applications on several of the (4.4.0-based) cards which are
> collectively sending/receiving about 500 TIPC msg/s (i.e. in total, not each).
>
> When I reboot switch0, I often get strange behaviour soon after the switch
> comes back into service. To be clear, there are no issues that appear to
> stem from the loss of connectivity on the switch0 Ethernet fabric: while that
> switch is rebooting (or powered off, or otherwise unavailable) the
> applications behave fine by using the Ethernet fabric associated with
> switch1. However, shortly after switch0 returns to service, one or more of
> the cards in the TIPC mesh will often then experience problems on the switch0
> fabric.
>
> Specifically, the sendto() calls (on the cards in question) will fail. By
> default, we are using a blocking sendto() call, and the associated process is
> being put to sleep by the kernel at this line in socket.c:
>
> static int tipc_wait_for_sndmsg(struct socket *sock, long *timeo_p)
> {
>         struct sock *sk = sock->sk;
>         struct tipc_sock *tsk = tipc_sk(sk);
>         DEFINE_WAIT(wait);
>         int done;
>
>         do {
>                 int err = sock_error(sk);
>                 if (err)
>                         return err;
>                 if (sock->state == SS_DISCONNECTING)
>                         return -EPIPE;
>                 if (!*timeo_p)
>                         return -EAGAIN;
>                 if (signal_pending(current))
>                         return sock_intr_errno(*timeo_p);
>
>                 prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
>                 done = sk_wait_event(sk, timeo_p, !tsk->link_cong);
>                                                  <---------------------
>                 finish_wait(sk_sleep(sk), &wait);
>         } while (!done);
>         return 0;
> }
>
> Once in this state the process never recovers, and at the very least needs to
> be killed off and restarted, or the card rebooted.
>
> When changing this to a non-blocking sendto() call, the process is no longer
> put to sleep, but will forever fail the sendto() calls with -EAGAIN, and once
> again at the very least needs to be killed off and restarted, or the card
> rebooted.
>
> The TIPC traffic in question is connectionless, on a SOCK_RDM socket, and
> with destination-droppable set to false.
>
> Note that the hardware setup I am using is essentially identical to that used
> by Andrew Booth in his recent post "TIPC issue: connection stalls when switch
> for bearer 0 recovers" - both issues are almost certainly related, if not
> identical. Although in each of our cases the problem was detected using
> different application-level software.
>
> Could it be that TIPC is erroneously flagging the link as being
> congested and thus preventing any further traffic on it? (Just
> speculating, based on the line of code shown above.)
>
> Peter Butler
>

_______________________________________________
tipc-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/tipc-discussion
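
The original report describes a blocking sendto() that sleeps forever and a non-blocking sendto() that fails with -EAGAIN forever. As an application-side mitigation only (it does not address the suspected stale congestion flag in the kernel), a sender could bound how long it waits so that a stuck socket is detected and can be reopened instead of hanging the process. The sketch below is a minimal illustration under stated assumptions: the helper name send_with_deadline and the timeout value are hypothetical, it is written against the generic sockets API rather than anything TIPC-specific, and whether poll() on a TIPC 4.4.0 socket reports writability consistently with the congestion state has not been verified here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>

/* Monotonic clock in milliseconds, unaffected by wall-clock changes. */
static long long now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/*
 * Hypothetical helper (not part of TIPC or any library): try to send one
 * datagram, waiting at most timeout_ms in total for the socket to accept it.
 * Returns the number of bytes sent, or -1 with errno set. errno is ETIMEDOUT
 * when the socket kept reporting EAGAIN until the deadline passed.
 */
ssize_t send_with_deadline(int fd, const void *buf, size_t len,
                           const struct sockaddr *dest, socklen_t dest_len,
                           int timeout_ms)
{
    assert(buf != NULL && timeout_ms >= 0);
    const long long deadline = now_ms() + timeout_ms;

    for (;;) {
        /* MSG_DONTWAIT makes this single call non-blocking. */
        ssize_t n = sendto(fd, buf, len, MSG_DONTWAIT, dest, dest_len);
        if (n >= 0)
            return n;
        if (errno == EINTR)
            continue;
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;              /* a real error, report it as-is */

        long long remaining = deadline - now_ms();
        if (remaining <= 0) {
            errno = ETIMEDOUT;      /* still unwritable at the deadline */
            return -1;
        }

        /* Sleep until the socket looks writable or the time runs out. */
        struct pollfd pfd = { .fd = fd, .events = POLLOUT };
        int rc = poll(&pfd, 1, (int)remaining);
        if (rc < 0 && errno != EINTR)
            return -1;
        /* On timeout or wake-up, retry; the deadline check ends the loop. */
    }
}
```

A caller that gets -1 with errno == ETIMEDOUT could log the event and close and recreate the socket, rather than remaining asleep in the kernel. If poll() reports the socket as writable while sendto() still returns EAGAIN, this loop spins until the deadline, so the timeout should be kept short.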
