Hello,
I am using a 19-node TIPC configuration, whereby each card (node) in the mesh
has two Ethernet interfaces connected to two disjoint subnets served by switch0
and switch1, respectively. TIPC is set to use two bearers on each card. 16 of
these cards are using TIPC 4.4.0 (with a few patches backported from later
releases as per Jon Maloy, Parthasarathy Bhuvaragan, and Ying Xue). (The
other 3 cards are using a much older 1.7-based TIPC, but are not actually
involved in the testing pertaining to the issue detailed below.)
There are applications on several of the (4.4.0-based) cards which are
collectively sending/receiving about 500 TIPC msg/s (i.e. in total, not each).
When I reboot switch0, I often get strange behaviour soon after the switch
comes back into service. To be clear, there are no issues that appear to stem
from the loss of connectivity on the switch0 Ethernet fabric: while that switch
is rebooting (or powered off, or otherwise unavailable) the applications behave
fine by using the Ethernet fabric associated with switch1. However, shortly
after switch0 returns to service, one or more of the cards in the TIPC mesh
will often then experience problems on the switch0 fabric.
Specifically, the sendto() calls on the affected cards stop completing. By
default we use a blocking sendto() call, and the calling process is put to
sleep by the kernel at the marked line of tipc_wait_for_sndmsg() in socket.c:

static int tipc_wait_for_sndmsg(struct socket *sock, long *timeo_p)
{
	struct sock *sk = sock->sk;
	struct tipc_sock *tsk = tipc_sk(sk);
	DEFINE_WAIT(wait);
	int done;

	do {
		int err = sock_error(sk);
		if (err)
			return err;
		if (sock->state == SS_DISCONNECTING)
			return -EPIPE;
		if (!*timeo_p)
			return -EAGAIN;
		if (signal_pending(current))
			return sock_intr_errno(*timeo_p);

		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
		done = sk_wait_event(sk, timeo_p, !tsk->link_cong);  <---------------------
		finish_wait(sk_sleep(sk), &wait);
	} while (!done);
	return 0;
}

Once in this state the process never recovers, and at the very least needs to
be killed off and restarted, or the card rebooted.
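
One way to keep a blocking sendto() from sleeping indefinitely at that point
would be a send timeout on the socket. The sketch below is only an
illustration and is not part of the setup described in this post:
set_send_timeout_ms() is a hypothetical helper name, and it assumes that the
timeo_p value seen by tipc_wait_for_sndmsg() is derived from SO_SNDTIMEO, as
is usual for Linux sockets. It would not clear whatever keeps link_cong set;
it would only turn an unbounded sleep into an error the application can see.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/time.h>

/*
 * Hypothetical helper: limit how long a blocking send on fd may sleep.
 * Returns 0 on success, or -1 with errno set by setsockopt().
 */
int set_send_timeout_ms(int fd, long timeout_ms)
{
	struct timeval tv;

	assert(timeout_ms >= 0);	/* a negative timeout is a caller bug */

	tv.tv_sec = timeout_ms / 1000;
	tv.tv_usec = (timeout_ms % 1000) * 1000;

	return setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv));
}
```

With such a timeout in place, a blocking sendto() that cannot proceed should
return -1 with errno set to EAGAIN (or EWOULDBLOCK) once the timeout expires,
rather than sleeping forever.
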
When the blocking sendto() is changed to a non-blocking call, the process is
no longer put to sleep, but every subsequent sendto() call fails with EAGAIN,
and once again the process at the very least needs to be killed off and
restarted, or the card rebooted.
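
To tell a brief burst of congestion apart from the permanent EAGAIN state
described above, an application could bound its retries. The following is a
minimal sketch under stated assumptions, not the code used in this testing:
send_bounded(), max_tries and wait_ms are hypothetical names, and it assumes
that poll() reports POLLOUT on the socket once sending is possible again.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper: attempt a non-blocking sendto() at most max_tries
 * times, waiting up to wait_ms for POLLOUT between attempts.
 * Returns the number of bytes sent, or -1 with errno set. errno is
 * ETIMEDOUT when every attempt failed with EAGAIN/EWOULDBLOCK (or was
 * interrupted), which would suggest the socket is stuck, not just busy.
 */
ssize_t send_bounded(int fd, const void *buf, size_t len,
		     const struct sockaddr *dst, socklen_t dst_len,
		     int max_tries, int wait_ms)
{
	assert(max_tries > 0 && wait_ms >= 0);

	for (int attempt = 0; attempt < max_tries; attempt++) {
		ssize_t n = sendto(fd, buf, len, MSG_DONTWAIT, dst, dst_len);

		if (n >= 0)
			return n;
		/* Anything other than "try again" is a hard error. */
		if (errno != EAGAIN && errno != EWOULDBLOCK && errno != EINTR)
			return -1;

		/* Sleep until the socket looks writable or wait_ms passes. */
		struct pollfd pfd = { .fd = fd, .events = POLLOUT };
		if (poll(&pfd, 1, wait_ms) < 0 && errno != EINTR)
			return -1;
	}

	errno = ETIMEDOUT;
	return -1;
}
```

A caller could treat ETIMEDOUT as the signal to log the event and collect
link state for diagnosis, instead of retrying indefinitely.
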
The TIPC traffic in question is connectionless, on a SOCK_RDM socket, with the
destination-droppable option (TIPC_DEST_DROPPABLE) set to false.
Note that the hardware setup I am using is essentially identical to that used
by Andrew Booth in his recent post "TIPC issue: connection stalls when switch
for bearer 0 recovers". Both issues are almost certainly related, if not
identical, although in each case the problem was detected using different
application-level software.
Could it be that TIPC is erroneously flagging the link as being congested and
thus preventing any further traffic on it? (Just speculating, based on the
line of code shown above.)
Peter Butler