The system is a dual 8-core Xeon. CPU usage at the time is <3%. The Ethernet link with TIPC enabled has never gone over 200 Mbit/sec (it is a GigE link). The switch is a high-end HP 48-port switch and is not near overload (we do not see any dropped packets on the ~800 Mbit of video we ingest on a different interface).
The only suspicious thing I've seen so far is that I'm doing some raw packet send/receive at the time of the failure (an ARP request to check for duplicate IPs). I'm moving this to a different time to see if the TIPC failure follows.

-----Original Message-----
From: Jon Maloy [mailto:jon.ma...@ericsson.com]
Sent: Friday, December 09, 2016 12:09 PM
To: Rune Torgersen; tipc-discussion@lists.sourceforge.net
Subject: RE: Transmission errors

> -----Original Message-----
> From: Rune Torgersen [mailto:ru...@innovsys.com]
> Sent: Friday, 09 December, 2016 12:03
> To: tipc-discussion@lists.sourceforge.net
> Subject: [tipc-discussion] Transmission errors
>
> Can someone help me decode the following errors in my kernel log?
> (Ubuntu 16.04 with 4.4.0-45)
>
> Dec 8 13:12:10 michelltelctrl1 kernel: [2603354.089708] Retransmission failure on link <1.1.1:eth0-1.1.2:eth0>
> Dec 8 13:12:10 michelltelctrl1 kernel: [2603354.089712] Resetting link Link <1.1.1:eth0-1.1.2:eth0> state e
> Dec 8 13:12:10 michelltelctrl1 kernel: [2603354.089714] XMTQ: 33 [43899-43931], BKLGQ: 0, SNDNX: 43932, RCVNX: 21825
> Dec 8 13:12:10 michelltelctrl1 kernel: [2603354.089715] Failed msg: usr 0, typ 2, len 266, err 0
> Dec 8 13:12:10 michelltelctrl1 kernel: [2603354.089716] sqno 43899, prev: 1001001, src: 1001001

Hi Rune,

It means that a link is reset because TIPC made 100 failed attempts to send the same packet through it. In this case it is a data message (user 0) containing a port name address (type 2), but looking further down I see no pattern in this; it seems to happen with any type of message. What is listed is the contents of the link send queue (33 packets, #43899 to #43931) and the failing packet (#43899) of size 266 bytes.

I can see nothing wrong with the packets or the link, but I observe another interesting pattern: all resets seem to happen on the same minute of the hour (13:12, 14:12 etc.)
Is it possible that you have some recurring hourly job that overloads the switch, the interfaces or the CPUs? I think that might be a point to start.

BR
///jon

> Dec 8 14:12:12 michelltelctrl1 kernel: [2606955.869185] Retransmission failure on link <1.1.1:eth0-1.1.2:eth0>
> Dec 8 14:12:12 michelltelctrl1 kernel: [2606955.869197] Resetting link Link <1.1.1:eth0-1.1.2:eth0> state e
> Dec 8 14:12:12 michelltelctrl1 kernel: [2606955.869201] XMTQ: 50 [44194-44243], BKLGQ: 36, SNDNX: 44244, RCVNX: 21597
> Dec 8 14:12:12 michelltelctrl1 kernel: [2606955.869203] Failed msg: usr 0, typ 2, len 1339, err 0
> Dec 8 14:12:12 michelltelctrl1 kernel: [2606955.869205] sqno 44194, prev: 1001001, src: 1001001
> Dec 8 16:12:17 michelltelctrl1 kernel: [2614160.280016] Retransmission failure on link <1.1.1:eth0-1.1.2:eth0>
> Dec 8 16:12:17 michelltelctrl1 kernel: [2614160.280028] Resetting link Link <1.1.1:eth0-1.1.2:eth0> state e
> Dec 8 16:12:17 michelltelctrl1 kernel: [2614160.280034] XMTQ: 2 [6-7], BKLGQ: 0, SNDNX: 8, RCVNX: 5
> Dec 8 16:12:17 michelltelctrl1 kernel: [2614160.280036] Failed msg: usr 0, typ 2, len 266, err 0
> Dec 8 16:12:17 michelltelctrl1 kernel: [2614160.280038] sqno 6, prev: 1001001, src: 1001001
> Dec 8 21:12:12 michelltelctrl1 kernel: [2632155.842657] Retransmission failure on link <1.1.1:eth0-1.1.2:eth0>
> Dec 8 21:12:12 michelltelctrl1 kernel: [2632155.842668] Resetting link Link <1.1.1:eth0-1.1.2:eth0> state e
> Dec 8 21:12:12 michelltelctrl1 kernel: [2632155.842673] XMTQ: 50 [22133-22182], BKLGQ: 32, SNDNX: 22183, RCVNX: 41898
> Dec 8 21:12:12 michelltelctrl1 kernel: [2632155.842674] Failed msg: usr 12, typ 1, len 1460, err 0
> Dec 8 21:12:12 michelltelctrl1 kernel: [2632155.842676] sqno 22133, prev: 1001001, src: 1001001
> Dec 8 23:12:10 michelltelctrl1 kernel: [2639354.222566] Retransmission failure on link <1.1.1:eth0-1.1.2:eth0>
> Dec 8 23:12:10 michelltelctrl1 kernel: [2639354.222578] Resetting link Link <1.1.1:eth0-1.1.2:eth0> state e
> Dec 8 23:12:10 michelltelctrl1 kernel: [2639354.222583] XMTQ: 23 [22895-22917], BKLGQ: 0, SNDNX: 22918, RCVNX: 42582
> Dec 8 23:12:10 michelltelctrl1 kernel: [2639354.222585] Failed msg: usr 12, typ 1, len 1460, err 0
> Dec 8 23:12:10 michelltelctrl1 kernel: [2639354.222587] sqno 22895, prev: 1001001, src: 1001001
> Dec 9 07:12:06 michelltelctrl1 kernel: [2668150.028976] Retransmission failure on link <1.1.1:eth0-1.1.2:eth0>
> Dec 9 07:12:06 michelltelctrl1 kernel: [2668150.028988] Resetting link Link <1.1.1:eth0-1.1.2:eth0> state e
> Dec 9 07:12:06 michelltelctrl1 kernel: [2668150.028995] XMTQ: 6 [22095-22100], BKLGQ: 0, SNDNX: 22101, RCVNX: 41522
> Dec 9 07:12:06 michelltelctrl1 kernel: [2668150.028997] Failed msg: usr 0, typ 2, len 266, err 0
> Dec 9 07:12:06 michelltelctrl1 kernel: [2668150.029000] sqno 22095, prev: 1001001, src: 1001001
> Dec 9 09:12:16 michelltelctrl1 kernel: [2675360.164471] Retransmission failure on link <1.1.1:eth0-1.1.2:eth0>
> Dec 9 09:12:16 michelltelctrl1 kernel: [2675360.164483] Resetting link Link <1.1.1:eth0-1.1.2:eth0> state e
> Dec 9 09:12:16 michelltelctrl1 kernel: [2675360.164487] XMTQ: 2 [21277-21278], BKLGQ: 0, SNDNX: 21279, RCVNX: 13712
> Dec 9 09:12:16 michelltelctrl1 kernel: [2675360.164489] Failed msg: usr 0, typ 2, len 1347, err 0
> Dec 9 09:12:16 michelltelctrl1 kernel: [2675360.164490] sqno 21277, prev: 1001001, src: 1001001
>
> ------------------------------------------------------------------------------
> Developer Access Program for Intel Xeon Phi Processors
> Access to Intel Xeon Phi processor-based developer platforms.
> With one year of Intel Parallel Studio XE.
> Training and support from Colfax.
> Order your platform today. http://sdm.link/xeonphi
> _______________________________________________
> tipc-discussion mailing list
> tipc-discussion@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/tipc-discussion
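
For anyone who wants to check the "same minute of the hour" pattern on their own logs, here is a rough Python sketch. It pulls out each "Retransmission failure" event, the send queue summary (XMTQ count and sequence range) and the failed message fields, then counts events per minute of the hour. The reading of the fields follows Jon's explanation above. The regular expressions, the function names (`parse_resets`, `minute_histogram`) and the `SAMPLE` text (two events copied from the quoted log, unwrapped to one entry per line) are my own assumptions based on the lines in this thread; other kernel versions may print these messages differently.

```python
import re
from collections import Counter

# Patterns are guesses based on the log lines quoted in this thread
# (kernel 4.4.0-45); they are not taken from the TIPC sources.
STAMP = re.compile(r"^(\w{3})\s+(\d+)\s+(\d{2}):(\d{2}):(\d{2})\s+\S+\s+kernel:")
XMTQ = re.compile(r"XMTQ:\s*(\d+)\s*\[(\d+)-(\d+)\]")
FAILED = re.compile(
    r"Failed msg:\s*usr\s*(\d+),\s*typ\s*(\d+),\s*len\s*(\d+),\s*err\s*(\d+)"
)

# Two events from the quoted log, one log entry per line.
SAMPLE = """\
Dec  8 13:12:10 michelltelctrl1 kernel: [2603354.089708] Retransmission failure on link <1.1.1:eth0-1.1.2:eth0>
Dec  8 13:12:10 michelltelctrl1 kernel: [2603354.089712] Resetting link Link <1.1.1:eth0-1.1.2:eth0> state e
Dec  8 13:12:10 michelltelctrl1 kernel: [2603354.089714] XMTQ: 33 [43899-43931], BKLGQ: 0, SNDNX: 43932, RCVNX: 21825
Dec  8 13:12:10 michelltelctrl1 kernel: [2603354.089715] Failed msg: usr 0, typ 2, len 266, err 0
Dec  8 13:12:10 michelltelctrl1 kernel: [2603354.089716] sqno 43899, prev: 1001001, src: 1001001
Dec  8 14:12:12 michelltelctrl1 kernel: [2606955.869185] Retransmission failure on link <1.1.1:eth0-1.1.2:eth0>
Dec  8 14:12:12 michelltelctrl1 kernel: [2606955.869197] Resetting link Link <1.1.1:eth0-1.1.2:eth0> state e
Dec  8 14:12:12 michelltelctrl1 kernel: [2606955.869201] XMTQ: 50 [44194-44243], BKLGQ: 36, SNDNX: 44244, RCVNX: 21597
Dec  8 14:12:12 michelltelctrl1 kernel: [2606955.869203] Failed msg: usr 0, typ 2, len 1339, err 0
Dec  8 14:12:12 michelltelctrl1 kernel: [2606955.869205] sqno 44194, prev: 1001001, src: 1001001
"""


def parse_resets(lines):
    """Return one dict per 'Retransmission failure' event found in lines."""
    events = []
    for line in lines:
        stamp = STAMP.match(line)
        if stamp and "Retransmission failure" in line:
            month, day, hh, mm, _ss = stamp.groups()
            # A new event starts here; later lines fill in its details.
            events.append({"day": f"{month} {day}", "hour": int(hh), "minute": int(mm)})
            continue
        if not events:
            continue
        queue = XMTQ.search(line)
        if queue:
            count, first, last = map(int, queue.groups())
            events[-1]["xmtq"] = (count, first, last)
        failed = FAILED.search(line)
        if failed:
            usr, typ, length, _err = map(int, failed.groups())
            events[-1].update(usr=usr, typ=typ, len=length)
    return events


def minute_histogram(events):
    """Count events per minute of the hour (0-59)."""
    return Counter(event["minute"] for event in events)


if __name__ == "__main__":
    resets = parse_resets(SAMPLE.splitlines())
    for event in resets:
        print(event)
    print(minute_histogram(resets))
```

To run it against a real log, replace `SAMPLE.splitlines()` with the lines of your own kernel log file. If almost all events land in a single minute bucket, that points towards something periodic on the host or network rather than towards the traffic itself.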