Hi Jianfeng,

This is a really hard one. The kernel is very old, and as a pure TIPC maintainer I do not recognize the problem. However, we do have people in our company working with OpenSAF, so I will cc your message to one of them in case it is something he recognizes.
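To make it easier to compare notes, here is a minimal sketch (not a diagnosis) of how the TIPC link events in the quoted syslog could be tabulated to see which links went down and never came back. It assumes each syslog entry sits on a single line, as it would in the original log file rather than in this wrapped mail. The function names `parse_tipc_events` and `links_still_down` are hypothetical helpers, not part of TIPC or OpenSAF.

```python
import re
from datetime import datetime

# Matches the kernel TIPC messages seen in the quoted syslog.
# Assumption: one log entry per line, in the layout
# "<timestamp> <facility.level> <host> kernel: tipc: <message>".
EVENT_RE = re.compile(
    r"^(?P<ts>\S+)\s+\S+\s+(?P<host>\S+)\s+kernel: tipc: "
    r"(?P<event>Established link|Resetting link|Lost link|Lost contact with)\s+"
    r"<(?P<target>[^>]+)>"
)


def parse_tipc_events(lines):
    """Return (timestamp, host, event, target) tuples for TIPC link messages."""
    events = []
    for line in lines:
        match = EVENT_RE.match(line.strip())
        if match is None:
            continue  # not a TIPC link message
        timestamp = datetime.fromisoformat(match.group("ts"))
        events.append(
            (timestamp, match.group("host"), match.group("event"), match.group("target"))
        )
    return events


def links_still_down(events):
    """Map each link whose latest up/down event is 'Lost link' to that event's time."""
    state = {}
    for timestamp, _host, event, target in sorted(events, key=lambda e: e[0]):
        if event == "Established link":
            state[target] = ("up", timestamp)
        elif event == "Lost link":
            state[target] = ("down", timestamp)
    return {link: ts for link, (status, ts) in state.items() if status == "down"}


# Sample entries taken from the quoted syslog, joined onto single lines.
SAMPLE = [
    "2018-04-09T09:53:12.705777+08:00 kern.info pld0102 kernel: tipc: "
    "Established link <1.1.2:bond0-1.1.5:bond0> on network plane A",
    "2018-04-09T09:55:42.428828+08:00 kern.warning pld0102 kernel: tipc: "
    "Resetting link <1.1.2:bond0-1.1.5:bond0>, peer not responding",
    "2018-04-09T09:55:42.428879+08:00 kern.info pld0102 kernel: tipc: "
    "Lost link <1.1.2:bond0-1.1.5:bond0> on network plane A",
    "2018-04-09T09:55:42.428892+08:00 kern.info pld0102 kernel: tipc: "
    "Lost contact with <1.1.5>",
]

if __name__ == "__main__":
    for link, lost_at in links_still_down(parse_tipc_events(SAMPLE)).items():
        print(f"{link} down since {lost_at.isoformat()}")
```

Running the same functions over the full syslog from both the payload node and a controller node would show whether the peers also logged a link reset at the same time, which is usually the first thing worth checking.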
BR
///jon

> -----Original Message-----
> From: Jianfeng Dong [mailto:[email protected]]
> Sent: Wednesday, April 18, 2018 03:23
> To: [email protected]
> Subject: [tipc-discussion] TIPC did not recover after a short time network problem
>
> Hi,
>
> We have a TIPC issue in our product. We are using an old kernel (3.10.38), so I
> think someone may already know this case and be able to help us with it.
>
> Our product is a cluster system with two controller nodes and several payload
> nodes. We deploy OpenSAF in our system to manage these nodes via TIPC.
>
> Several days ago we rebooted a payload node in the system. After the
> reboot, the payload node hit a short-lived network chip/driver problem, and both
> TIPC and other protocols (such as TCP) were impacted.
> The network recovered immediately, and the programs based on other
> protocols such as TCP also recovered quickly, but TIPC did not come back until
> the next reboot.
>
> Below is the syslog for this case:
>
> 1. After the node 'pld0102' rebooted, it successfully set up TIPC connections
> with the other nodes.
> 2018-04-09T09:53:12.705735+08:00 kern.info pld0102 kernel: tipc: Established link <1.1.2:bond0-1.1.15:eth2> on network plane A
> 2018-04-09T09:53:12.705777+08:00 kern.info pld0102 kernel: tipc: Established link <1.1.2:bond0-1.1.5:bond0> on network plane A
> 2018-04-09T09:53:12.705853+08:00 kern.info pld0102 kernel: tipc: Established link <1.1.2:bond0-1.1.10:bond0> on network plane A
> 2018-04-09T09:53:12.706010+08:00 kern.info pld0102 kernel: tipc: Established link <1.1.2:bond0-1.1.16:eth2> on network plane A
> 2018-04-09T09:53:12.706022+08:00 kern.info pld0102 kernel: tipc: Established link <1.1.2:bond0-1.1.12:bond0> on network plane A
>
> 2. Several minutes after the reboot, the network chip/driver had a
> problem and recovered immediately. The programs based on the TCP/IP
> protocol were also impacted and recovered.
> 2018-04-09T09:54:28.061865+08:00 user.info pld0102 AutoRecoverReloadFail.py: isScmAccessible 100.100.1.15 : succeed, return [Warning: Permanently added '100.100.1.15' (ECDSA) to the list of known hosts.#0 15#015#[email protected]'s password:]
> 2018-04-09T09:54:28.277046+08:00 user.info pld0102 AutoRecoverReloadFail.py: isScmAccessible 100.100.1.16 : succeed, return [Warning: Permanently added '100.100.1.16' (ECDSA) to the list of known hosts.#0 15#015#[email protected]'s password:]
> ===>>> THE PAYLOAD NODE 'pld0102' CAN ACCESS THE CONTROLLER NODE
> 2018-04-09T09:54:28.377690+08:00 user.info pld0102 AutoRecoverReloadFail.py: sleep for 20 seconds(failure 0, loop count 1)
> 2018-04-09T09:54:53.406043+08:00 user.warning pld0102 AutoRecoverReloadFail.py: isScmAccessible 100.100.1.15 : fail, return [Error: ]
> ===>>> THE PAYLOAD NODE 'pld0102' COULD NOT ACCESS THE CONTROLLER NODE SUDDENLY.
> 2018-04-09T09:54:53.908054+08:00 user.info pld0102 AutoRecoverReloadFail.py: sleep for 18 seconds(failure 1, loop count 2)
> 2018-04-09T09:55:12.040157+08:00 user.info pld0102 AutoRecoverReloadFail.py: isScmAccessible 100.100.1.15 : succeed, return [Warning: Permanently added '100.100.1.15' (ECDSA) to the list of known hosts.#0 15#015#[email protected]'s password:]
> ===>>> TCP/IP PROTOCOL RECOVERED AND THEN 'pld0102' CAN CONTINUE TO ACCESS THE CONTROLLER NODE
> 2018-04-09T09:55:12.262501+08:00 user.info pld0102 AutoRecoverReloadFail.py: isScmAccessible 100.100.1.16 : succeed, return [Warning: Permanently added '100.100.1.16' (ECDSA) to the list of known hosts.#0 15#015#[email protected]'s password:]
> 2018-04-09T09:55:12.363050+08:00 user.info pld0102 AutoRecoverReloadFail.py: sleep for 15 seconds(failure 0, loop count 3)
> 2018-04-09T09:55:27.510388+08:00 user.info pld0102 AutoRecoverReloadFail.py: isScmAccessible 100.100.1.15 : succeed, return [Warning: Permanently added '100.100.1.15' (ECDSA) to the list of known hosts.#0 15#015#[email protected]'s password:]
> 2018-04-09T09:55:27.719778+08:00 user.info pld0102 AutoRecoverReloadFail.py: isScmAccessible 100.100.1.16 : succeed, return [Warning: Permanently added '100.100.1.16' (ECDSA) to the list of known hosts.#0 15#015#[email protected]'s password:]
> 2018-04-09T09:55:27.820637+08:00 user.info pld0102 AutoRecoverReloadFail.py: sleep for 18 seconds(failure 0, loop count 4)
>
> 3. However, about 30 seconds later TIPC on 'pld0102' also ran into problems:
> it lost contact with all other nodes and did not recover until the next reboot
> (which happened 10 minutes later).
> 2018-04-09T09:55:42.428828+08:00 kern.warning pld0102 kernel: tipc: Resetting link <1.1.2:bond0-1.1.5:bond0>, peer not responding
> 2018-04-09T09:55:42.428879+08:00 kern.info pld0102 kernel: tipc: Lost link <1.1.2:bond0-1.1.5:bond0> on network plane A
> 2018-04-09T09:55:42.428892+08:00 kern.info pld0102 kernel: tipc: Lost contact with <1.1.5>
> 2018-04-09T09:55:42.428904+08:00 kern.warning pld0102 kernel: tipc: Resetting link <1.1.2:bond0-1.1.10:bond0>, peer not responding
> 2018-04-09T09:55:42.428915+08:00 kern.info pld0102 kernel: tipc: Lost link <1.1.2:bond0-1.1.10:bond0> on network plane A
> 2018-04-09T09:55:42.428967+08:00 kern.info pld0102 kernel: tipc: Lost contact with <1.1.10>
> 2018-04-09T09:55:42.428978+08:00 kern.warning pld0102 kernel: tipc: Resetting link <1.1.2:bond0-1.1.15:eth2>, peer not responding
> 2018-04-09T09:55:42.428984+08:00 kern.info pld0102 kernel: tipc: Lost link <1.1.2:bond0-1.1.15:eth2> on network plane A
> 2018-04-09T09:55:42.428991+08:00 kern.info pld0102 kernel: tipc: Lost contact with <1.1.15>
> 2018-04-09T09:55:42.927546+08:00 kern.warning pld0102 kernel: tipc: Resetting link <1.1.2:bond0-1.1.16:eth2>, peer not responding
> 2018-04-09T09:55:42.927607+08:00 kern.info pld0102 kernel: tipc: Lost link <1.1.2:bond0-1.1.16:eth2> on network plane A
> 2018-04-09T09:55:42.927621+08:00 kern.info pld0102 kernel: tipc: Lost contact with <1.1.16>
>
> Thanks for any comments, and please let me know if other information is
> needed.
>
> Regards,
> Jianfeng
>
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most engaging
> tech sites, Slashdot.org! http://sdm.link/slashdot
> _______________________________________________
> tipc-discussion mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/tipc-discussion
