Hi,

We have hit a TIPC issue in our product. We are running an old kernel (3.10.38), so I think someone may already know this case and be able to help us.

Our product is a cluster system with two controller nodes and several payload nodes. We deploy OpenSAF to manage these nodes, and it communicates over TIPC. Several days ago we rebooted a payload node. After the reboot, the payload node had a brief network chip/driver problem that affected both TIPC and other protocols such as TCP. The network recovered immediately and the programs using other protocols such as TCP recovered quickly, but TIPC did not come back until the next reboot.

Below is the syslog for this case:

1. After the node 'pld0102' rebooted, it successfully set up TIPC connections with the other nodes.

2018-04-09T09:53:12.705735+08:00 kern.info pld0102 kernel: tipc: Established link <1.1.2:bond0-1.1.15:eth2> on network plane A
2018-04-09T09:53:12.705777+08:00 kern.info pld0102 kernel: tipc: Established link <1.1.2:bond0-1.1.5:bond0> on network plane A
2018-04-09T09:53:12.705853+08:00 kern.info pld0102 kernel: tipc: Established link <1.1.2:bond0-1.1.10:bond0> on network plane A
2018-04-09T09:53:12.706010+08:00 kern.info pld0102 kernel: tipc: Established link <1.1.2:bond0-1.1.16:eth2> on network plane A
2018-04-09T09:53:12.706022+08:00 kern.info pld0102 kernel: tipc: Established link <1.1.2:bond0-1.1.12:bond0> on network plane A

2. A few minutes after the reboot, the network chip/driver had a problem and recovered immediately. The programs using TCP/IP were also affected and then recovered.

2018-04-09T09:54:28.061865+08:00 user.info pld0102 AutoRecoverReloadFail.py: isScmAccessible 100.100.1.15 : succeed, return [Warning: Permanently added '100.100.1.15' (ECDSA) to the list of known hosts.#0 15#015#[email protected]'s password:]
2018-04-09T09:54:28.277046+08:00 user.info pld0102 AutoRecoverReloadFail.py: isScmAccessible 100.100.1.16 : succeed, return [Warning: Permanently added '100.100.1.16' (ECDSA) to the list of known hosts.#0 15#015#[email protected]'s password:]
===>>> THE PAYLOAD NODE 'pld0102' CAN ACCESS THE CONTROLLER NODE
2018-04-09T09:54:28.377690+08:00 user.info pld0102 AutoRecoverReloadFail.py: sleep for 20 seconds(failure 0, loop count 1)
2018-04-09T09:54:53.406043+08:00 user.warning pld0102 AutoRecoverReloadFail.py: isScmAccessible 100.100.1.15 : fail, return [Error: ]
===>>> THE PAYLOAD NODE 'pld0102' SUDDENLY COULD NOT ACCESS THE CONTROLLER NODE
2018-04-09T09:54:53.908054+08:00 user.info pld0102 AutoRecoverReloadFail.py: sleep for 18 seconds(failure 1, loop count 2)
2018-04-09T09:55:12.040157+08:00 user.info pld0102 AutoRecoverReloadFail.py: isScmAccessible 100.100.1.15 : succeed, return [Warning: Permanently added '100.100.1.15' (ECDSA) to the list of known hosts.#0 15#015#[email protected]'s password:]
===>>> TCP/IP RECOVERED AND 'pld0102' CAN ACCESS THE CONTROLLER NODE AGAIN
2018-04-09T09:55:12.262501+08:00 user.info pld0102 AutoRecoverReloadFail.py: isScmAccessible 100.100.1.16 : succeed, return [Warning: Permanently added '100.100.1.16' (ECDSA) to the list of known hosts.#0 15#015#[email protected]'s password:]
2018-04-09T09:55:12.363050+08:00 user.info pld0102 AutoRecoverReloadFail.py: sleep for 15 seconds(failure 0, loop count 3)
2018-04-09T09:55:27.510388+08:00 user.info pld0102 AutoRecoverReloadFail.py: isScmAccessible 100.100.1.15 : succeed, return [Warning: Permanently added '100.100.1.15' (ECDSA) to the list of known hosts.#0 15#015#[email protected]'s password:]
2018-04-09T09:55:27.719778+08:00 user.info pld0102 AutoRecoverReloadFail.py: isScmAccessible 100.100.1.16 : succeed, return [Warning: Permanently added '100.100.1.16' (ECDSA) to the list of known hosts.#0 15#015#[email protected]'s password:]
2018-04-09T09:55:27.820637+08:00 user.info pld0102 AutoRecoverReloadFail.py: sleep for 18 seconds(failure 0, loop count 4)

3. However, about 30 seconds later TIPC on 'pld0102' also ran into problems. It lost contact with all other nodes and did not recover until the next reboot, which happened about 10 minutes later.

2018-04-09T09:55:42.428828+08:00 kern.warning pld0102 kernel: tipc: Resetting link <1.1.2:bond0-1.1.5:bond0>, peer not responding
2018-04-09T09:55:42.428879+08:00 kern.info pld0102 kernel: tipc: Lost link <1.1.2:bond0-1.1.5:bond0> on network plane A
2018-04-09T09:55:42.428892+08:00 kern.info pld0102 kernel: tipc: Lost contact with <1.1.5>
2018-04-09T09:55:42.428904+08:00 kern.warning pld0102 kernel: tipc: Resetting link <1.1.2:bond0-1.1.10:bond0>, peer not responding
2018-04-09T09:55:42.428915+08:00 kern.info pld0102 kernel: tipc: Lost link <1.1.2:bond0-1.1.10:bond0> on network plane A
2018-04-09T09:55:42.428967+08:00 kern.info pld0102 kernel: tipc: Lost contact with <1.1.10>
2018-04-09T09:55:42.428978+08:00 kern.warning pld0102 kernel: tipc: Resetting link <1.1.2:bond0-1.1.15:eth2>, peer not responding
2018-04-09T09:55:42.428984+08:00 kern.info pld0102 kernel: tipc: Lost link <1.1.2:bond0-1.1.15:eth2> on network plane A
2018-04-09T09:55:42.428991+08:00 kern.info pld0102 kernel: tipc: Lost contact with <1.1.15>
2018-04-09T09:55:42.927546+08:00 kern.warning pld0102 kernel: tipc: Resetting link <1.1.2:bond0-1.1.16:eth2>, peer not responding
2018-04-09T09:55:42.927607+08:00 kern.info pld0102 kernel: tipc: Lost link <1.1.2:bond0-1.1.16:eth2> on network plane A
2018-04-09T09:55:42.927621+08:00 kern.info pld0102 kernel: tipc: Lost contact with <1.1.16>

Thanks for any comments, and please let me know if any other information is needed.

Regards,
Jianfeng
