Re: [tipc-discussion] TIPC did not recover after a short time network problem

Jon Maloy Wed, 18 Apr 2018 12:34:53 -0700

Hi Jianfeng,
This is a really hard one. The kernel is very old, and the problem does not 
sound familiar to me, as a pure TIPC maintainer.
However, we do have people working with OpenSAF even in our company, so I will 
cc your message to one of our guys, just in case it is something he recognizes.
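
In the meantime, it may help to put the TIPC link events from the syslog on a single timeline. The following is only a rough sketch, not an official TIPC tool: it assumes unwrapped syslog lines in the same format as the excerpt quoted below, and the names `parse_tipc_events` and `link_lifetimes` are made up for illustration.

```python
import re
from datetime import datetime

# Matches the kernel TIPC messages quoted in this thread, for example:
#   "<timestamp> kern.info pld0102 kernel: tipc: Lost link <1.1.2:bond0-1.1.5:bond0> ..."
# The line layout is an assumption taken from the quoted excerpt.
EVENT_RE = re.compile(
    r"^(?P<ts>\S+) \S+ (?P<host>\S+) kernel: tipc: "
    r"(?P<event>Established link|Resetting link|Lost link|Lost contact with) "
    r"<(?P<target>[^>]+)>"
)


def parse_tipc_events(lines):
    """Return (timestamp, event, target) tuples for every TIPC line found."""
    events = []
    for line in lines:
        m = EVENT_RE.match(line.strip())
        if m:
            ts = datetime.fromisoformat(m.group("ts"))  # needs Python 3.7+
            events.append((ts, m.group("event"), m.group("target")))
    return events


def link_lifetimes(events):
    """Seconds between 'Established link' and the following 'Lost link', per link."""
    established = {}
    lifetimes = {}
    for ts, event, target in events:
        if event == "Established link":
            established[target] = ts
        elif event == "Lost link" and target in established:
            lifetimes[target] = (ts - established.pop(target)).total_seconds()
    return lifetimes


# Sample lines copied (and unwrapped) from the quoted syslog.
SAMPLE = [
    "2018-04-09T09:53:12.705777+08:00 kern.info pld0102 kernel: tipc: "
    "Established link <1.1.2:bond0-1.1.5:bond0> on network plane A",
    "2018-04-09T09:55:42.428828+08:00 kern.warning pld0102 kernel: tipc: "
    "Resetting link <1.1.2:bond0-1.1.5:bond0>, peer not responding",
    "2018-04-09T09:55:42.428879+08:00 kern.info pld0102 kernel: tipc: "
    "Lost link <1.1.2:bond0-1.1.5:bond0> on network plane A",
    "2018-04-09T09:55:42.428892+08:00 kern.info pld0102 kernel: tipc: "
    "Lost contact with <1.1.5>",
]

if __name__ == "__main__":
    for link, seconds in link_lifetimes(parse_tipc_events(SAMPLE)).items():
        print(f"{link}: up for {seconds:.1f} s")
```

Running the same functions over the full syslog of each node would show whether all links were reset at the same moment, which could then be compared with the time of the network chip/driver problem.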


BR
///jon


> -----Original Message-----
> From: Jianfeng Dong [mailto:[email protected]]
> Sent: Wednesday, April 18, 2018 03:23
> To: [email protected]
> Subject: [tipc-discussion] TIPC did not recover after a short time network
> problem
> 
> Hi,
> 
> We have hit a TIPC issue in our product. We are using an old kernel (3.10.38),
> so I think someone may already know this case and can help us with it.
> 
> Our product is a cluster system with two controller nodes and several payload
> nodes. We deploy OpenSAF in our system to manage these nodes, via TIPC.
> 
> Several days ago we rebooted a payload node in the system. After the
> reboot, the payload node had a brief network chip/driver problem, and both
> TIPC and other protocols (such as TCP) were impacted.
> The network recovered immediately, and the programs based on other
> protocols such as TCP also recovered quickly, but TIPC did not come back
> until the next reboot.
> 
> Below is the syslog for this case:
> 
> 1. After the node 'pld0102' rebooted, it successfully set up TIPC connections
> with the other nodes.
> 2018-04-09T09:53:12.705735+08:00 kern.info pld0102 kernel: tipc: Established
> link <1.1.2:bond0-1.1.15:eth2> on network plane A
> 2018-04-09T09:53:12.705777+08:00 kern.info pld0102 kernel: tipc: Established
> link <1.1.2:bond0-1.1.5:bond0> on network plane A
> 2018-04-09T09:53:12.705853+08:00 kern.info pld0102 kernel: tipc: Established
> link <1.1.2:bond0-1.1.10:bond0> on network plane A
> 2018-04-09T09:53:12.706010+08:00 kern.info pld0102 kernel: tipc: Established
> link <1.1.2:bond0-1.1.16:eth2> on network plane A
> 2018-04-09T09:53:12.706022+08:00 kern.info pld0102 kernel: tipc: Established
> link <1.1.2:bond0-1.1.12:bond0> on network plane A
> 
> 2. Several minutes after the reboot, the network chip/driver had a
> problem and recovered immediately. The programs based on TCP/IP
> were also impacted and then recovered.
> 2018-04-09T09:54:28.061865+08:00 user.info pld0102
> AutoRecoverReloadFail.py: isScmAccessible 100.100.1.15 : succeed, return
> [Warning: Permanently added '100.100.1.15' (ECDSA) to the list of known
> hosts.#0                              15#015#[email protected]'s 
> password:]
> 2018-04-09T09:54:28.277046+08:00 user.info pld0102
> AutoRecoverReloadFail.py: isScmAccessible 100.100.1.16 : succeed, return
> [Warning: Permanently added '100.100.1.16' (ECDSA) to the list of known
> hosts.#0                              15#015#[email protected]'s 
> password:]
> ===>>> THE PAYLOAD NODE 'pld0102' CAN ACCESS THE CONTROLLER NODE
> 2018-04-09T09:54:28.377690+08:00 user.info pld0102
> AutoRecoverReloadFail.py: sleep for 20 seconds(failure 0, loop count 1)
> 2018-04-09T09:54:53.406043+08:00 user.warning pld0102
> AutoRecoverReloadFail.py: isScmAccessible 100.100.1.15 : fail, return [Error: 
> ]
> ===>>> THE PAYLOAD NODE 'pld0102' SUDDENLY COULD NOT ACCESS THE
> CONTROLLER NODE.
> 2018-04-09T09:54:53.908054+08:00 user.info pld0102
> AutoRecoverReloadFail.py: sleep for 18 seconds(failure 1, loop count 2)
> 2018-04-09T09:55:12.040157+08:00 user.info pld0102
> AutoRecoverReloadFail.py: isScmAccessible 100.100.1.15 : succeed, return
> [Warning: Permanently added '100.100.1.15' (ECDSA) to the list of known
> hosts.#0                              15#015#[email protected]'s 
> password:]
> ===>>> TCP/IP PROTOCOL RECOVERED AND THEN 'pld0102' CAN CONTINUE
> TO ACCESS THE CONTROLLER NODE
> 2018-04-09T09:55:12.262501+08:00 user.info pld0102
> AutoRecoverReloadFail.py: isScmAccessible 100.100.1.16 : succeed, return
> [Warning: Permanently added '100.100.1.16' (ECDSA) to the list of known
> hosts.#0                              15#015#[email protected]'s 
> password:]
> 2018-04-09T09:55:12.363050+08:00 user.info pld0102
> AutoRecoverReloadFail.py: sleep for 15 seconds(failure 0, loop count 3)
> 2018-04-09T09:55:27.510388+08:00 user.info pld0102
> AutoRecoverReloadFail.py: isScmAccessible 100.100.1.15 : succeed, return
> [Warning: Permanently added '100.100.1.15' (ECDSA) to the list of known
> hosts.#0                              15#015#[email protected]'s 
> password:]
> 2018-04-09T09:55:27.719778+08:00 user.info pld0102
> AutoRecoverReloadFail.py: isScmAccessible 100.100.1.16 : succeed, return
> [Warning: Permanently added '100.100.1.16' (ECDSA) to the list of known
> hosts.#0                              15#015#[email protected]'s 
> password:]
> 2018-04-09T09:55:27.820637+08:00 user.info pld0102
> AutoRecoverReloadFail.py: sleep for 18 seconds(failure 0, loop count 4)
> 
> 3. However, TIPC on 'pld0102' also ran into problems about 30 seconds later:
> it lost contact with all other nodes and did not recover until the next
> reboot (which happened about 10 minutes later).
> 2018-04-09T09:55:42.428828+08:00 kern.warning pld0102 kernel: tipc:
> Resetting link <1.1.2:bond0-1.1.5:bond0>, peer not responding
> 2018-04-09T09:55:42.428879+08:00 kern.info pld0102 kernel: tipc: Lost link
> <1.1.2:bond0-1.1.5:bond0> on network plane A
> 2018-04-09T09:55:42.428892+08:00 kern.info pld0102 kernel: tipc: Lost contact
> with <1.1.5>
> 2018-04-09T09:55:42.428904+08:00 kern.warning pld0102 kernel: tipc:
> Resetting link <1.1.2:bond0-1.1.10:bond0>, peer not responding
> 2018-04-09T09:55:42.428915+08:00 kern.info pld0102 kernel: tipc: Lost link
> <1.1.2:bond0-1.1.10:bond0> on network plane A
> 2018-04-09T09:55:42.428967+08:00 kern.info pld0102 kernel: tipc: Lost contact
> with <1.1.10>
> 2018-04-09T09:55:42.428978+08:00 kern.warning pld0102 kernel: tipc:
> Resetting link <1.1.2:bond0-1.1.15:eth2>, peer not responding
> 2018-04-09T09:55:42.428984+08:00 kern.info pld0102 kernel: tipc: Lost link
> <1.1.2:bond0-1.1.15:eth2> on network plane A
> 2018-04-09T09:55:42.428991+08:00 kern.info pld0102 kernel: tipc: Lost contact
> with <1.1.15>
> 2018-04-09T09:55:42.927546+08:00 kern.warning pld0102 kernel: tipc:
> Resetting link <1.1.2:bond0-1.1.16:eth2>, peer not responding
> 2018-04-09T09:55:42.927607+08:00 kern.info pld0102 kernel: tipc: Lost link
> <1.1.2:bond0-1.1.16:eth2> on network plane A
> 2018-04-09T09:55:42.927621+08:00 kern.info pld0102 kernel: tipc: Lost contact
> with <1.1.16>
> 
> 
> Thanks for any comments, and please let me know if any other information is
> needed.
> 
> 
> Regards,
> Jianfeng
> 
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most engaging
> tech sites, Slashdot.org! http://sdm.link/slashdot
> _______________________________________________
> tipc-discussion mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/tipc-discussion
