[ https://issues.apache.org/jira/browse/IGNITE-5580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16306228#comment-16306228 ]

Ryabov Dmitrii commented on IGNITE-5580:
----------------------------------------

[~agoncharuk], I used {{TcpDiscoveryNodeFailedMessage.warning(String)}} to send a message about the failure. This warning is logged by the {{IgniteUtils}} logger while the node-failed message is processed, but until now it was used in only 2 cases ({{TcpCommunicationSpi.checkClientQueueSize()}} and {{.createTcpClient()}}).

Is this form of logging good? Do we need more detailed messages?

Also, when a node fails, I log the latest events on all nodes.

> Improve node failure cause information
> --------------------------------------
>
>                 Key: IGNITE-5580
>                 URL: https://issues.apache.org/jira/browse/IGNITE-5580
>             Project: Ignite
>          Issue Type: Improvement
>          Components: general
>    Affects Versions: 1.7
>            Reporter: Alexey Goncharuk
>            Assignee: Ryabov Dmitrii
>              Labels: observability
>
> When a node fails, we do not print out any information about the root cause
> of the failure. This makes it extremely hard to investigate failure causes:
> I need to find the previous node for the failed node and check the logs on
> that previous node.
> I suggest that we add extensive information about the reason for the node
> failure and the sequence of events that led to it, e.g.:
> [time] [NODE] Sending a message to next node - failed _because_ - write
> timeout, read timeout, ...?
> [time] [NODE] Connection check - failed - why? Connection refused, handshake
> timed out, ...?
> ...
> [time] [NODE] Decided to drop the node because of the sequence above
> Maybe we do not need to print out this information always, but we do need
> it when the troubleshooting logger is enabled.
> Also, DiscoverySpi should collect a set of the latest important events and
> dump these events in case of local node segmentation. This will allow users
> to match the events in the cluster with the events on the local node and get
> to the bottom of the failure.
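
To illustrate the "collect a set of latest important events and dump them" idea from the description, here is a minimal sketch of a bounded event buffer. It is not taken from the patch and is not Ignite API: the class name {{RecentEventLog}} and its methods are hypothetical, and events are assumed to be plain strings that the caller has already formatted (for example with the [time] [NODE] prefix).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical helper (not part of Ignite): keeps only the last N events. */
final class RecentEventLog {
    private final int capacity;
    private final Deque<String> events = new ArrayDeque<>();

    RecentEventLog(int capacity) {
        if (capacity <= 0)
            throw new IllegalArgumentException("capacity must be positive");
        this.capacity = capacity;
    }

    /** Records one event, evicting the oldest one when the buffer is full. */
    synchronized void record(String event) {
        if (events.size() == capacity)
            events.removeFirst();
        events.addLast(event);
    }

    /** Returns a snapshot of the retained events, oldest first. */
    synchronized List<String> dump() {
        return new ArrayList<>(events);
    }
}
```

Under these assumptions, the discovery code would call {{record(...)}} at each important step (message send failure, connection check failure, decision to drop a node) and write the result of {{dump()}} to the log when a node fails or the local node is segmented. A fixed capacity keeps memory use bounded no matter how long the node runs.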