[ https://issues.apache.org/jira/browse/IGNITE-5580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vyacheslav Daradur updated IGNITE-5580: --------------------------------------- Fix Version/s: (was: 2.1) 2.2 > Improve node failure cause information > -------------------------------------- > > Key: IGNITE-5580 > URL: https://issues.apache.org/jira/browse/IGNITE-5580 > Project: Ignite > Issue Type: Improvement > Components: general > Affects Versions: 1.7 > Reporter: Alexey Goncharuk > Assignee: Vadim Opolski > Labels: observability > Fix For: 2.2 > > > When a node fails, we do not print out any information about the root cause > of this failure. This makes it extremely hard to investigate the failure > causes - I need to find a previous node for the failed node and check the > logs on the previous node. > I suggest that we add extensive information about the reason of the node > failure and the sequence of events that led to this, e.g.: > [time] [NODE] Sending a message to next node - failed _because_ - write > timeout, read timeout, ...? > [time] [NODE] Connection check - failed - why? Connection refused, handshake > timed out, ...? > ... > [time] [NODE] Decided to drop the node because of the sequence above > Maybe we do not need to print out this information always, but we do need > this when troubleshooting logger is enabled. > Also, DiscoverySpi should collect a set of latest important events and dump > these events in case of local node segmentation. This will allow users to > match the events in the cluster and events on local node and get to the > bottom of the failure. -- This message was sent by Atlassian JIRA (v6.4.14#64029)