Usually the member that is waiting for a response logs a warning saying it has been waiting for longer than 15 seconds for a reply from a particular member. Use the member id in that warning to identify the member that is not responding. Get a stack dump on that member and look for a thread that is processing the message that has not been answered. Sometimes that member will also have logged that it is waiting for some other member to respond before it can respond to the first one.
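As a rough sketch of the log search (not an official tool), the Python below scans log lines for the warning text "seconds have elapsed while waiting for replies:" and reports where it appears. The function name and the sample lines are made up for illustration; a real log's layout and member id format will differ.

```python
# Text that identifies the "waiting for replies" warning in a member's log.
WAIT_MARKER = "seconds have elapsed while waiting for replies:"


def find_wait_warnings(lines):
    """Return (line_number, text) for every line containing the wait warning."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if WAIT_MARKER in line:
            hits.append((number, line.rstrip("\n")))
    return hits


# Hypothetical log excerpt, invented for this example only.
sample_log = [
    "[info] member-A: sending request to all members\n",
    "[warning] 15 seconds have elapsed while waiting for replies: still waiting on member-B\n",
    "[info] member-A: received reply from member-B\n",
]

for number, text in find_wait_warnings(sample_log):
    print(f"line {number}: {text}")
```

To scan a real log, pass an open file object instead of the sample list, since iterating over a file yields its lines. Run it against each member's log and follow the member id named in each hit.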
The log message to look for is: "seconds have elapsed while waiting for replies:". It will be a warning and should be the last message logged by that thread. Sometimes the thread will log this warning and then get the response later, in which case it will log an info message saying that it did receive the reply.

On Tue, Dec 15, 2015 at 12:03 AM, Hovhannes Antonyan <hanton...@vmware.com> wrote:

> Hello experts,
>
> I have a multi node environment where one of the nodes has made a
> broadcast call to all other nodes and got stuck.
>
> It is still waiting responses from all nodes and from the heapdump I see
> that ResultCollector has N-1 elements, where N is the total number of
> nodes, so it looks like one of the nodes didn't return a response, or it
> did return but for some reason the caller has not received it.
>
> How can I troubleshoot this issue, how can I know which node exactly has
> failed to return the response and why?
>
> Thanks in advance,
>
> Hovhannes