Usually the member waiting for a response logs a warning that it has been
waiting for more than 15 seconds for a reply from a particular member. Use
the member id in that warning to identify the member that is not responding.
Get a stack dump on that member and look for a thread that is processing the
unanswered message. Sometimes that member also logs that it is waiting for
some other member to respond before it can respond to the first member.
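
If the stack dump is large, a small script can narrow it down. The following
is only a sketch: it assumes the dump was saved as text in the usual jstack
layout (one block per thread, blocks separated by blank lines, each block
starting with the quoted thread name). The function names and the keyword
you search for are my own placeholders; use whatever message or class name
appears in the warning.

```python
import sys


def split_thread_blocks(dump_text):
    """Split a jstack-style dump into blocks separated by blank lines."""
    blocks = []
    current = []
    for line in dump_text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            blocks.append("\n".join(current))
            current = []
    if current:
        blocks.append("\n".join(current))
    return blocks


def threads_mentioning(dump_text, keyword):
    """Return the thread blocks whose text contains the keyword.

    Only blocks that start with a quoted thread name are considered, so
    header lines and other non-thread sections are skipped.
    """
    return [
        block
        for block in split_thread_blocks(dump_text)
        if block.startswith('"') and keyword in block
    ]


if __name__ == "__main__" and len(sys.argv) > 2:
    # Hypothetical usage: python find_threads.py <dump file> <keyword>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as dump_file:
        for block in threads_mentioning(dump_file.read(), sys.argv[2]):
            print(block)
            print()
```

This only filters text; you still have to read the matching stacks to see
what each thread is blocked on.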

The log message to look for is: "seconds have elapsed while waiting for
replies:". It is logged at warning level and should be the last message
logged by that thread. Sometimes the member logs this warning and then
receives the response later, in which case it logs an info message saying
that it did receive the reply.
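
To find these warnings quickly across a log file, something like the
following may help. It is a minimal sketch that assumes the log is a plain
text file with each message on one line; it matches only the text quoted
above and makes no assumption about the rest of the line. The function name
and the file path argument are placeholders of mine.

```python
import sys

# Text of the warning, as quoted in this message.
WAITING_MARKER = "seconds have elapsed while waiting for replies:"


def find_reply_wait_warnings(lines):
    """Return (line number, line text) for every line containing the marker."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if WAITING_MARKER in line
    ]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python find_warnings.py <member log file>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log_file:
        for number, text in find_reply_wait_warnings(log_file):
            print(f"{number}: {text}")
```

Run it against the log of the member that is stuck, then read the member id
from the lines it prints.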


On Tue, Dec 15, 2015 at 12:03 AM, Hovhannes Antonyan <hanton...@vmware.com>
wrote:

> Hello experts,
>
>
>
> I have a multi node environment where one of the nodes has made a
> broadcast call to all other nodes and got stuck.
>
> It is still waiting for responses from all nodes, and from the heap dump I see
> that ResultCollector has N-1 elements, where N is the total number of
> nodes, so it looks like one of the nodes didn't return a response, or it
> did return but for some reason the caller has not received it.
>
> How can I troubleshoot this issue, how can I know which node exactly has
> failed to return the response and why?
>
>
>
> Thanks in advance,
>
> Hovhannes
>
