Ok. Is this reproducible? We'll probably need to see all the artifacts (logs / stats / thread dumps) to figure out what is going on.
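(For the stats artifact: statistics sampling is typically enabled via gemfire.properties. A minimal sketch, where the archive file name is just an example:

statistic-sampling-enabled=true
statistic-archive-file=myStats.gfs

With sampling enabled, the member writes a .gfs archive that can be inspected alongside the logs and thread dumps.)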
Barry Oglesby
GemFire Advanced Customer Engineering (ACE)
For immediate support please contact Pivotal Support at http://support.pivotal.io/

On Tue, Dec 15, 2015 at 9:44 PM, Hovhannes Antonyan <[email protected]> wrote:

Hi Barry,

Yes, I am running the onMembers API, but as I already said, there is no Function Execution Processor thread running that function.

------------------------------
From: Barry Oglesby <[email protected]>
Sent: Wednesday, December 16, 2015 12:25 AM
To: [email protected]
Subject: Re: How to troubleshoot stuck distributed function calls

I think it depends on how the function is being invoked. Below is an example with two peers using the onMembers API. If you're invoking your function differently (e.g. onRegion), let me know. Also, if you want to send your thread dumps, I can take a look at them.

I have a test where one peer invokes a Function onMembers. If I put a sleep in the execute method, I see these threads.

The thread in the caller (in this case the main thread) is waiting for a reply in ReplyProcessor21.basicWait:

"main" prio=5 tid=0x00007fd04a008800 nid=0x1903 waiting on condition [0x0000000108567000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x000000010fff1ac0> (a java.util.concurrent.CountDownLatch$Sync)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1033)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
        at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:282)
        at com.gemstone.gemfire.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:55)
        at com.gemstone.gemfire.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:743)
        at com.gemstone.gemfire.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:819)
        at com.gemstone.gemfire.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:796)
        at com.gemstone.gemfire.internal.cache.execute.FunctionStreamingResultCollector.getResult(FunctionStreamingResultCollector.java:142)
        at TestPeer.executeFunctionOnMembers(TestPeer.java:45)
        at TestPeer.main(TestPeer.java:28)

The thread in the member processing the function (a Function Execution Processor thread) is in the Function.execute method here:

"Function Execution Processor1" daemon prio=5 tid=0x00007fa694cb3000 nid=0xc403 waiting on condition [0x000000015f8c6000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at TestFunction.execute(TestFunction.java:13)
        at com.gemstone.gemfire.internal.cache.MemberFunctionStreamingMessage.process(MemberFunctionStreamingMessage.java:185)
        at com.gemstone.gemfire.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:386)
        at com.gemstone.gemfire.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:457)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at com.gemstone.gemfire.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:692)
        at com.gemstone.gemfire.distributed.internal.DistributionManager$9$1.run(DistributionManager.java:1149)
        at java.lang.Thread.run(Thread.java:745)
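A minimal sketch of the kind of test described above, reconstructed from the stack traces (the class names TestFunction and TestPeer come from the traces, but the bodies and the exact onMembers overload are assumptions, not the actual test source):

// TestFunction.java - sleeps in execute() so the caller blocks waiting for a reply
import com.gemstone.gemfire.cache.execute.FunctionAdapter;
import com.gemstone.gemfire.cache.execute.FunctionContext;

public class TestFunction extends FunctionAdapter {

  public void execute(FunctionContext context) {
    try {
      Thread.sleep(60000); // simulate a stuck / long-running function
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    context.getResultSender().lastResult(Boolean.TRUE);
  }

  public String getId() {
    return getClass().getSimpleName();
  }
}

// TestPeer.java - invokes the function on all members; getResult() blocks
// in ReplyProcessor21 until every member replies (the caller trace above)
import com.gemstone.gemfire.cache.Cache;
import com.gemstone.gemfire.cache.CacheFactory;
import com.gemstone.gemfire.cache.execute.FunctionService;
import com.gemstone.gemfire.cache.execute.ResultCollector;

public class TestPeer {

  public static void main(String[] args) {
    Cache cache = new CacheFactory().create();
    ResultCollector rc = FunctionService
        .onMembers(cache.getDistributedSystem())
        .execute(new TestFunction());
    // blocks here while the remote execute() sleeps
    System.out.println("result=" + rc.getResult());
    cache.close();
  }
}

While getResult() is blocked, a thread dump of the caller shows the main thread parked in ReplyProcessor21.basicWait, and a dump of the remote member shows a Function Execution Processor thread inside TestFunction.execute, exactly as above.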
Barry Oglesby
GemFire Advanced Customer Engineering (ACE)
For immediate support please contact Pivotal Support at http://support.pivotal.io/

On Tue, Dec 15, 2015 at 12:05 PM, Hovhannes Antonyan <[email protected]> wrote:

I have dumps of both nodes. Can you now point me to which threads I should look at? I do not see any function execution thread on the target node running that function.

But the caller node is still waiting for a response from that node. Should I look at the P2P threads next? Something else?

------------------------------
From: Barry Oglesby <[email protected]>
Sent: Tuesday, December 15, 2015 11:37 PM
To: [email protected]
Subject: Re: How to troubleshoot stuck distributed function calls

You'll want to take thread dumps (not heap dumps) in the members, especially the one that initiated the function call and the one that didn't send a response. Those will tell you whether the thread processing the function or the thread processing the reply is stuck and, if so, where.

Barry Oglesby
GemFire Advanced Customer Engineering (ACE)
For immediate support please contact Pivotal Support at http://support.pivotal.io/

On Tue, Dec 15, 2015 at 11:23 AM, Hovhannes Antonyan <[email protected]> wrote:

I was looking at the heap dump and identified the node which didn't send the response.

But the question now is why it didn't send it: did it run the function or not yet...?

------------------------------
From: Darrel Schneider <[email protected]>
Sent: Tuesday, December 15, 2015 9:58 PM
To: [email protected]
Subject: Re: How to troubleshoot stuck distributed function calls

Usually the member waiting for a response logs a warning that it has been waiting longer than 15 seconds for a reply from a particular member. Use that member id to identify the member that is not responding. Get a stack dump on that member and look for a thread that is processing the unresponsive message. Sometimes this member also logs that it is waiting for someone else to respond to it before it can respond to the first member.

The log message to look for is: "seconds have elapsed while waiting for replies:". It will be a warning and should be the last message logged by that thread. Sometimes it will log this warning and then get the response later, in which case it will log an info message that it did receive the reply.
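As a side note: if running jstack against the unresponsive member is not convenient, a stack dump can also be captured programmatically from inside that member's JVM with plain JDK APIs. A minimal sketch (the class name is illustrative, and something inside the member would have to trigger it, e.g. a management thread):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DumpThreads {

  public static void dump() {
    ThreadMXBean mx = ManagementFactory.getThreadMXBean();
    // true, true => include locked monitors and locked synchronizers
    for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
      // The interesting threads here are the "Function Execution Processor"
      // and P2P message reader threads.
      System.out.print(info);
    }
  }
}

Note that ThreadInfo.toString() truncates deep stacks, so jstack <pid> or kill -3 <pid> is still the better way to get complete frames.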
On Tue, Dec 15, 2015 at 12:03 AM, Hovhannes Antonyan <[email protected]> wrote:

Hello experts,

I have a multi-node environment where one of the nodes has made a broadcast call to all other nodes and got stuck.

It is still waiting for responses from all nodes, and from the heap dump I see that the ResultCollector has N-1 elements, where N is the total number of nodes. So it looks like one of the nodes didn't return a response, or it did return one but for some reason the caller has not received it.

How can I troubleshoot this issue? How can I tell exactly which node failed to return the response, and why?

Thanks in advance,

Hovhannes
