Ok. Is this reproducible? We'll probably need to see all the artifacts (logs / stats / thread dumps) to figure out what is going on.
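(For the stats artifact: statistics sampling is typically enabled via gemfire.properties. A minimal sketch, where the archive file name is just an example:

statistic-sampling-enabled=true
statistic-archive-file=myStats.gfs

With sampling enabled, the member writes a .gfs archive that can be inspected alongside the logs and thread dumps.)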
Barry Oglesby
GemFire Advanced Customer Engineering (ACE)
For immediate support please contact Pivotal Support at http://support.pivotal.io/

On Tue, Dec 15, 2015 at 9:44 PM, Hovhannes Antonyan <[email protected]> wrote:

Hi Barry,

Yes, I am running the onMembers API, but as I already said, there is no Function Execution Processor thread running that function.

------------------------------
From: Barry Oglesby <[email protected]>
Sent: Wednesday, December 16, 2015 12:25 AM
To: [email protected]
Subject: Re: How to troubleshoot stuck distributed function calls

I think it depends on how the function is being invoked. Below is an example with two peers using the onMembers API. If you're invoking your function differently (e.g. onRegion), let me know. Also, if you want to send your thread dumps, I can take a look at them.

I have a test where one peer invokes a Function onMembers. If I put a sleep in the execute method, I see these threads.

The thread in the caller (in this case the main thread) is waiting for a reply in ReplyProcessor21.basicWait:

"main" prio=5 tid=0x00007fd04a008800 nid=0x1903 waiting on condition [0x0000000108567000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x000000010fff1ac0> (a java.util.concurrent.CountDownLatch$Sync)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1033)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
        at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:282)
        at com.gemstone.gemfire.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:55)
        at com.gemstone.gemfire.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:743)
        at com.gemstone.gemfire.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:819)
        at com.gemstone.gemfire.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:796)
        at com.gemstone.gemfire.internal.cache.execute.FunctionStreamingResultCollector.getResult(FunctionStreamingResultCollector.java:142)
        at TestPeer.executeFunctionOnMembers(TestPeer.java:45)
        at TestPeer.main(TestPeer.java:28)

The thread in the member processing the function (a Function Execution Processor thread) is in the Function.execute method here:

"Function Execution Processor1" daemon prio=5 tid=0x00007fa694cb3000 nid=0xc403 waiting on condition [0x000000015f8c6000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at TestFunction.execute(TestFunction.java:13)
        at com.gemstone.gemfire.internal.cache.MemberFunctionStreamingMessage.process(MemberFunctionStreamingMessage.java:185)
        at com.gemstone.gemfire.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:386)
        at com.gemstone.gemfire.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:457)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at com.gemstone.gemfire.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:692)
        at com.gemstone.gemfire.distributed.internal.DistributionManager$9$1.run(DistributionManager.java:1149)
        at java.lang.Thread.run(Thread.java:745)
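A minimal sketch of the kind of test described above, reconstructed from the stack traces (the class names TestFunction and TestPeer come from the traces, but the bodies and the exact onMembers overload are assumptions, not the actual test source):

// TestFunction.java - sleeps in execute() so the caller blocks waiting for a reply
import com.gemstone.gemfire.cache.execute.FunctionAdapter;
import com.gemstone.gemfire.cache.execute.FunctionContext;

public class TestFunction extends FunctionAdapter {

  public void execute(FunctionContext context) {
    try {
      Thread.sleep(60000); // simulate a stuck / long-running function
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    context.getResultSender().lastResult(Boolean.TRUE);
  }

  public String getId() {
    return getClass().getSimpleName();
  }
}

// TestPeer.java - invokes the function on all members; getResult() blocks
// in ReplyProcessor21 until every member replies (the caller trace above)
import com.gemstone.gemfire.cache.Cache;
import com.gemstone.gemfire.cache.CacheFactory;
import com.gemstone.gemfire.cache.execute.FunctionService;
import com.gemstone.gemfire.cache.execute.ResultCollector;

public class TestPeer {

  public static void main(String[] args) {
    Cache cache = new CacheFactory().create();
    ResultCollector rc = FunctionService
        .onMembers(cache.getDistributedSystem())
        .execute(new TestFunction());
    // blocks here while the remote execute() sleeps
    System.out.println("result=" + rc.getResult());
    cache.close();
  }
}

While getResult() is blocked, a thread dump of the caller shows the main thread parked in ReplyProcessor21.basicWait, and a dump of the remote member shows a Function Execution Processor thread inside TestFunction.execute, exactly as above.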
Barry Oglesby
GemFire Advanced Customer Engineering (ACE)
For immediate support please contact Pivotal Support at http://support.pivotal.io/

On Tue, Dec 15, 2015 at 12:05 PM, Hovhannes Antonyan <[email protected]> wrote:

I have dumps of both nodes. Can you now point me to which threads I should look at? I do not see any function execution thread on the target node running that function.

But the caller node is still waiting for a response from that node. Should I look at the P2P threads next? Something else?

------------------------------
From: Barry Oglesby <[email protected]>
Sent: Tuesday, December 15, 2015 11:37 PM
To: [email protected]
Subject: Re: How to troubleshoot stuck distributed function calls

You'll want to take thread dumps (not heap dumps) in the members, especially the one that initiated the function call and the one that didn't send a response. Those will tell you whether the thread processing the function or the thread processing the reply is stuck and, if so, where.

Barry Oglesby
GemFire Advanced Customer Engineering (ACE)
For immediate support please contact Pivotal Support at http://support.pivotal.io/

On Tue, Dec 15, 2015 at 11:23 AM, Hovhannes Antonyan <[email protected]> wrote:

I was looking at the heap dump and identified the node which didn't send the response.

But the question now is why it didn't send it: did it run the function or not yet...?

------------------------------
From: Darrel Schneider <[email protected]>
Sent: Tuesday, December 15, 2015 9:58 PM
To: [email protected]
Subject: Re: How to troubleshoot stuck distributed function calls

Usually the member waiting for a response logs a warning that it has been waiting longer than 15 seconds for a reply from a particular member. Use that member id to identify the member that is not responding. Get a stack dump on that member and look for a thread that is processing the unresponsive message. Sometimes this member also logs that it is waiting for someone else to respond to it before it can respond to the first member.

The log message to look for is: "seconds have elapsed while waiting for replies:". It will be a warning and should be the last message logged by that thread. Sometimes it will log this warning and then get the response later, in which case it will log an info message that it did receive the reply.
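As a side note: if running jstack against the unresponsive member is not convenient, a stack dump can also be captured programmatically from inside that member's JVM with plain JDK APIs. A minimal sketch (the class name is illustrative, and something inside the member would have to trigger it, e.g. a management thread):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DumpThreads {

  public static void dump() {
    ThreadMXBean mx = ManagementFactory.getThreadMXBean();
    // true, true => include locked monitors and locked synchronizers
    for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
      // The interesting threads here are the "Function Execution Processor"
      // and P2P message reader threads.
      System.out.print(info);
    }
  }
}

Note that ThreadInfo.toString() truncates deep stacks, so jstack <pid> or kill -3 <pid> is still the better way to get complete frames.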
On Tue, Dec 15, 2015 at 12:03 AM, Hovhannes Antonyan <[email protected]> wrote:

Hello experts,

I have a multi-node environment where one of the nodes has made a broadcast call to all other nodes and got stuck.

It is still waiting for responses from all nodes, and from the heap dump I see that the ResultCollector has N-1 elements, where N is the total number of nodes. So it looks like one of the nodes didn't return a response, or it did return one but for some reason the caller has not received it.

How can I troubleshoot this issue? How can I tell exactly which node failed to return the response, and why?

Thanks in advance,

Hovhannes
