[
https://issues.apache.org/jira/browse/FLUME-768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Prasad Mujumdar updated FLUME-768:
----------------------------------
Attachment: Flume-768.patch
Attaching the patch with the approved review changes.
> Agent deadlock possible due to blocked latch in driver thread.
> --------------------------------------------------------------
>
> Key: FLUME-768
> URL: https://issues.apache.org/jira/browse/FLUME-768
> Project: Flume
> Issue Type: Bug
> Components: Node
> Affects Versions: v0.9.4
> Reporter: Jonathan Hsieh
> Assignee: Prasad Mujumdar
> Fix For: v0.9.5
>
> Attachments: Flume-768.patch, Flume-768.patch
>
>
> There are three threads essentially blocked; two of them are blocked
> because of the third.
> The main problem is that the roll close is blocked waiting for a close to
> complete. It has a subordinate thread, now apparently gone, that normally
> triggers the latch that allows it to close. My guess is that an exception
> caused that TriggerThread to exit, and because the latch countdowns aren't
> present, the ok-to-shutdown latch never got cleared.
> The other two threads are blocked because of this, and likely wouldn't get
> stuck here if that intermediate thread weren't stuck.
> The agent's avro source queue is full, and it is blocked trying to enqueue
> more data.
> There is one more blocked thread: the WAL draining thread is blocked with
> nothing left to do (which is why everything is in the SENT state). This
> doesn't seem to be part of the problem.
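The missed-countdown hypothesis above can be sketched in plain Java. This is an illustrative reproduction, not Flume's actual code: the class, method, and variable names are invented. If the trigger thread dies with an exception before calling countDown(), any thread in await() parks forever; moving countDown() into a finally block releases waiters even on abnormal exit.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LatchLeakSketch {
    // Returns true if the waiter's await() completes within the timeout.
    // countDownInFinally=false models the suspected bug: the trigger thread
    // exits via an exception without ever counting the latch down.
    static boolean closeCompletes(boolean countDownInFinally) throws InterruptedException {
        final CountDownLatch okToShutdown = new CountDownLatch(1);
        Thread trigger = new Thread(() -> {
            try {
                // stand-in for the rolling work that is assumed to have failed
                throw new RuntimeException("roll trigger failed");
            } catch (RuntimeException e) {
                // exception swallowed; without the finally below, the latch
                // is never counted down and close() hangs forever
            } finally {
                if (countDownInFinally) {
                    okToShutdown.countDown(); // always release waiters
                }
            }
        });
        trigger.start();
        trigger.join();
        // a bounded await stands in for the indefinite one in the deadlock
        return okToShutdown.await(100, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("with finally:    " + closeCompletes(true));  // true
        System.out.println("without finally: " + closeCompletes(false)); // false
    }
}
```

The general pattern is the same regardless of the surrounding code: any countDown() that waiters depend on belongs in a finally block so an unexpected exception cannot strand them.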
> Thread 21 (448511246@qtp-1388647956-1):
> State: WAITING
> Blocked count: 3
> Waited count: 29
> Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@11031d18
> Stack:
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
> java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:306)
> com.cloudera.flume.handlers.avro.AvroEventSource.enqueue(AvroEventSource.java:114)
> com.cloudera.flume.handlers.avro.AvroEventSource$1.append(AvroEventSource.java:135)
> sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> java.lang.reflect.Method.invoke(Method.java:597)
> org.apache.avro.specific.SpecificResponder.respond(SpecificResponder.java:93)
> org.apache.avro.ipc.Responder.respond(Responder.java:136)
> org.apache.avro.ipc.Responder.respond(Responder.java:88)
> org.apache.avro.ipc.ResponderServlet.doPost(ResponderServlet.java:48)
> javax.servlet.http.HttpServlet.service(HttpServlet.java:709)
> javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
> org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:390)
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
> org.mortbay.jetty.Server.handle(Server.java:326)
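For reference, Thread 21 above is parked inside LinkedBlockingQueue.put because the bounded source queue is full. A minimal sketch of that behavior (the capacity and element values are arbitrary, and this is not Flume's code) uses offer() with a timeout so the demonstration itself cannot deadlock, while put() on the same full queue would park indefinitely exactly as in the trace:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class FullQueueSketch {
    public static void main(String[] args) throws InterruptedException {
        // bounded queue standing in for the avro source's event queue
        LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>(2);
        queue.put("event-1");
        queue.put("event-2"); // the queue is now at capacity

        // put("event-3") here would block the caller until a consumer drains
        // the queue -- the state Thread 21 is stuck in; offer() with a
        // timeout gives up and reports failure instead of parking forever
        boolean accepted = queue.offer("event-3", 50, TimeUnit.MILLISECONDS);
        System.out.println("accepted: " + accepted); // prints "accepted: false"
    }
}
```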
> Here's another thread that is essentially blocked:
> Thread 19 (logicalNode agent-19):
> State: WAITING
> Blocked count: 83
> Waited count: 1143043
> Waiting on java.util.concurrent.CountDownLatch$Sync@5c328896
> Stack:
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
> java.util.concurrent.CountDownLatch.await(CountDownLatch.java:207)
> com.cloudera.flume.handlers.rolling.RollSink.close(RollSink.java:213)
> com.cloudera.flume.agent.durability.NaiveFileWALDeco.close(NaiveFileWALDeco.java:147)
> com.cloudera.flume.agent.AgentSink.close(AgentSink.java:118)
> com.cloudera.flume.core.EventSinkDecorator.close(EventSinkDecorator.java:67)
> com.cloudera.flume.handlers.debug.LazyOpenDecorator.close(LazyOpenDecorator.java:81)
> com.cloudera.flume.core.connector.DirectDriver$PumperThread.run(DirectDriver.java:121)
> Here's the WAL draining thread trying to pull things out of the WAL.
> Thread 24 (naive file wal transmit-24):
> State: TIMED_WAITING
> Blocked count: 156
> Waited count: 171352
> Stack:
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:196)
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2025)
> java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:424)
> com.cloudera.flume.agent.durability.NaiveFileWALManager.getUnackedSource(NaiveFileWALManager.java:763)
> com.cloudera.flume.agent.durability.WALSource.next(WALSource.java:104)
> com.cloudera.flume.core.connector.DirectDriver$PumperThread.run(DirectDriver.java:91)