Agent deadlock possible due to blocked latch in driver thread.
--------------------------------------------------------------

                 Key: FLUME-768
                 URL: https://issues.apache.org/jira/browse/FLUME-768
             Project: Flume
          Issue Type: Bug
          Components: Node
    Affects Versions: v0.9.4
            Reporter: Jonathan Hsieh
             Fix For: v0.9.5


Three threads are essentially blocked; two of them are blocked because 
of the third.

The main problem is that the roll close is blocked waiting for a close to 
complete.  It has a subordinate thread, which appears to be gone, that normally 
triggers the latch allowing it to close.  My guess is that an exception caused 
that TriggerThread to exit, and because the latch countdowns never happened, 
the ok-to-shutdown latch never got cleared.
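The failure mode described above can be sketched in isolation (this is a minimal illustration, not Flume's actual RollSink code; the thread and latch names are hypothetical). If the trigger thread dies from an exception before calling countDown(), any thread in await() hangs forever; putting countDown() in a finally block guarantees the latch is released even on failure:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LatchLeakSketch {
    public static void main(String[] args) throws InterruptedException {
        CountDownLatch okToShutdown = new CountDownLatch(1);

        // Stand-in for the subordinate TriggerThread that dies from an exception.
        Thread trigger = new Thread(() -> {
            try {
                throw new RuntimeException("simulated TriggerThread failure");
            } finally {
                // Without this finally block, the exception would skip the
                // countDown() and the closing thread would wait forever.
                okToShutdown.countDown();
            }
        });
        trigger.setUncaughtExceptionHandler((t, e) -> { /* swallow for the demo */ });
        trigger.start();
        trigger.join();

        // With countDown() in finally, await() returns even after the failure.
        System.out.println(okToShutdown.await(1, TimeUnit.SECONDS)); // prints "true"
    }
}
```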

The other two threads are blocked because of this -- and likely wouldn't get 
stuck here if that intermediate thread weren't stuck.

The agent's avro source queue is full, and the source is blocked trying to 
enqueue more data.
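The stack trace below shows the thread parked in LinkedBlockingQueue.put(), which blocks indefinitely once the bounded queue is full. A minimal sketch of that behavior (the queue here is a hypothetical stand-in for AvroEventSource's event queue, with capacity 1 for brevity):

```java
import java.util.concurrent.LinkedBlockingQueue;

public class QueueFullSketch {
    public static void main(String[] args) throws InterruptedException {
        // Bounded queue standing in for the avro source's event queue.
        LinkedBlockingQueue<String> events = new LinkedBlockingQueue<>(1);
        events.put("event-1"); // fills the queue

        // put("event-2") would now park this thread in WAITING, which is
        // exactly where the qtp thread sits; offer() returns false instead
        // of blocking, demonstrating that the queue is full.
        boolean accepted = events.offer("event-2");
        System.out.println(accepted); // prints "false"
    }
}
```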

There is also another thread that is blocked -- the wal draining thread is 
blocked with nothing left to do (which is why everything is in the sent 
state).  This doesn't seem to be part of the problem.
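That thread is harmless because it is parked in poll() with a timeout (hence TIMED_WAITING rather than WAITING): it wakes up periodically and returns null when the queue stays empty, instead of hanging forever like a bare take() or put(). A minimal sketch, using a hypothetical empty queue in place of the WAL manager's unacked queue:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class TimedPollSketch {
    public static void main(String[] args) throws InterruptedException {
        // Empty queue standing in for the WAL's unacked-source queue.
        LinkedBlockingQueue<String> unacked = new LinkedBlockingQueue<>();

        // poll(timeout) parks the thread in TIMED_WAITING, then returns null
        // when nothing arrives -- idle and retryable, not deadlocked.
        String next = unacked.poll(100, TimeUnit.MILLISECONDS);
        System.out.println(next); // prints "null"
    }
}
```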

Thread 21 (448511246@qtp-1388647956-1):
  State: WAITING
  Blocked count: 3
  Waited count: 29
  Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@11031d18
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
    java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:306)
    com.cloudera.flume.handlers.avro.AvroEventSource.enqueue(AvroEventSource.java:114)
    com.cloudera.flume.handlers.avro.AvroEventSource$1.append(AvroEventSource.java:135)
    sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    java.lang.reflect.Method.invoke(Method.java:597)
    org.apache.avro.specific.SpecificResponder.respond(SpecificResponder.java:93)
    org.apache.avro.ipc.Responder.respond(Responder.java:136)
    org.apache.avro.ipc.Responder.respond(Responder.java:88)
    org.apache.avro.ipc.ResponderServlet.doPost(ResponderServlet.java:48)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:709)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
    org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
    org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:390)
    org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
    org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    org.mortbay.jetty.Server.handle(Server.java:326)
Here's another thread that is essentially blocked:
Thread 19 (logicalNode agent-19):
  State: WAITING
  Blocked count: 83
  Waited count: 1143043
  Waiting on java.util.concurrent.CountDownLatch$Sync@5c328896
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
    java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
    java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
    java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
    java.util.concurrent.CountDownLatch.await(CountDownLatch.java:207)
    com.cloudera.flume.handlers.rolling.RollSink.close(RollSink.java:213)
    com.cloudera.flume.agent.durability.NaiveFileWALDeco.close(NaiveFileWALDeco.java:147)
    com.cloudera.flume.agent.AgentSink.close(AgentSink.java:118)
    com.cloudera.flume.core.EventSinkDecorator.close(EventSinkDecorator.java:67)
    com.cloudera.flume.handlers.debug.LazyOpenDecorator.close(LazyOpenDecorator.java:81)
    com.cloudera.flume.core.connector.DirectDriver$PumperThread.run(DirectDriver.java:121)
Here's the wal draining thread trying to pull things out of the wal.
Thread 24 (naive file wal transmit-24):
  State: TIMED_WAITING
  Blocked count: 156
  Waited count: 171352
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:196)
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2025)
    java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:424)
    com.cloudera.flume.agent.durability.NaiveFileWALManager.getUnackedSource(NaiveFileWALManager.java:763)
    com.cloudera.flume.agent.durability.WALSource.next(WALSource.java:104)
    com.cloudera.flume.core.connector.DirectDriver$PumperThread.run(DirectDriver.java:91)
