[
https://issues.apache.org/jira/browse/NIFI-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15071738#comment-15071738
]
Tony Kurc commented on NIFI-1333:
---------------------------------
[~ozhurakousky] - I just want to restate what you're describing, as I believe
I've reached the same conclusion as you. The behavior you're describing is that
the shutdown sequence begins, hits this line [1], and there are still tasks
running. These tasks will not terminate if they call methods that wait for that
lock, which is held by a thread waiting for them to shut down (a.k.a. party
foul).
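To make the restated behavior concrete, here is a minimal sketch of the same
shape of problem. It is not NiFi code: the class ShutdownDeadlockSketch, the
method shutdownTerminates, and the 500 ms timeout are all made up for
illustration. A worker task needs the read lock while the shutting-down thread
holds the write lock and waits for the worker to finish.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for the controller; names are assumptions, not NiFi's.
class ShutdownDeadlockSketch {

    // Returns true if the pool terminated within the timeout.
    static boolean shutdownTerminates(boolean holdWriteLock) {
        ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch proceed = new CountDownLatch(1);

        // Stand-in for a framework task that calls a read-locked getter.
        pool.execute(() -> {
            try {
                proceed.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            rwLock.readLock().lock(); // blocks while a writer holds the lock
            try {
                // read some controller state here
            } finally {
                rwLock.readLock().unlock();
            }
        });

        if (holdWriteLock) {
            rwLock.writeLock().lock();
        }
        try {
            pool.shutdown();     // already-submitted tasks still run
            proceed.countDown(); // the task now tries to take the read lock
            return pool.awaitTermination(500, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            if (holdWriteLock) {
                rwLock.writeLock().unlock(); // only now can the task finish
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("write lock held: terminated = " + shutdownTerminates(true));
        System.out.println("no write lock:   terminated = " + shutdownTerminates(false));
    }
}
```

With the write lock held, the wait can only end by timing out, because the
task cannot get past readLock().lock() until shutdown returns. Without the
lock, the task completes and the pool terminates.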
On to the PR: the substantive change is the removal of the writeLock.lock from
shutdown. There are no comments on why that lock is acquired, which is a bit
distressing (and the code looks like it was there from the initial import, so
"blame" does nothing). I'm a bit concerned that the lock, even in the buggy
state, was providing some sort of latching. [~ozhurakousky] - were you able to
unwind all the uses of that lock? This line [2] scared me a bit, in case the
lock was acting as a latch. I started to unwind them on paper, and it got messy.
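If the lock does turn out to be guarding something, one possible middle
ground (purely a sketch, not what the PR does and not NiFi code) is to keep
the write lock around the state change only and release it before waiting for
the tasks. The class NarrowedLockShutdownSketch, the shutdownRequested flag,
and the method names below are all hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical illustration; none of these names come from FlowController.
class NarrowedLockShutdownSketch {
    private final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();
    private final ExecutorService pool = Executors.newSingleThreadExecutor();
    private boolean shutdownRequested; // guarded by rwLock

    // Stand-in for a read-locked callback used by running tasks.
    boolean isShutdownRequested() {
        rwLock.readLock().lock();
        try {
            return shutdownRequested;
        } finally {
            rwLock.readLock().unlock();
        }
    }

    // Starts a task that keeps reading guarded state until shutdown is requested.
    void start() {
        pool.execute(() -> {
            while (!isShutdownRequested()) {
                Thread.yield();
            }
        });
    }

    // Returns true if the pool terminated within the timeout.
    boolean shutdown(long timeoutMillis) {
        // Mutate guarded state under the write lock...
        rwLock.writeLock().lock();
        try {
            shutdownRequested = true;
            pool.shutdown();
        } finally {
            rwLock.writeLock().unlock();
        }
        // ...but wait for the tasks only after the lock has been released,
        // so tasks that need the read lock can still make progress and exit.
        try {
            return pool.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Whether this preserves whatever latching the original lock may have provided
depends on what the other uses of the lock actually rely on, which is exactly
the open question above.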
[~joewitt] or [~markap14] - as this is old code, do you have any idea what
rationale went into that writeLock.lock?
[1]
https://github.com/apache/nifi/blob/11768cc38827023e9a04b65ba5d357bb692a1d10/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-core/src/main/java/org/apache/nifi/controller/FlowController.java#L1087
[2]
https://github.com/apache/nifi/blame/11768cc38827023e9a04b65ba5d357bb692a1d10/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-core/src/main/java/org/apache/nifi/controller/FlowController.java#L1131
> FlowController fails to shut down gracefully even though there is nothing
> going on in the flow
> ----------------------------------------------------------------------------------------------
>
> Key: NIFI-1333
> URL: https://issues.apache.org/jira/browse/NIFI-1333
> Project: Apache NiFi
> Issue Type: Bug
> Components: Core Framework
> Affects Versions: 0.4.1
> Reporter: Oleg Zhurakousky
> Assignee: Oleg Zhurakousky
> Priority: Trivial
> Fix For: 0.5.0
>
>
> Basically the following test fails:
> https://github.com/olegz/nifi/blob/int-test/nifi-integration-tests/src/test/java/org/apache/nifi/test/flowcontroll/FlowControllerTests.java#L50
> even though there is no compelling reason for it to fail based on what's in
> the flow.
> Also, the message in the logs is confusing:
> {code}
> Initiated graceful shutdown of flow controller...waiting up to 10 seconds
> 2015-12-23 15:19:11,977 WARN [main] o.apache.nifi.controller.FlowController
> Controller hasn't terminated properly. There exists an uninterruptable
> thread that will take an indeterminate amount of time to stop. Might need to
> kill the program manually.
> {code}
> What actually happens is a deadlock during shutdown.
> Below is the relevant jstack output:
> {code}
> java.lang.Thread.State: TIMED_WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x00000007aeb20988> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
> at
> java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1468)
> at
> org.apache.nifi.controller.FlowController.shutdown(FlowController.java:1124)
> at org.apache.nifi.test.s2s.SiteToSiteTests.bar(SiteToSiteTests.java:75)
> . . .
> "Framework Task Thread Thread-1" prio=5 tid=0x00007fc8a2064800 nid=0x6a03
> waiting on condition [0x0000700001ded000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x00000007aeb20288> (a
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:964)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282)
> at
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:731)
> at
> org.apache.nifi.controller.FlowController.getRootGroupId(FlowController.java:1262)
> at
> org.apache.nifi.controller.tasks.ExpireFlowFiles.run(ExpireFlowFiles.java:54)
> . . .
> "Timer-Driven Process Thread-1" prio=5 tid=0x00007fc8a3146800 nid=0x6c03
> waiting on condition [0x0000700001ef0000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x00000007aeb20288> (a
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:964)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282)
> at
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:731)
> at
> org.apache.nifi.controller.FlowController.isClustered(FlowController.java:2984)
> at
> org.apache.nifi.controller.FlowController.heartbeat(FlowController.java:3444)
> {code}
> The issue, the way I see it, is that FlowController's _shutdown_ routine is
> synchronized under the same lock as most of the FlowController callbacks made
> by other threads, hence those threads can't be shut down since they are
> deadlocked.
> I don't think there is any reason to synchronize the shutdown routine,
> since all we are trying to do is shut down the very same threads that are
> blocking. Removing synchronization resolves the issue.
> Will submit a patch in a few