[ https://issues.apache.org/jira/browse/NIFI-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580923#comment-17580923 ]
ASF subversion and git services commented on NIFI-10052:
--------------------------------------------------------

Commit 2685856c629aa1bda20019981ed932ccecf9415a in nifi's branch refs/heads/main from Hsin-Ying Lee
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=2685856c62 ]

NIFI-10052 Avoid obtaining any locks when creating/sending heartbeats (#6298)

> Avoid obtaining any locks when creating/sending heartbeats
> ----------------------------------------------------------
>
>                 Key: NIFI-10052
>                 URL: https://issues.apache.org/jira/browse/NIFI-10052
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Core Framework
>            Reporter: Mark Payne
>            Assignee: Hsin-Ying Lee
>            Priority: Major
>              Labels: cluster, heartbeat, stability
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When NiFi creates a heartbeat to send to the coordinator, it must obtain a
> few locks in order to generate that heartbeat. We should avoid obtaining any
> read locks, write locks, or synchronized monitors, especially those that may
> be held for a while. Doing so can result in NiFi getting disconnected from
> the cluster if a write lock is held for a long time.
>
> Specifically, the following locks are obtained, at minimum:
>
> * FlowController readLock in the createHeartbeatMessage() method. Due to
>   refactoring, this read lock is not necessary at all.
> * revisionManager.getRevisionUpdateCount() is synchronized. However, the
>   synchronization here is not needed, as it just returns an AtomicLong.get().
>   This is perhaps the most important lock to avoid because any update to a
>   component or group of components happens within
>   revisionManager.updateRevision, which is also synchronized. So a large
>   request like deleting thousands of components will block heartbeats from
>   being created until this completes.
> * FlowController.getTotalFlowFileCount - this may be the most challenging to
>   eliminate. It calls ProcessGroup.getConnections() and
>   ProcessGroup.getProcessGroups(), which means that it must obtain the read
>   lock of the Process Group twice for every Process Group in the flow. We may
>   be able to change StandardProcessGroup's connections and processGroups maps
>   to ConcurrentHashMaps and just introduce a getQueueSize() method on
>   ProcessGroup that can avoid having to lock so much.
> * This createHeartbeatMessage() method also appears to reference
>   FlowController's {{connectionStatus}} member variable without any locks,
>   although it is not volatile and documentation indicates that it's guarded by
>   a read/write lock. So that needs to be addressed in order to ensure that the
>   connectionStatus is always accurately reported.
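
The ideas in the list above can be illustrated with a minimal sketch. It is not the code from the commit: HeartbeatSketch, Group, ConnectionState, and the map layout are hypothetical stand-ins for FlowController, StandardProcessGroup, and the revision manager. It assumes the revision count is backed by an AtomicLong, that connectionStatus can be made volatile, and that the group's connections and child groups can live in ConcurrentHashMaps.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the node-side state read while building a heartbeat.
// None of these names are NiFi's real API; they only mirror the issue text.
class HeartbeatSketch {

    enum ConnectionState { CONNECTING, CONNECTED, DISCONNECTED }

    // Stand-in for a process group whose children are kept in concurrent maps.
    static final class Group {
        // connection id -> queued FlowFile count (illustrative representation)
        final Map<String, AtomicLong> connections = new ConcurrentHashMap<>();
        final Map<String, Group> childGroups = new ConcurrentHashMap<>();

        // Walks the tree without taking any read lock. ConcurrentHashMap
        // iteration is weakly consistent, so the total is a best-effort
        // figure rather than an atomic snapshot.
        long getQueueSize() {
            long total = 0;
            for (AtomicLong queued : connections.values()) {
                total += queued.get();
            }
            for (Group child : childGroups.values()) {
                total += child.getQueueSize();
            }
            return total;
        }
    }

    private final AtomicLong revisionUpdateCount = new AtomicLong();

    // volatile gives readers visibility of the latest write without a lock.
    private volatile ConnectionState connectionStatus = ConnectionState.CONNECTING;

    final Group rootGroup = new Group();

    // Updates may still be serialized against each other...
    synchronized void updateRevision() {
        revisionUpdateCount.incrementAndGet();
    }

    // ...but the read is a plain AtomicLong.get() and needs no monitor.
    long getRevisionUpdateCount() {
        return revisionUpdateCount.get();
    }

    void setConnectionStatus(ConnectionState status) {
        connectionStatus = status;
    }

    // Builds the heartbeat payload without acquiring any lock or monitor.
    String createHeartbeatMessage() {
        return connectionStatus
                + " revisions=" + getRevisionUpdateCount()
                + " queued=" + rootGroup.getQueueSize();
    }

    public static void main(String[] args) {
        HeartbeatSketch node = new HeartbeatSketch();
        node.rootGroup.connections.put("c1", new AtomicLong(3));
        Group child = new Group();
        child.connections.put("c2", new AtomicLong(4));
        node.rootGroup.childGroups.put("g1", child);
        node.updateRevision();
        node.setConnectionStatus(ConnectionState.CONNECTED);
        System.out.println(node.createHeartbeatMessage());
    }
}
```

The trade-off in this sketch is that a heartbeat may report a queue total that is slightly stale while components are being added or removed. In exchange, a long-running synchronized update can no longer delay the heartbeat itself.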