[
https://issues.apache.org/jira/browse/NIFI-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15941078#comment-15941078
]
Mark Payne commented on NIFI-3257:
----------------------------------
Using the -XX:+PrintGC JVM option, YourKit to profile garbage collection, and
the significant DEBUG logging that I've added, I am seeing that the problem is
largely due to excessive GC runs: up to 25% of my JVM time is spent performing
garbage collection. I've marked this ticket as related to NIFI-3636 and
NIFI-3648 because those tickets are intended to address the heavy garbage
collection.
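For anyone who wants a rough cross-check of the GC overhead figure without a
profiler, the standard java.lang.management beans expose accumulated collection
time. The following is only a minimal sketch, not the measurement used for this
ticket: the class and method names (GcOverhead, gcFraction, totalGcMillis) are
made up for illustration, and getCollectionTime() is an approximate elapsed
time that can include concurrent phases, so the result will not match YourKit
or -XX:+PrintGC output exactly.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of NiFi: estimates the share of JVM uptime
// that the collectors report as collection time.
class GcOverhead {

    /** Fraction of uptime spent in GC; both arguments are in milliseconds. */
    public static double gcFraction(long gcMillis, long uptimeMillis) {
        if (uptimeMillis <= 0) {
            return 0.0; // avoid dividing by zero right after JVM start
        }
        return (double) gcMillis / uptimeMillis;
    }

    /** Sums the approximate accumulated collection time over all collectors. */
    public static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        double fraction = gcFraction(totalGcMillis(), uptime);
        System.out.printf("Approximate GC overhead: %.1f%%%n", 100 * fraction);
    }
}
```

Run inside the JVM of interest (for example from a periodic logging task), a
value approaching 0.25 would correspond to the 25% figure above.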
I've also found that while we use multiple threads to replicate REST API calls,
we do not read or merge node responses in parallel. This is done serially after
all "Response" objects have been obtained. This is very inefficient and can
result in very long request replication times. It could even result in one slow
node causing other nodes' responses to time out: if Node 1 is slow to respond
(due to GC or whatever), then the responses from Nodes 4, 5, and 6, for
instance, could time out, and those nodes could be kicked out of the cluster
because Node 1 was slow.
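To illustrate the direction this suggests, here is a minimal sketch of reading
each node's response concurrently, so that a slow node delays only its own
entry. It is not NiFi's replication code: ParallelResponseReader, readAll, and
slowRead are hypothetical names, and each Supplier simply stands in for
"read and parse one node's response body".

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

// Hypothetical sketch, not NiFi code: each Supplier models reading one
// node's response body.
class ParallelResponseReader {

    /**
     * Starts every read on the executor before waiting on any of them, then
     * collects the bodies keyed by node id. The total wait is roughly the
     * slowest single read rather than the sum of all reads.
     */
    public static Map<String, String> readAll(Map<String, Supplier<String>> readers,
                                              ExecutorService executor) {
        Map<String, CompletableFuture<String>> pending = new LinkedHashMap<>();
        for (Map.Entry<String, Supplier<String>> entry : readers.entrySet()) {
            pending.put(entry.getKey(),
                    CompletableFuture.supplyAsync(entry.getValue(), executor));
        }

        Map<String, String> bodies = new LinkedHashMap<>();
        for (Map.Entry<String, CompletableFuture<String>> entry : pending.entrySet()) {
            // join() blocks only until this node's read finishes; the other
            // reads keep running on the executor in the meantime.
            bodies.put(entry.getKey(), entry.getValue().join());
        }
        return bodies;
    }

    /** Simulates a node whose response takes the given time to read. */
    static String slowRead(String nodeId, long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return nodeId + ":ok";
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        try {
            Map<String, Supplier<String>> readers = new LinkedHashMap<>();
            readers.put("node1", () -> slowRead("node1", 300)); // the slow node
            readers.put("node4", () -> slowRead("node4", 10));
            readers.put("node5", () -> slowRead("node5", 10));
            System.out.println(readAll(readers, executor));
        } finally {
            executor.shutdown();
        }
    }
}
```

A real change would also need per-node timeouts and error handling, which this
sketch leaves out.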
> Cluster stability issues during high throughput
> -----------------------------------------------
>
> Key: NIFI-3257
> URL: https://issues.apache.org/jira/browse/NIFI-3257
> Project: Apache NiFi
> Issue Type: Bug
> Components: Core Framework
> Affects Versions: 1.0.0, 1.1.0, 1.1.1, 1.0.1
> Reporter: Jeff Storck
>
> During high throughput of data in a cluster (135MB/s), nodes experience
> frequent disconnects (every few minutes) and role switching (Primary and
> Cluster Coordinator). This makes API requests difficult since the requests
> cannot be replicated to all nodes while reconnecting. The cluster can
> recover for a time (as mentioned above, for a few minutes) before going
> through another round of disconnects and role switching.
> The cluster is able to continue to process data during these connection and
> role-switching issues.