[ https://issues.apache.org/jira/browse/NIFI-9433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452465#comment-17452465 ]
Mark Bean commented on NIFI-9433:
---------------------------------

Still chasing this, but I'll offer what I noted. In NioAsyncLoadBalanceClient partitions are added to partitionQueue, but are never removed, e.g. when a connection is removed from the flow.

> Load Balancer hangs
> -------------------
>
> Key: NIFI-9433
> URL: https://issues.apache.org/jira/browse/NIFI-9433
> Project: Apache NiFi
> Issue Type: Bug
> Components: Core Framework
> Affects Versions: 1.15.0
> Reporter: Mark Bean
> Priority: Critical
>
> Simplified scenario to demonstrate problem:
> A 2-node cluster with a simple flow. GenerateFlowFile -> load-balanced
> connection -> UpdateAttribute. And, unconnected to the first two processors,
> Funnel #1 -> non-load-balanced Connection -> Funnel #2.
> GenerateFlowFile is scheduled to run on Primary Node only. It is started.
> This causes the connection to be very busy load balancing (round robin).
> Then, the connection between the two funnels is removed.
> Immediately, an error is thrown, and the flow gets stuck in a state of
> constantly throwing errors indicating that a connection (the one just
> deleted) does not exist and cannot be balanced.
> It is unclear why this connection is being considered by the load balancer at
> all.
> The sequence of errors includes the following:
> Primary Node reports:
> 2021-12-02 12:20:03,812 ERROR [NiFi Web Server-1811]
> o.a.n.c.queue.SwappablePriorityQueue Updated Size of Queue Unacknowledged
> from FlowFile Queue Size[ ActiveQueue=[0, 0 Bytes], Swap Queue=[0, 0 Bytes],
> Swap Files=[0], Unacknowledged=[0, 0 Bytes] ] to FlowFile Queue Size[
> ActiveQueue=[0, 0 Bytes], Swap Queue=[0, 0 Bytes], Swap Files=[0],
> Unacknowledged=[-206, -20600 Bytes] ]
> java.lang.RuntimeException: Cannot create negative queue size
> 2021-12-02 12:20:03,813 ERROR [NiFi Web Server-1811]
> o.a.n.c.queue.SwappablePriorityQueue Updated Size of Queue active from
> FlowFile Queue Size[ ActiveQueue=[0, 0 Bytes], Swap Queue=[0, 0 Bytes], Swap
> Files=[0], Unacknowledged=[-206, -20600 Bytes] ] to FlowFile Queue Size[
> ActiveQueue=[206, 20600 Bytes], Swap Queue=[0, 0 Bytes], Swap Files=[0],
> Unacknowledged=[-206, -20600 Bytes] ]
> java.lang.RuntimeException: Cannot create negative queue size
> The above may be a symptom of subsequent errors in the log:
> Primary Node reports:
> 2021-12-02 12:20:03,814 ERROR [Load-Balanced Client Thread-6]
> o.a.n.c.q.c.c.a.n.NioAsyncLoadBalanceClient Failed to communicate with Peer
> <host:port>
> java.io.IOException: Failed to negotiate Protocol Version with Peer
> <host:port>. Recommended version 1 but instead of an ACCEPT or REJECT
> response got back a response of 33.
> Non-Primary Node reports:
> 2021-12-02 12:20:03,828 ERROR [Load-Balance Server Thread-4]
> o.a.n.c.q.c.s.ConnectionLoadBalanceServer Failed to communicate with
> Peer<fqdn/IP:port>
> java.io.IOException: Expected to receive Transaction Completion Indicator
> from Peer <fqdn> but instead received a value of 1
> The highly concerning part is this error, which indicates that a Connection
> which was not scheduled to load balance was attempting to receive a FlowFile.
> Non-Primary Node reports:
> 2021-12-02 12:29:05,228 ERROR [Load-Balance Server Thread-808]
> o.a.n.c.q.c.s.StandardLoadBalanceProtocol Attempted to receive FlowFiles from
> Peer <fqdn> for Connection with ID <uuid> but no connection exists with that
> ID.
> Note that the <uuid> value in this message corresponds to the Connection that
> was removed, causing the errors to begin. Should the above message ever occur?
> Does the load balancer ever consider Connections which are configured as "Do
> not load balance"?
> Users have also reported that FlowFiles have been load balanced from one
> Connection to another, unrelated Connection on the other Node. (This is still
> being verified.)
> Finally, on the UI the load-balanced connection indicates it is actively load
> balancing some number (206 in this case) of FlowFiles currently in the
> connection. However, attempts to "list queue" on this connection show no
> FlowFiles. Presumably they are being held by the load balancer and are
> inaccessible in the queue.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
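
To make the observation in the comment concrete, here is a minimal sketch of the pattern it describes: a queue that partitions are added to, plus the removal step that the comment says is missing when a connection is deleted. This is not the NiFi source. The class `PartitionQueueSketch`, the nested `Partition` type, and the methods `addPartition`, `removePartitionsFor`, and `isTracked` are all hypothetical names chosen for illustration; only the field name `partitionQueue` is taken from the comment.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative sketch only; all names except partitionQueue are hypothetical.
public class PartitionQueueSketch {

    // Stand-in for a registered partition: one (connection, target node) pair.
    static final class Partition {
        final String connectionId;
        final String nodeId;

        Partition(String connectionId, String nodeId) {
            this.connectionId = connectionId;
            this.nodeId = nodeId;
        }
    }

    private final Queue<Partition> partitionQueue = new ConcurrentLinkedQueue<>();

    // Partitions are added here as connections are set up.
    public void addPartition(String connectionId, String nodeId) {
        partitionQueue.add(new Partition(connectionId, nodeId));
    }

    // Assumed cleanup step: drop every partition that belongs to a removed
    // connection. Returns true if at least one entry was removed.
    public boolean removePartitionsFor(String connectionId) {
        return partitionQueue.removeIf(p -> p.connectionId.equals(connectionId));
    }

    // True while the queue still holds a partition for the given connection.
    public boolean isTracked(String connectionId) {
        return partitionQueue.stream().anyMatch(p -> p.connectionId.equals(connectionId));
    }

    public static void main(String[] args) {
        PartitionQueueSketch sketch = new PartitionQueueSketch();
        sketch.addPartition("conn-a", "node-1");
        sketch.addPartition("conn-b", "node-1");

        // Without this call, "conn-a" would stay in the queue after removal from the flow.
        sketch.removePartitionsFor("conn-a");

        System.out.println(sketch.isTracked("conn-a"));
        System.out.println(sketch.isTracked("conn-b"));
    }
}
```

If the removal step is skipped, a deleted connection's ID remains in the queue and could still be picked up by whatever polls it, which would be consistent with the "no connection exists with that ID" errors quoted above. Whether that is the actual cause in NiFi is still under investigation in this issue.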