[ 
https://issues.apache.org/jira/browse/NIFI-9433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452465#comment-17452465
 ] 

Mark Bean commented on NIFI-9433:
---------------------------------

Still chasing this, but here is what I have noted so far. In 
NioAsyncLoadBalanceClient, partitions are added to partitionQueue but are never 
removed, e.g. when a connection is removed from the flow.
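
To make the concern concrete, here is a minimal sketch of the cleanup step that 
appears to be missing. It is a simplified model, not the NiFi code: the class 
PartitionRegistry, the Partition type, and the register/unregister methods are 
all hypothetical names chosen for illustration; only the partitionQueue field 
name is borrowed from the observation above.

```java
import java.util.Iterator;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical, simplified model of a load-balance client's partition queue.
// This is NOT the NiFi implementation; every name here is illustrative.
final class PartitionRegistry {

    // Stand-in for a registered partition: one (connection, peer node) pair.
    static final class Partition {
        final String connectionId;
        final String nodeId;

        Partition(String connectionId, String nodeId) {
            this.connectionId = connectionId;
            this.nodeId = nodeId;
        }
    }

    // Partitions that client threads would poll when looking for work.
    private final Queue<Partition> partitionQueue = new ConcurrentLinkedQueue<>();

    // Adds a partition to the rotation.
    void register(String connectionId, String nodeId) {
        partitionQueue.add(new Partition(connectionId, nodeId));
    }

    // Removes every partition that belongs to the given connection and
    // returns how many entries were dropped. Without a step like this,
    // entries for a deleted connection stay in the queue indefinitely.
    int unregister(String connectionId) {
        int removed = 0;
        for (Iterator<Partition> it = partitionQueue.iterator(); it.hasNext();) {
            if (it.next().connectionId.equals(connectionId)) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    // True if any partition for the given connection is still queued.
    boolean contains(String connectionId) {
        for (Partition partition : partitionQueue) {
            if (partition.connectionId.equals(connectionId)) {
                return true;
            }
        }
        return false;
    }

    int size() {
        return partitionQueue.size();
    }
}
```

In this model, calling unregister when a connection is deleted leaves only the 
partitions of connections that still exist in the flow. Whether the real client 
should remove entries eagerly like this, or skip stale entries when polling, is 
an open design question.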

> Load Balancer hangs
> -------------------
>
>                 Key: NIFI-9433
>                 URL: https://issues.apache.org/jira/browse/NIFI-9433
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>    Affects Versions: 1.15.0
>            Reporter: Mark Bean
>            Priority: Critical
>
> Simplified scenario to demonstrate problem:
> A 2-node cluster with a simple flow. GenerateFlowFile -> load-balanced 
> connection -> UpdateAttribute. And, unconnected to the first two processors, 
> Funnel #1 -> non-load-balanced Connection -> Funnel #2.
> GenerateFlowFile is scheduled to run on Primary Node only. It is started. 
> This causes the connection to be very busy load balancing (round robin). 
> Then, the connection between the two funnels is removed.
> Immediately, an error is thrown, and the flow gets stuck in a state of 
> constantly throwing errors indicating that a connection (the one just 
> deleted) does not exist and cannot be balanced.
> It is unclear why this connection is being considered by the load balancer at 
> all.
> The sequence of errors includes the following:
> Primary Node reports 
> 2021-12-02 12:20:03,812 ERROR [NiFi Web Server-1811] 
> o.a.n.c.queue.SwappablePriorityQueue Updated Size of Queue Unacknowledged 
> from FlowFile Queue Size[ ActiveQueue=[0, 0 Bytes], Swap Queue=[0, 0 Bytes], 
> Swap Files=[0], Unacknowledged=[0, 0 Bytes] ] to FlowFile Queue Size[ 
> ActiveQueue=[0, 0 Bytes], Swap Queue=[0, 0 Bytes], Swap Files=[0], 
> Unacknowledged=[-206, -20600 Bytes] ]
> java.lang.RuntimeException: Cannot create negative queue size
> 2021-12-02 12:20:03,813 ERROR [NiFi Web Server-1811] 
> o.a.n.c.queue.SwappablePriorityQueue Updated Size of Queue active from 
> FlowFile Queue Size[ ActiveQueue=[0, 0 Bytes], Swap Queue=[0, 0 Bytes], Swap 
> Files=[0], Unacknowledged=[-206, -20600 Bytes] ] to FlowFile Queue Size[ 
> ActiveQueue=[206, 20600 Bytes], Swap Queue=[0, 0 Bytes], Swap Files=[0], 
> Unacknowledged=[-206, -20600 Bytes] ]
> java.lang.RuntimeException: Cannot create negative queue size
> The above may be a symptom of subsequent errors in the log:
> Primary Node reports:
> 2021-12-02 12:20:03,814 ERROR [Load-Balanced Client Thread-6] 
> o.a.n.c.q.c.c.a.n.NioAsyncLoadBalanceClient Failed to communicate with Peer 
> <host:port>
> java.io.IOException: Failed to negotiate Protocol Version with Peer 
> <host:port>. Recommended version 1 but instead of an ACCEPT or REJECT 
> response got back a response of 33.
> Non-Primary Node reports:
> 2021-12-02 12:20:03,828 ERROR [Load-Balance Server Thread-4] 
> o.a.n.c.q.c.s.ConnectionLoadBalanceServer Failed to communicate with 
> Peer<fqdn/IP:port>
> java.io.IOException: Expected to receive Transaction Completion Indicator 
> from Peer <fqdn> but instead received a value of 1
> The highly concerning part is this error which indicates a Connection which 
> was not scheduled to load balance was attempting to receive a FlowFile.
> Non-Primary Node reports:
> 2021-12-02 12:29:05,228 ERROR [Load-Balance Server Thread-808] 
> o.a.n.c.q.c.s.StandardLoadBalanceProtocol Attempted to receive FlowFiles from 
> Peer <fqdn> for Connection with ID <uuid> but no connection exists with that 
> ID.
> Note that the <uuid> value in this message corresponds to the Connection that 
> was removed, causing the errors to begin. Should the above message ever occur? 
> Does the load balancer ever consider Connections which are configured as "Do 
> not load balance"?
> Users have also reported that FlowFiles have been load balanced from one 
> Connection to another, unrelated Connection on the other Node. (This is still 
> being verified.)
> Finally, on the UI the load-balanced connection indicates it is actively load 
> balancing some number (206 in this case) of FlowFiles currently in the 
> connection. And, attempts to "list queue" on this connection show no 
> FlowFiles. Presumably they are being held by the load balancer and are 
> inaccessible in the queue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
