Joe,

From the diagnostics info, it looks like there are currently 500 FlowFiles 
queued up.
They all live on prod-8.ec2.internal:8443. Of those 500, 250 are waiting to go 
to prod-5.ec2.internal:8443,
and 250 are waiting to go to prod-6.ec2.internal:8443.

So this tells us that if there are any problems, they are likely occurring on 
one of those three nodes. It also rules out swapping: with only 500 FlowFiles 
queued, the connection is far below the default 20,000-FlowFile swap threshold.

Are you able to confirm that you are indeed receiving data from the load 
balanced queue on both prod-5 and prod-6?
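If it helps, one way to check that from the command line is the connection's 
per-node status from the REST API. A minimal sketch, assuming an unsecured 
instance; the host, port, and connection ID below are placeholders:

```shell
# Per-node stats for the load-balanced connection. "nodewise=true" asks
# the API to break the status out per cluster node, so you can see
# whether prod-5 and prod-6 are actually receiving anything.
CONN_ID="1234"   # placeholder: the connection's ID from the UI
URL="http://nifi01:8080/nifi-api/flow/connections/${CONN_ID}/status?nodewise=true"
curl -s --max-time 5 "$URL" || echo "request failed (is NiFi reachable?)"
```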


On Jun 4, 2019, at 11:47 AM, Joe Gresock <[email protected]> wrote:

Thanks Mark.

I'm running on Linux.  I've followed your suggestion and added an 
UpdateAttribute processor to the flow, and attached the diagnostics for it.

I also don't see any errors in the logs.

On Tue, Jun 4, 2019 at 3:34 PM Mark Payne <[email protected]> wrote:
Joe,

The first thing that comes to mind would be NIFI-6285, as Bryan points out. 
However, that would affect you only if you are running on Windows. So the 
first question is: what operating system are you running on? :)

If it's not Windows, I would recommend getting some diagnostics info if 
possible. To do this,
you can go to 
http://<hostname>:<port>/nifi-api/processors/<processor-id>/diagnostics. For 
example,
if you get to NiFi by going to http://nifi01:8080/nifi and you want 
diagnostics for the processor with ID 1234,
then try going to http://nifi01:8080/nifi-api/processors/1234/diagnostics in 
your browser.
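If you'd rather grab it from the command line, something like this works (same 
hypothetical host, port, and processor ID as the example above):

```shell
# Placeholders matching the browser example above; substitute your own.
NIFI_HOST="nifi01"
NIFI_PORT="8080"
PROCESSOR_ID="1234"

DIAG_URL="http://${NIFI_HOST}:${NIFI_PORT}/nifi-api/processors/${PROCESSOR_ID}/diagnostics"
# Save the diagnostics JSON to a file. This works only against an
# unsecured instance; a secured one needs client-certificate options.
curl -s --max-time 5 "$DIAG_URL" -o diagnostics.json || echo "request failed (is NiFi reachable?)"
```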

But a couple of caveats on the 'diagnostics' approach above. It will only work 
if you are running an insecure
NiFi instance, or if you are secured using certificates. We want the 
diagnostics for the Processor that is either
the source of the connection or the destination of the connection - it doesn't 
matter which. This will give us a
lot of information about the internal structure of the connection's FlowFile 
Queue. Of course, you said that your
connection is between two Process Groups, which means that neither the source 
nor the destination is a Processor,
so I would recommend creating a dummy Processor like UpdateAttribute and 
temporarily dragging the Connection
so that it points to that Processor, just to get the diagnostic information, 
then dragging the connection back.

Of course, it would also be helpful to look for any errors in the logs. But if 
you are able to get the diagnostics info
as described above, that's usually the best bet for debugging this sort of 
thing.

Thanks
-Mark


On Jun 4, 2019, at 11:13 AM, Bryan Bende <[email protected]> wrote:

Joe,

There are two known issues that seem possibly related...

The first was already addressed in 1.9.0, but I mention it because it was
specific to a connection between two ports:

https://issues.apache.org/jira/browse/NIFI-5919

The second is not in a release yet, but is addressed in master, and
has to do with swapping:

https://issues.apache.org/jira/browse/NIFI-6285

Seems like you wouldn't hit the first one since you are on 1.9.2, but it
does seem odd that it's the same scenario.

Mark P probably knows best about debugging, but I'm guessing a thread
dump taken while it's in this state would be helpful.
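For reference, a couple of ways to grab that thread dump. The install path and 
pid discovery below are assumptions about a typical setup, so adjust to yours:

```shell
# 1) NiFi's bundled helper, run from the NiFi install directory:
#      ./bin/nifi.sh dump thread-dump.txt
#
# 2) Plain jstack against the NiFi JVM, found via jps:
NIFI_PID="$(jps -l 2>/dev/null | awk '/org.apache.nifi/ {print $1}')"
if [ -n "$NIFI_PID" ]; then
  jstack "$NIFI_PID" > thread-dump.txt
else
  echo "NiFi JVM not found; is it running on this host?"
fi
```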

-Bryan

On Tue, Jun 4, 2019 at 10:56 AM Joe Gresock <[email protected]> wrote:

I have round-robin load-balanced connections working on one cluster, but on 
another, this type of connection seems to be stuck.

What would be the best way to debug this problem?  The connection is from one 
process group to another, so it's from an Output Port to an Input Port.

My configuration is as follows:
nifi.cluster.load.balance.host=
nifi.cluster.load.balance.port=6342
nifi.cluster.load.balance.connections.per.node=4
nifi.cluster.load.balance.max.thread.count=8
nifi.cluster.load.balance.comms.timeout=30 sec

And I ensured port 6342 is open from one node to another using the cluster node 
addresses.

Is there some error that should appear in the logs if flow files get stuck here?

I suspect they are actually stuck, not just missing, because the remainder of 
the flow is back-pressured up until this point in the flow.

Thanks!
Joe




--
I know what it is to be in need, and I know what it is to have plenty.  I have 
learned the secret of being content in any and every situation, whether well 
fed or hungry, whether living in plenty or in want.  I can do all this through 
him who gives me strength.    -Philippians 4:12-13
<diagnostics.json.gz>
