Hi Neil,

 

I am also new to working with RPGs and NiFi clusters, but I know enough about 
the NiFi Site-to-Site protocol to speculate about what is going on here. (If 
others on this list more knowledgeable than I am are willing to chime in to 
confirm or correct this guess, that would be welcome!)

 

If I understand the flow you described, where you were attempting to achieve 
round-robin / even distribution, you have three RPGs set up, each one 
configured to know about one node in your cluster. The expectation, then, is 
that putting a DistributeLoad processor set to round robin upstream of the 
RPGs will round-robin flow files across the nodes. I can see how that would be 
the expectation given that configuration.

 

However, I think a little bit more is going on under the hood with the RPG 
connection(s). If I understand the details correctly, when an RPG is 
configured, the cluster endpoint(s) you specify are only used to create the 
initial connection. Once the connection is made, the client asks the endpoint 
it knows about for all the nodes in the cluster, so that if nodes are added to 
or removed from the cluster, all connected peers get updated.
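As a rough illustration of that discovery step (hypothetical names and signatures on my part, not NiFi's actual client API):

```python
# Hypothetical sketch of Site-to-Site peer discovery, NOT NiFi's real API:
# the client bootstraps from the endpoints configured on the RPG, then asks
# the first reachable one for the full, current peer list.

def discover_peers(configured_endpoints, fetch_peer_list):
    """Return every peer in the cluster, starting from the configured endpoints."""
    for endpoint in configured_endpoints:
        try:
            # Any reachable node can report the whole cluster, so an RPG
            # configured with a single endpoint still learns about all nodes.
            return fetch_peer_list(endpoint)
        except ConnectionError:
            continue  # bootstrap endpoint is down; try the next one
    raise RuntimeError("no configured endpoint was reachable")

# Example: an RPG configured with only node1 ends up knowing all three nodes.
all_nodes = ["node1:8080", "node2:8080", "node3:8080"]
peers = discover_peers(["node1:8080"], lambda endpoint: list(all_nodes))
```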

 

If that's indeed the case, then the stable state in your flow is three RPGs 
that all know about all three nodes in the cluster. That would explain why 
adding DistributeLoad did not change the behavior you observed in your initial 
flow (one RPG configured with all three endpoints). If you wanted to verify 
this further, you could create a flow with a single RPG configured for only 
one endpoint in your cluster. After enough time (once the other nodes in the 
cluster have been discovered), you should see flow files reach the nodes you 
did not specify.

 

As to why you are not seeing even distribution, I'm not sure, as I don't know 
the specifics of the load-balancing logic for sending files to RPGs. I know it 
is designed to evenly distribute load over time, so it's possible the time 
window over which you are collecting stats is smaller than the time period for 
which the RPG load balancing is optimized. In other words, if you let it run 
for longer and checked, is the load more evenly distributed? My speculation is 
that the load balancing is based on a periodic check of how many files have 
been processed by each node (rather than a check before every send, which 
would add a lot of overhead), and that the configured period for rebalancing 
the destination load is longer than the window you measured. Again, a lot of 
guessing on my part. Maybe others can confirm.
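To make that guess concrete, here is a toy sketch of periodic, stats-based weighting (my own illustration, not NiFi's actual algorithm): weights are recomputed only at each refresh, so a short observation window can look lopsided even if things even out over time.

```python
# Toy illustration of periodic, stats-based load balancing -- an assumption
# on my part, NOT NiFi's actual algorithm. At each refresh interval the
# sender looks at how many flow files each peer has handled, then biases
# upcoming sends toward the less-loaded peers until the next refresh.

def build_send_schedule(peer_flowfile_counts):
    """Given {peer: flow files handled so far}, return a list of upcoming
    send targets, with less-loaded peers appearing proportionally more often."""
    total = sum(peer_flowfile_counts.values())
    schedule = []
    for peer, count in peer_flowfile_counts.items():
        # Fewer files handled so far -> larger share of the next batch.
        weight = (total - count) or 1
        schedule.extend([peer] * weight)
    return schedule

# Example: node1 has done most of the work, so it gets the smallest share
# of sends until the stats are refreshed again.
schedule = build_send_schedule({"node1": 100, "node2": 10, "node3": 10})
```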

 

I hope this helps. If you have more findings or questions, post them back here.

 

Thanks,
Kevin

 

 

From: Neil Derraugh <[email protected]>
Reply-To: <[email protected]>
Date: Monday, July 31, 2017 at 17:20
To: <[email protected]>
Subject: RPG + FlowFiles In

 

I have a three node cluster and I am trying to rewrite a dataflow that's used 
in several places to have the common parts distribute the data across the 
cluster in a more efficient and load balanced way.  This is my first experience 
with RPGs, so I was just starting from basics and working my way up, but I am 
just out of the gate and already confused.

 

Here's the setup.  I have an input port on my root dataflow which points to a 
LogMessage processor.  In another process group I have an RPG configured with 
the three endpoints of the cluster separated by commas.  Feeding into that is a 
GenerateFlowFile processor which is running every 5ms with 9 concurrent tasks 
on the primary node only.  Everything else has default values.

 

When I start the dataflow it more or less works as expected except that the 
distribution of FlowFiles looks uneven.  That is if I look at the Status 
History of the LogMessage processor and select the FlowFiles In it looks like 
the two non-primary nodes have the bulk of the flow files moving through them. 
 I can wrap my head around that.

 

But then I rewrote it to put a DistributeLoad processor in front of three RPGs, 
one for each node in the cluster, and left it set to `round robin`.  The 
FlowFiles In on the LogMessage processor looks exactly the same as before.  The 
bulk of the FlowFiles In are on the two non-primary nodes.

 

In 5 minutes there are about 500K FlowFiles being processed and two non-primary 
nodes are processing 234238 and 233089, with the primary node processing 47597.

 

What am I missing?  Why doesn't a round robin distribute them evenly?

 

Neil
