Peter,

It isn't a pattern that is well supported in a cluster context right now.

What is needed are automatically load-balanced connections with
partitioning.  A user could select a given relationship and indicate
that its data should automatically be distributed across the cluster,
and optionally specify a correlation attribute used to ensure that
data which belongs together stays together or is brought back
together.  We could use this to have a connection automatically
distribute data across the cluster for load balancing, and also to
bring data back to a single node whenever necessary, which is the
case in scenarios like fork/distribute/process/join/send, or
distributed receipt followed by a join for merging (like
defragmenting data that has been split).  To join the pieces back
together we need affinity/correlation, and this could work based on
some sort of hashing mechanism where there are as many buckets as
there are nodes in the cluster at a given time.  It needs a lot of
thought/design/testing/etc.
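
To make the bucketing idea concrete, here is a rough sketch (plain
Java, purely illustrative -- the class name, the "correlation.id"
attribute, and the node-list input are all assumptions, not existing
NiFi APIs) of how a correlation attribute could be hashed into as
many buckets as there are nodes:

import java.util.List;
import java.util.Map;

// Illustrative only: maps flow files to node "buckets" so data that
// shares a correlation value lands on (and can rejoin at) one node.
public class CorrelationPartitioner {

    private final List<String> nodeIds; // current cluster topology

    public CorrelationPartitioner(List<String> nodeIds) {
        this.nodeIds = nodeIds;
    }

    // If a correlation attribute is present, hash it so all flow
    // files sharing that value go to the same node; otherwise hash a
    // unique id for plain load balancing across the cluster.
    public String targetNode(Map<String, String> attributes, String flowFileId) {
        String key = attributes.getOrDefault("correlation.id", flowFileId);
        int bucket = Math.floorMod(key.hashCode(), nodeIds.size());
        return nodeIds.get(bucket);
    }
}

One wrinkle the sketch exposes: when the node count changes, simple
modulo hashing moves most keys to new buckets, so something like
consistent hashing may be needed to limit churn -- part of why this
needs a lot of design and testing.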

I was just having a conversation about this yesterday.  It is
definitely a thing and will be a major effort.  Will make a JIRA for
this soon.

Thanks

On Thu, Jun 7, 2018 at 5:21 PM, Peter Wicks (pwicks) <pwi...@micron.com> wrote:
> Bryan,
>
> We see this with large files that we have split up into smaller files and 
> distributed across the cluster using site-to-site. We then want to merge them 
> back together, so we send them to the primary node before continuing 
> processing.
>
> --Peter
>
> -----Original Message-----
> From: Bryan Bende [mailto:bbe...@gmail.com]
> Sent: Thursday, June 7, 2018 12:47 PM
> To: dev@nifi.apache.org
> Subject: [EXT] Re: Primary Only Content Migration
>
> Peter,
>
> There really shouldn't be any non-source processors scheduled for primary 
> node only. We may even want to consider preventing that option when the 
> processor has an incoming connection to avoid creating any confusion.
>
> As long as you set source processors to primary node only then everything 
> should be ok... if primary node changes, the source processor starts 
> executing on the new primary node, and any flow files it already produced on 
> the old primary node will continue to be worked off by the downstream 
> processors on the old node until they are all processed.
>
> -Bryan
>
>
>
> On Thu, Jun 7, 2018 at 1:55 PM, Peter Wicks (pwicks) <pwi...@micron.com> 
> wrote:
>> I'm sure many of you have the same situation, a flow that runs on a cluster, 
>> and at some point merges back down to a primary only processor; your files 
>> sit there in the queue with nowhere to go... We've used the work around of 
>> having a remote processor group that loops the data back to the primary node 
>> for a while, but would really like a clean/simple solution. This approach 
>> requires that users be able to put an input port on the root flow, and then 
>> route the file back down, which is a nuisance.
>>
>> I have been thinking of adding either a processor that moves data between 
>> specific nodes in a cluster, or a queue (?) option that will let users 
>> migrate the content of a flowfile back to the master node. This would allow 
>> you to move data back to a primary very easily without needing RPG's and 
>> input ports at the root level.
>>
>> All of my development work with NiFi has been focused on processors, so I'm 
>> not really sure where I would start with this.  Thoughts?
>>
>> Thanks,
>>   Peter
