Peter,

It isn't a pattern that is well supported now in a cluster context.
What is needed are automatically load balanced connections with partitioning. This would mean a user could select a given relationship and indicate that its data should automatically be distributed, and they should optionally be able to specify a correlation attribute used to ensure that data which belongs together stays together or is brought back together.

We could use this to have a connection automatically distribute data across the cluster for load-balancing purposes, and also to ensure that data is brought back to a single node whenever necessary, which is the case in certain scenarios like fork/distribute/process/join/send, and in things like distributed receipt followed by a join for merging (like defragmenting data which has been split). To join them together we need affinity/correlation, and this could work based on some sort of hashing mechanism with as many buckets as there are nodes in the cluster at a given time (a rough illustrative sketch of this bucketing idea is appended after the quoted messages below). It needs a lot of thought/design/testing/etc. I was just having a conversation about this yesterday. It is definitely a thing and will be a major effort. Will make a JIRA for this soon.

Thanks

On Thu, Jun 7, 2018 at 5:21 PM, Peter Wicks (pwicks) <pwi...@micron.com> wrote:
> Bryan,
>
> We see this with large files that we have split up into smaller files and
> distributed across the cluster using site-to-site. We then want to merge
> them back together, so we send them to the primary node before continuing
> processing.
>
> --Peter
>
> -----Original Message-----
> From: Bryan Bende [mailto:bbe...@gmail.com]
> Sent: Thursday, June 7, 2018 12:47 PM
> To: dev@nifi.apache.org
> Subject: [EXT] Re: Primary Only Content Migration
>
> Peter,
>
> There really shouldn't be any non-source processors scheduled for primary
> node only. We may even want to consider preventing that option when the
> processor has an incoming connection, to avoid creating any confusion.
>
> As long as you set source processors to primary node only then everything
> should be ok... if the primary node changes, the source processor starts
> executing on the new primary node, and any flow files it already produced
> on the old primary node will continue to be worked off by the downstream
> processors on the old node until they are all processed.
>
> -Bryan
>
>
>
> On Thu, Jun 7, 2018 at 1:55 PM, Peter Wicks (pwicks) <pwi...@micron.com>
> wrote:
>> I'm sure many of you have the same situation: a flow that runs on a
>> cluster, and at some point merges back down to a primary-only processor;
>> your files sit there in the queue with nowhere to go... We've used the
>> workaround of having a Remote Process Group that loops the data back to
>> the primary node for a while, but would really like a clean/simple
>> solution. This approach requires that users be able to put an input port
>> on the root flow, and then route the file back down, which is a nuisance.
>>
>> I have been thinking of adding either a processor that moves data between
>> specific nodes in a cluster, or a queue (?) option that will let users
>> migrate the content of a flowfile back to the primary node. This would
>> allow you to move data back to a primary very easily without needing
>> RPGs and input ports at the root level.
>>
>> All of my development work with NiFi has been focused on processors, so
>> I'm not really sure where I would start with this. Thoughts?
>>
>> Thanks,
>> Peter
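
Appended for illustration: a minimal, hypothetical Java sketch of the bucketing/affinity idea described at the top of this thread. It is not from NiFi's code base; the class, node identifiers, and method names are made up for the example, and the only real NiFi name referenced is the fragment.identifier attribute that the split/merge processors use to tie fragments of one file together. The point is just that with as many buckets as there are cluster nodes, hashing an optional correlation attribute keeps related flow files on the same node, while data without a correlation key can simply be spread around.

import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: partition data across however many nodes the
// cluster has right now, using a stable hash of an optional correlation
// attribute so that related data always lands in the same bucket (node).
public class CorrelationPartitioner {

    private final List<String> nodeIds;  // current cluster members, one bucket per node

    public CorrelationPartitioner(List<String> nodeIds) {
        if (nodeIds.isEmpty()) {
            throw new IllegalArgumentException("cluster must have at least one node");
        }
        this.nodeIds = List.copyOf(nodeIds);
    }

    /**
     * Picks the node that should receive a piece of data.
     *
     * @param attributes      the flow file's attributes (stand-in for NiFi's FlowFile)
     * @param correlationAttr optional attribute name used for affinity; when null or
     *                        absent, fall back to simple spreading by a counter
     * @param fallbackCounter monotonically increasing value used when no correlation exists
     */
    public String pickNode(Map<String, String> attributes, String correlationAttr, long fallbackCounter) {
        final int buckets = nodeIds.size();
        final String key = correlationAttr == null ? null : attributes.get(correlationAttr);
        final int bucket;
        if (key == null) {
            // no affinity requested: just spread the load evenly
            bucket = (int) (fallbackCounter % buckets);
        } else {
            // affinity requested: the same key always hashes to the same bucket
            // (as long as the node count stays the same)
            bucket = Math.floorMod(stableHash(key), buckets);
        }
        return nodeIds.get(bucket);
    }

    // FNV-1a: a simple, stable hash so every node computes the same bucket for a key
    private static int stableHash(String value) {
        int hash = 0x811c9dc5;
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 0x01000193;
        }
        return hash;
    }

    public static void main(String[] args) {
        CorrelationPartitioner partitioner =
                new CorrelationPartitioner(List.of("node-1", "node-2", "node-3"));

        // Fragments of one split file share fragment.identifier, so they are all
        // routed to the same node and can be merged (defragmented) back together there.
        String target = partitioner.pickNode(
                Map.of("fragment.identifier", "file-42"), "fragment.identifier", 0);
        System.out.println("fragments of file-42 go to " + target);
    }
}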