Hi guys,

Koji is right, I initially filed NIFI-4026 to cover this kind of use case.

There are a lot of challenges here and many possible ways to address
this subject. Auto-balancing in queues would be a great way to move
toward this goal.

I think an easy first step would be to add a checkbox in the RPG
configuration allowing the user to send the data only to the remote
primary node. That information is easy to add to the S2S peer status
data requested by clients. The behavior when nodes
disconnect/reconnect would need proper documentation, but I think
that's the "easiest" improvement to achieve your goal here.
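
As a rough sketch of what that client-side peer selection could look
like (all names here are illustrative, not the actual S2S API):

    import java.util.List;
    import java.util.stream.Collectors;

    // Hypothetical peer descriptor; the real S2S client has its own
    // peer status class. Assumes the peer status response gains an
    // "is primary" flag.
    record PeerStatus(String hostname, int port, boolean primary) {}

    class PrimaryOnlyPeerSelector {
        // When the RPG is configured for "primary node only", keep
        // just the primary peer; otherwise return all peers for
        // normal load distribution.
        static List<PeerStatus> selectPeers(List<PeerStatus> peers,
                                            boolean primaryOnly) {
            if (!primaryOnly) {
                return peers;
            }
            return peers.stream()
                    .filter(PeerStatus::primary)
                    .collect(Collectors.toList());
        }
    }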

Since the changes are rather complex, I think we need to carefully think
about the solution design here.

Pierre



2018-06-08 3:48 GMT+02:00 Koji Kawamura <ijokaruma...@gmail.com>:

> There is an existing JIRA submitted by Pierre.
> I think its goal is the same as what Joe mentioned above.
> https://issues.apache.org/jira/browse/NIFI-4026
>
> As for hashing and routing data with affinity/correlation, I think
> 'Consistent Hashing' is the most popular approach to minimize the
> impact of node addition/removal.
> Applying Consistent Hashing to the S2S client may not be difficult. The
> challenging part is how to support a cluster topology change in the
> middle of transferring data that needs correlation.
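>
> As a minimal sketch of the ring itself (illustrative only, not the
> actual S2S client code):
>
>     import java.nio.charset.StandardCharsets;
>     import java.util.SortedMap;
>     import java.util.TreeMap;
>     import java.util.zip.CRC32;
>
>     // Minimal consistent-hash ring: each node is placed at several
>     // virtual points so adding/removing a node only remaps a small
>     // portion of the key space.
>     class ConsistentHashRing {
>         private final TreeMap<Long, String> ring = new TreeMap<>();
>         private static final int VIRTUAL_NODES = 64;
>
>         void addNode(String node) {
>             for (int i = 0; i < VIRTUAL_NODES; i++) {
>                 ring.put(hash(node + "#" + i), node);
>             }
>         }
>
>         void removeNode(String node) {
>             for (int i = 0; i < VIRTUAL_NODES; i++) {
>                 ring.remove(hash(node + "#" + i));
>             }
>         }
>
>         // Route a correlation id (e.g. 'rel-A') to the first node at
>         // or after its hash, wrapping around the ring.
>         String nodeFor(String correlationId) {
>             SortedMap<Long, String> tail = ring.tailMap(hash(correlationId));
>             return tail.isEmpty()
>                     ? ring.firstEntry().getValue()
>                     : tail.get(tail.firstKey());
>         }
>
>         private static long hash(String s) {
>             CRC32 crc = new CRC32();
>             crc.update(s.getBytes(StandardCharsets.UTF_8));
>             return crc.getValue();
>         }
>     }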
>
> A simple challenging scenario:
> Let's say there is a group of 4 FlowFiles sharing correlation id 'rel-A':
> 1. Client sends rel-A, data-1of4 to Node1
> 2. Client sends rel-A, data-2of4 to Node1
> 3. NodeN is added and takes over part of the hash key space that Node1
> was assigned to
> 4. Client sends rel-A, data-3of4 to NodeN
> 5. Client sends rel-A, data-4of4 to NodeN
>
> Then, the Merge processors running on Node1 and NodeN cannot complete,
> because neither node has the whole dataset to merge.
> This situation could be handled manually if we document it well,
> or by adding a resend loop, so that:
>
> 6. Client on Node1 resends rel-A, data-1of4 to NodeN
> 7. Client on Node1 resends rel-A, data-2of4 to NodeN
> 8. Merge processor on NodeN merges the FlowFiles.
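>
> A rough sketch of that resend step, reusing the ring sketched above
> (again illustrative only; resend() stands in for a real S2S transfer):
>
>     import java.util.Map;
>
>     // After a topology change, FlowFiles held on this node whose
>     // correlation id now hashes to a different node are resent to
>     // the new owner.
>     class CorrelationRebalancer {
>         void rebalance(ConsistentHashRing ring, String localNode,
>                        Map<String, byte[]> heldByCorrelationId) {
>             heldByCorrelationId.forEach((correlationId, payload) -> {
>                 String newOwner = ring.nodeFor(correlationId);
>                 if (!newOwner.equals(localNode)) {
>                     resend(newOwner, correlationId, payload);
>                 }
>             });
>         }
>
>         private void resend(String node, String correlationId,
>                             byte[] payload) {
>             // placeholder: would open an S2S transaction to 'node'
>             // and transfer the data there
>         }
>     }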
>
> I'm interested in working on this improvement, too.
>
> Thanks,
> Koji
>
>
> On Fri, Jun 8, 2018 at 8:19 AM, Joe Witt <joe.w...@gmail.com> wrote:
> > Peter
> >
> > I'm not sure there is a good way for a processor to drive such a thing
> > with existing infrastructure.  Giving a processor the ability to know
> > about the structure of the cluster is not something we have wanted to
> > expose, for good reasons.  There would likely need to be a more
> > fundamental point of support for this.
> >
> > I'm not sure what that design would look like just yet - but I agree
> > this is an important step to take soon.  If you want to start
> > sketching out design ideas, that would be awesome.
> >
> > Thanks
> > On Thu, Jun 7, 2018 at 6:11 PM Peter Wicks (pwicks) <pwi...@micron.com> wrote:
> >>
> >> Joe,
> >>
> >> I agree it is a lot of work, which is why I was thinking of starting
> >> with a processor that could do some of these operations before looking
> >> further. If the processor could move FlowFiles between nodes in the
> >> cluster, it would be a good step. Data comes in from a queue on any
> >> node, but gets written out to a queue on only the desired node, or
> >> gets written out round-robin for a distribute scenario.
> >>
> >> I want to work on it, and was trying to figure out if it could be done
> >> using only a processor, or if larger changes would definitely be needed.
> >>
> >> --Peter
> >>
> >> -----Original Message-----
> >> From: Joe Witt [mailto:joe.w...@gmail.com]
> >> Sent: Thursday, June 7, 2018 3:34 PM
> >> To: dev@nifi.apache.org
> >> Subject: Re: [EXT] Re: Primary Only Content Migration
> >>
> >> Peter,
> >>
> >> It isn't a pattern that is well supported now in a cluster context.
> >>
> >> What is needed are automatically load-balanced connections with
> >> partitioning.  This would mean a user could select a given relationship
> >> and indicate that data should be automatically distributed, and they
> >> should be able to express, optionally, a correlation attribute that is
> >> used to ensure data which belongs together stays together or is brought
> >> back together.  We could use this to have a connection automatically
> >> distribute data across the cluster for load-balancing purposes, and
> >> also to ensure that data is brought back to a single node whenever
> >> necessary, which is the case in certain scenarios like
> >> fork/distribute/process/join/send and things like distributed receipt
> >> then join for merging (like defragmenting data which has been split).
> >> To join them together we need affinity/correlation, and this could work
> >> based on some sort of hashing mechanism where there are as many buckets
> >> as there are nodes in the cluster at a given time.  It needs a lot of
> >> thought/design/testing/etc.
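> >>
> >> To make that bucketing idea concrete, a toy sketch (hypothetical,
> >> not an actual NiFi API):
> >>
> >>     import java.util.List;
> >>
> >>     // Hash a correlation attribute into as many buckets as there
> >>     // are nodes, so correlated data lands on the same node.
> >>     class CorrelationPartitioner {
> >>         // Returns the node that should receive a FlowFile with
> >>         // this correlation attribute value.
> >>         static String nodeFor(String correlationValue,
> >>                               List<String> nodes) {
> >>             int bucket = Math.floorMod(correlationValue.hashCode(),
> >>                                        nodes.size());
> >>             return nodes.get(bucket);
> >>         }
> >>     }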
> >>
> >> I was just having a conversation about this yesterday.  It is
> >> definitely a thing and will be a major effort.  Will make a JIRA for
> >> this soon.
> >>
> >> Thanks
> >>
> >> On Thu, Jun 7, 2018 at 5:21 PM, Peter Wicks (pwicks) <pwi...@micron.com> wrote:
> >> > Bryan,
> >> >
> >> > We see this with large files that we have split up into smaller
> >> > files and distributed across the cluster using site-to-site. We then
> >> > want to merge them back together, so we send them to the primary node
> >> > before continuing processing.
> >> >
> >> > --Peter
> >> >
> >> > -----Original Message-----
> >> > From: Bryan Bende [mailto:bbe...@gmail.com]
> >> > Sent: Thursday, June 7, 2018 12:47 PM
> >> > To: dev@nifi.apache.org
> >> > Subject: [EXT] Re: Primary Only Content Migration
> >> >
> >> > Peter,
> >> >
> >> > There really shouldn't be any non-source processors scheduled for
> >> > primary node only. We may even want to consider preventing that
> >> > option when the processor has an incoming connection, to avoid
> >> > creating any confusion.
> >> >
> >> > As long as you set source processors to primary node only, then
> >> > everything should be OK... if the primary node changes, the source
> >> > processor starts executing on the new primary node, and any flow
> >> > files it already produced on the old primary node will continue to be
> >> > worked off by the downstream processors on the old node until they
> >> > are all processed.
> >> >
> >> > -Bryan
> >> >
> >> >
> >> >
> >> > On Thu, Jun 7, 2018 at 1:55 PM, Peter Wicks (pwicks) <pwi...@micron.com> wrote:
> >> >> I'm sure many of you have the same situation: a flow that runs on
> >> >> a cluster and at some point merges back down to a primary-only
> >> >> processor; your files sit there in the queue with nowhere to go...
> >> >> For a while we've used the workaround of having a Remote Process
> >> >> Group that loops the data back to the primary node, but we would
> >> >> really like a clean/simple solution. This approach requires that
> >> >> users be able to put an input port on the root flow and then route
> >> >> the file back down, which is a nuisance.
> >> >>
> >> >> I have been thinking of adding either a processor that moves data
> >> >> between specific nodes in a cluster, or a queue (?) option that
> >> >> would let users migrate the content of a FlowFile back to the
> >> >> primary node. This would allow you to move data back to the primary
> >> >> very easily, without needing RPGs and input ports at the root level.
> >> >>
> >> >> All of my development work with NiFi has been focused on processors,
> >> >> so I'm not really sure where I would start with this.  Thoughts?
> >> >>
> >> >> Thanks,
> >> >>   Peter
>
