Awesome, all makes sense, and I think that answers my immediate questions. Thanks guys!
Jonathan "Natty" Natkins
StreamSets | Customer Engagement Engineer
mobile: 609.577.1600 | linkedin <http://www.linkedin.com/in/nattyice>

On Tue, Dec 16, 2014 at 8:22 PM, Joe Witt <[email protected]> wrote:
>
> Jonathan
>
> NAR stands for "NiFi Archive" and it is simply our approach to classloader
> isolation. In Maven you can build these NARs and they become bundles of
> processors/services/whatever NiFi extension you want, which you can then
> include in a build à la carte. The framework itself is even a NAR. We did
> all this first to reduce the amount of stuff (and thus possible conflicts)
> on the system classloader, and second so that different bundles of
> capabilities could have different, and sometimes conflicting, dependencies
> without any problems.
>
> I am with you on the distributed side being where things get more
> interesting. Processors execute on nodes and all nodes execute all
> processors. They only fire if they have data, though. So you interface
> with protocols which inherently offer load balancing, or you ensure that
> NiFi is pulling the data, in which case it will do load balancing by its
> nature. We then include things like back-pressure and congestion-avoidance
> functions, which means that even if load balancing isn't working we can
> still have nodes 'back off' and thus other nodes naturally pick up the
> slack. This helps to address the natural hot-spotting that tends to occur
> in data flow and data processing.
>
> Fault tolerance: If a node dies, the data on the node at that time is 'as
> dead as the node'. Any new data will be routed around that node in the
> case of pushes from producers to NiFi, or other nodes will take over the
> load of pulling in the case of pulls by NiFi from those sources. The
> cluster of NiFi nodes is managed by a thing called the 'NiFi Cluster
> Manager [NCM]'. As long as it is up you can command and control all the
> nodes in the cluster.
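The per-bundle classloader isolation described above can be sketched in plain Java: each bundle directory gets its own URLClassLoader, so two bundles can ship conflicting versions of the same dependency without clashing. This is an illustrative sketch, not NiFi's actual NAR-loading code; the `BundleLoader` class and directory layout are invented for the example.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class BundleLoader {
    // Each bundle directory gets its own classloader. The parent is the
    // minimal framework loader, so a bundle's jars are visible only to that
    // bundle and never land on the system classpath.
    public static ClassLoader loadBundle(Path bundleDir, ClassLoader parent) throws Exception {
        List<URL> jars = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(bundleDir, "*.jar")) {
            for (Path jar : stream) {
                jars.add(jar.toUri().toURL());
            }
        }
        return new URLClassLoader(jars.toArray(new URL[0]), parent);
    }
}
```

Two bundles loaded this way resolve their own dependencies first through their own loader, which is the essence of the "conflicting dependencies, no problem" property.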
> If the NCM is dead, the nodes all keep rolling based on what they knew
> last. You then also have to be concerned about network partitioning. Each
> node is always heartbeating to the NCM. If the NCM doesn't get a heartbeat
> within a reasonable period of time, it will mark that node as
> disconnected. When any node in the cluster is disconnected, data flow
> changes cannot be made. This prevents the flow from being put into a bad
> state as a result of the partitioning event. If you know that node is not
> just isolated but is actually 'dead', 'gone for a while', whatever, then
> you can delete it from the cluster and you'll be able to make changes to
> the flow again.
>
> As new nodes join the cluster, the first thing that happens is the typical
> 'am I authorized to be here' check. Once through that, the NCM sends a
> copy of the latest flow configuration to the node, at which point the node
> loads the flow and begins its journey.
>
> I'm glossing over a lot of the fun details of course, but we will
> obviously dig further here, and if there are any tracks you want to go
> deeper on sooner then please advise.
>
> Thanks
> Joe
>
>
> On Tue, Dec 16, 2014 at 11:04 PM, Jonathan Natkins <[email protected]>
> wrote:
> >
> > Hey Joe,
> >
> > This is really helpful. In terms of examples of good architectural
> > descriptions, I think the Kafka overview is pretty great (I think a lot
> > of it came from the original academic paper). It's very helpful for
> > understanding the key concepts and design trade-offs. My personal
> > feeling is that diagrams are very helpful: my guess is that the
> > single-node processing layer is not all that complex, but where
> > architecture gets interesting (and where a lot of my curiosity lies) is
> > once you get into the distributed modes. How is fault tolerance handled,
> > how do I specify which processors operate on which nodes, how is cluster
> > membership handled, etc.?
> >
> > Also: what does NAR stand for?
> >
> > Thanks!
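The NCM heartbeat bookkeeping described above (mark a node disconnected once its heartbeat goes stale, refuse flow changes while any node is disconnected, and allow them again once a dead node is deleted from the cluster) might look roughly like this minimal sketch; `HeartbeatMonitor` and its methods are invented for illustration and are not NiFi's real classes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    public HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called each time the cluster manager receives a heartbeat from a node.
    public void onHeartbeat(String nodeId, long now) {
        lastHeartbeat.put(nodeId, now);
    }

    // A node is 'disconnected' once its last heartbeat is older than the timeout.
    public boolean isDisconnected(String nodeId, long now) {
        Long last = lastHeartbeat.get(nodeId);
        return last == null || now - last > timeoutMillis;
    }

    // Flow modifications are refused while any known node is disconnected,
    // which prevents partitioned nodes from drifting out of sync with the flow.
    public boolean flowChangesAllowed(long now) {
        return lastHeartbeat.keySet().stream().noneMatch(id -> isDisconnected(id, now));
    }

    // Deleting a truly dead node from the cluster re-enables flow changes.
    public void removeNode(String nodeId) {
        lastHeartbeat.remove(nodeId);
    }
}
```

The design choice worth noting is that disconnection is a per-node judgment but the "can I edit the flow" decision is cluster-wide, exactly as described above.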
> > Natty
> >
> > Jonathan "Natty" Natkins
> > StreamSets | Customer Engagement Engineer
> > mobile: 609.577.1600 | linkedin <http://www.linkedin.com/in/nattyice>
> >
> >
> > On Tue, Dec 16, 2014 at 6:08 PM, Joe Witt <[email protected]> wrote:
> > >
> > > Natty
> > >
> > > There are very few existing resources as of yet, but I fully
> > > recognize that this is a problem.
> > >
> > > https://issues.apache.org/jira/browse/NIFI-162
> > >
> > > If there are specific examples of architectural descriptions that you
> > > think are well done, I'd love to see them.
> > >
> > > The very brief version of how execution and scale work:
> > >
> > > Execution:
> > > NiFi runs within the JVM. As data flows through a given NiFi
> > > instance, there are two primary repositories that we keep which hold
> > > key information about the data. One repository is known as the
> > > FlowFile repository and its job is to keep information about the data
> > > in the flow. The other repository is the content repository and it
> > > keeps the actual data. In NiFi you're composing directed graphs of
> > > processors. Each processor is scheduled to run according to its
> > > configured scheduling style and is given time to run by a flow
> > > controller/thread pool. When a given processor runs, it is given
> > > access to the FlowFile repository and content repository as necessary
> > > to be able to access and modify the data in a safe and efficient
> > > manner.
> > >
> > > Out of the box the FlowFile repo can be all in-memory or run off a
> > > write-ahead-log-based implementation with high reliability and
> > > throughput. The content repo likewise supports running all in-memory
> > > or using one or more disks in parallel, again yielding very high
> > > throughput with excellent durability.
> > >
> > > Scale:
> > > Vertical: Supports highly concurrent processing and can utilize
> > > multiple physical disks in parallel.
> > > Horizontal: Supports clustering, whereby a cluster manager relays
> > > commands to nodes in the cluster and coordinates all their responses.
> > > Nodes then operate as they would if they were standalone.
> > >
> > > Lots more coming here of course, but if you have specific questions
> > > now, please feel free to fire away.
> > >
> > > Thanks
> > > Joe
> > >
> > >
> > > On Tue, Dec 16, 2014 at 7:16 PM, Jonathan Natkins
> > > <[email protected]> wrote:
> > > >
> > > > Hi there,
> > > >
> > > > I was curious if there exist any resources that would be helpful in
> > > > understanding the NiFi architecture. I'm trying to understand how
> > > > dataflows are executed, or how I would scale the system. Are there
> > > > any architectural docs, or blog posts, or academic papers out there
> > > > that would be helpful?
> > > >
> > > > Alternatively, some pointers into the code base as to where the
> > > > execution layer code lives could be helpful.
> > > >
> > > > Thanks!
> > > > Natty
> > > >
> > > > Jonathan "Natty" Natkins
> > > > StreamSets | Customer Engagement Engineer
> > > > mobile: 609.577.1600 | linkedin <http://www.linkedin.com/in/nattyice>
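The execution model Joe outlines (processors composed into a graph, each given time to run by a flow controller/thread pool, and firing only when they have data queued) can be illustrated with a toy scheduler. The `Processor` interface and `MiniFlowController` below are invented for the example and are not NiFi's actual API.

```java
import java.util.Queue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class MiniFlowController {
    // A deliberately tiny stand-in for a processor: it is handed its
    // incoming queue and does its work only when triggered.
    public interface Processor {
        void onTrigger(Queue<String> incoming);
    }

    private final ScheduledExecutorService pool = Executors.newScheduledThreadPool(4);

    // Give the processor time to run at a fixed interval, but skip the run
    // entirely when no data is queued for it - mirroring "they only fire
    // if they have data."
    public void schedule(Processor p, Queue<String> incoming, long periodMillis) {
        pool.scheduleAtFixedRate(() -> {
            if (!incoming.isEmpty()) {
                p.onTrigger(incoming);
            }
        }, 0, periodMillis, TimeUnit.MILLISECONDS);
    }

    public void shutdown() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

Connecting such processors via shared queues gives the directed-graph shape described above, with the thread pool playing the flow controller's role of handing out execution time.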
