Awesome, all makes sense, and I think that answers my immediate questions. Thanks guys!
Jonathan "Natty" Natkins
StreamSets | Customer Engagement Engineer
mobile: 609.577.1600 | linkedin <http://www.linkedin.com/in/nattyice>

On Tue, Dec 16, 2014 at 8:22 PM, Joe Witt <[email protected]> wrote:
>
> Jonathan
>
> NAR stands for "NiFi Archive" and it is simply our approach to classloader
> isolation. In Maven you can build these NARs and they become bundles of
> processors/services/whatever NiFi extension you want, which you can then
> include in a build à la carte. The framework itself is even a NAR. We did
> all this first to reduce the amount of stuff (and thus possible conflicts)
> on the system classloader, and second so that different bundles of
> capabilities could have different, and sometimes conflicting, dependencies
> without any problems.
>
> I am with you on the distributed side being where things get more
> interesting. Processors execute on nodes and all nodes execute all
> processors. They only fire if they have data, though. So you interface
> with protocols which inherently offer load balancing, or you ensure that
> NiFi is pulling the data, in which case it will do load balancing by its
> nature. We then include things like back-pressure and congestion-avoidance
> functions, which means that even if load balancing isn't working we can
> still have nodes 'back off' and thus other nodes naturally pick up the
> slack. This helps to address the natural hot-spotting that tends to occur
> in data flow and data processing.
>
> Fault tolerance: If a node dies, the data on the node at that time is 'as
> dead as the node'. Any new data will be routed around that node in the
> case of pushes from producers to NiFi, or other nodes will take over the
> load of pulling in the case of pulls by NiFi from those sources. The
> cluster of NiFi nodes is managed by a thing called the 'NiFi Cluster
> Manager [NCM]'. As long as it is up you can command and control all the
> nodes in the cluster.
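The per-bundle classloader isolation described above can be sketched in plain Java: each bundle directory gets its own URLClassLoader, so two bundles can ship conflicting versions of the same dependency without clashing. This is an illustrative sketch, not NiFi's actual NAR-loading code; the `BundleLoader` class and directory layout are invented for the example.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class BundleLoader {
    // Each bundle directory gets its own classloader. The parent is the
    // minimal framework loader, so a bundle's jars are visible only to that
    // bundle and never land on the system classpath.
    public static ClassLoader loadBundle(Path bundleDir, ClassLoader parent) throws Exception {
        List<URL> jars = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(bundleDir, "*.jar")) {
            for (Path jar : stream) {
                jars.add(jar.toUri().toURL());
            }
        }
        return new URLClassLoader(jars.toArray(new URL[0]), parent);
    }
}
```

Two bundles loaded this way resolve their own dependencies first through their own loader, which is the essence of the "conflicting dependencies, no problem" property.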
> If the NCM is dead, the nodes all keep rolling based on what they knew
> last. You then also have to be concerned about network partitioning. Each
> node is always heartbeating to the NCM. If the NCM doesn't get a heartbeat
> within a reasonable period of time, it will mark that node as
> disconnected. When any node in the cluster is disconnected, data flow
> changes cannot be made. This prevents the flow from being put into a bad
> state as a result of the partitioning event. If you know that node is not
> just isolated but is actually 'dead', 'gone for a while', whatever, then
> you can delete it from the cluster and you'll be able to make changes to
> the flow again.
>
> As new nodes join the cluster, the first thing that happens is the typical
> 'am I authorized to be here' check. Once through that, the NCM sends a
> copy of the latest flow configuration to the node, at which point the node
> loads the flow and begins its journey.
>
> I'm glossing over a lot of the fun details of course, but we will
> obviously dig further here, and if there are any tracks you want to go
> deeper on sooner then please advise.
>
> Thanks
> Joe
>
>
> On Tue, Dec 16, 2014 at 11:04 PM, Jonathan Natkins <[email protected]>
> wrote:
> >
> > Hey Joe,
> >
> > This is really helpful. In terms of examples of good architectural
> > descriptions, I think the Kafka overview is pretty great (I think a lot
> > of it came from the original academic paper). It's very helpful for
> > understanding the key concepts and design trade-offs. My personal
> > feeling is that diagrams are very helpful: my guess is that the
> > single-node processing layer is not all that complex, but where
> > architecture gets interesting (and where a lot of my curiosity lies) is
> > once you get into the distributed modes. How is fault tolerance handled,
> > how do I specify which processors operate on which nodes, how is cluster
> > membership handled, etc.?
> >
> > Also: what does NAR stand for?
> >
> > Thanks!
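The NCM heartbeat bookkeeping described above (mark a node disconnected once its heartbeat goes stale, refuse flow changes while any node is disconnected, and allow them again once a dead node is deleted from the cluster) might look roughly like this minimal sketch; `HeartbeatMonitor` and its methods are invented for illustration and are not NiFi's real classes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    public HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called each time the cluster manager receives a heartbeat from a node.
    public void onHeartbeat(String nodeId, long now) {
        lastHeartbeat.put(nodeId, now);
    }

    // A node is 'disconnected' once its last heartbeat is older than the timeout.
    public boolean isDisconnected(String nodeId, long now) {
        Long last = lastHeartbeat.get(nodeId);
        return last == null || now - last > timeoutMillis;
    }

    // Flow modifications are refused while any known node is disconnected,
    // which prevents partitioned nodes from drifting out of sync with the flow.
    public boolean flowChangesAllowed(long now) {
        return lastHeartbeat.keySet().stream().noneMatch(id -> isDisconnected(id, now));
    }

    // Deleting a truly dead node from the cluster re-enables flow changes.
    public void removeNode(String nodeId) {
        lastHeartbeat.remove(nodeId);
    }
}
```

The design choice worth noting is that disconnection is a per-node judgment but the "can I edit the flow" decision is cluster-wide, exactly as described above.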
> > Natty
> >
> > Jonathan "Natty" Natkins
> > StreamSets | Customer Engagement Engineer
> > mobile: 609.577.1600 | linkedin <http://www.linkedin.com/in/nattyice>
> >
> >
> > On Tue, Dec 16, 2014 at 6:08 PM, Joe Witt <[email protected]> wrote:
> > >
> > > Natty
> > >
> > > There are very few existing resources as of yet, but I fully
> > > recognize that this is a problem.
> > >
> > > https://issues.apache.org/jira/browse/NIFI-162
> > >
> > > If there are specific examples of architectural descriptions that you
> > > think are well done, I'd love to see them.
> > >
> > > The very brief version of how execution and scale work:
> > >
> > > Execution:
> > > NiFi runs within the JVM. As data flows through a given NiFi
> > > instance, there are two primary repositories that we keep which hold
> > > key information about the data. One repository is known as the
> > > FlowFile repository and its job is to keep information about the data
> > > in the flow. The other repository is the content repository and it
> > > keeps the actual data. In NiFi you're composing directed graphs of
> > > processors. Each processor is scheduled to run according to its
> > > configured scheduling style and is given time to run by a flow
> > > controller/thread pool. When a given processor runs, it is given
> > > access to the FlowFile repository and content repository as necessary
> > > to be able to access and modify the data in a safe and efficient
> > > manner.
> > >
> > > Out of the box the FlowFile repo can be all in-memory or run off a
> > > write-ahead-log-based implementation with high reliability and
> > > throughput. The content repo likewise supports running all in-memory
> > > or using one or more disks in parallel, again yielding very high
> > > throughput with excellent durability.
> > >
> > > Scale:
> > > Vertical: Supports highly concurrent processing and can utilize
> > > multiple physical disks in parallel.
> > > Horizontal: Supports clustering, whereby a cluster manager relays
> > > commands to nodes in the cluster and coordinates all their responses.
> > > Nodes then operate as they would if they were standalone.
> > >
> > > Lots more coming here of course, but if you have specific questions
> > > now, please feel free to fire away.
> > >
> > > Thanks
> > > Joe
> > >
> > >
> > > On Tue, Dec 16, 2014 at 7:16 PM, Jonathan Natkins
> > > <[email protected]> wrote:
> > > >
> > > > Hi there,
> > > >
> > > > I was curious if there exist any resources that would be helpful in
> > > > understanding the NiFi architecture. I'm trying to understand how
> > > > dataflows are executed, or how I would scale the system. Are there
> > > > any architectural docs, or blog posts, or academic papers out there
> > > > that would be helpful?
> > > >
> > > > Alternatively, some pointers into the code base as to where the
> > > > execution layer code lives could be helpful.
> > > >
> > > > Thanks!
> > > > Natty
> > > >
> > > > Jonathan "Natty" Natkins
> > > > StreamSets | Customer Engagement Engineer
> > > > mobile: 609.577.1600 | linkedin <http://www.linkedin.com/in/nattyice>
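The execution model Joe outlines (processors composed into a graph, each given time to run by a flow controller/thread pool, and firing only when they have data queued) can be illustrated with a toy scheduler. The `Processor` interface and `MiniFlowController` below are invented for the example and are not NiFi's actual API.

```java
import java.util.Queue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class MiniFlowController {
    // A deliberately tiny stand-in for a processor: it is handed its
    // incoming queue and does its work only when triggered.
    public interface Processor {
        void onTrigger(Queue<String> incoming);
    }

    private final ScheduledExecutorService pool = Executors.newScheduledThreadPool(4);

    // Give the processor time to run at a fixed interval, but skip the run
    // entirely when no data is queued for it - mirroring "they only fire
    // if they have data."
    public void schedule(Processor p, Queue<String> incoming, long periodMillis) {
        pool.scheduleAtFixedRate(() -> {
            if (!incoming.isEmpty()) {
                p.onTrigger(incoming);
            }
        }, 0, periodMillis, TimeUnit.MILLISECONDS);
    }

    public void shutdown() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

Connecting such processors via shared queues gives the directed-graph shape described above, with the thread pool playing the flow controller's role of handing out execution time.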
