Hey Joe,

This is really helpful.

In terms of examples of good architectural descriptions, I think the Kafka overview is pretty great (I think a lot of it came from the original academic paper). It's very helpful for understanding the key concepts and design trade-offs.

My personal feeling is that diagrams are very helpful: my guess is that the single-node processing layer is not all that complex, but where architecture gets interesting (and where a lot of my curiosity lies) is once you get into the distributed modes. How is fault tolerance handled, how do I specify which processors operate on which nodes, how is cluster membership handled, etc.

Also: what does NAR stand for?

Thanks!
Natty

Jonathan "Natty" Natkins
StreamSets | Customer Engagement Engineer
mobile: 609.577.1600 | linkedin <http://www.linkedin.com/in/nattyice>

On Tue, Dec 16, 2014 at 6:08 PM, Joe Witt <[email protected]> wrote:
>
> Natty
>
> There are very few existing resources as of yet, but we fully recognize that
> this is a problem.
>
> https://issues.apache.org/jira/browse/NIFI-162
>
> If there are specific examples of architectural descriptions that you think
> are well done I'd love to see them.
>
> The very brief version of how execution and scale work:
>
> Execution:
> NiFi runs within the JVM. As data flows through a given NiFi instance
> there are two primary repositories that we keep which hold key information
> about the data. One repository is known as the FlowFile repository and its
> job is to keep information about the data in the flow. The other
> repository is the content repository and it keeps the actual data. In NiFi
> you're composing directed graphs of processors. Each processor is
> scheduled to run according to its configured scheduling style and is given
> time to run by a flow controller/thread-pool. When a given processor runs it
> is given access to the FlowFile repository and content repository as
> necessary to be able to access and modify the data in a safe and efficient
> manner.
>
> Out of the box the FlowFile repo can be all in-memory or run off a
> write-ahead log based implementation with high reliability and throughput.
> The content repo likewise supports running all in-memory or using one or
> more disks in parallel, again yielding very high throughput with excellent
> durability.
>
> Scale:
> Vertical: Supports highly concurrent processing and can utilize multiple
> physical disks in parallel.
> Horizontal: Supports clustering whereby a cluster manager relays commands
> to nodes in the cluster and coordinates all their responses. Nodes then
> operate as they would if they were standalone.
>
> Lots more coming here of course but if you have specific questions now
> please feel free to fire away.
>
> Thanks
> Joe
>
>
> On Tue, Dec 16, 2014 at 7:16 PM, Jonathan Natkins <[email protected]>
> wrote:
> >
> > Hi there,
> >
> > I was curious if there exist any resources that would be helpful in
> > understanding the NiFi architecture. I'm trying to understand how
> > dataflows are executed, or how I would scale the system. Are there any
> > architectural docs, or blog posts, or academic papers out there that
> > would be helpful?
> >
> > Alternatively, some pointers into the code base as to where the execution
> > layer code lives could be helpful.
> >
> > Thanks!
> > Natty
> >
> > Jonathan "Natty" Natkins
> > StreamSets | Customer Engagement Engineer
> > mobile: 609.577.1600 | linkedin <http://www.linkedin.com/in/nattyice>
> >
>
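
For anyone following the thread, the execution model in Joe's message (a repository for information about the data, a separate repository for the data itself, and processors in a directed graph that get time to run from a thread pool) can be sketched roughly as below. This is a toy illustration in Python, not NiFi code: every class, function, and attribute name here is hypothetical, both repositories are in-memory only, and the scheduler just walks a linear chain level by level.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
import itertools
import threading


@dataclass
class FlowFile:
    """Information about one piece of data; the bytes live in the content repo."""
    flowfile_id: int
    content_claim: int  # hypothetical key into the content repository
    attributes: dict = field(default_factory=dict)


class ContentRepository:
    """Keeps the actual data (in-memory only in this sketch)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._claims = {}
        self._ids = itertools.count(1)

    def write(self, data: bytes) -> int:
        with self._lock:
            claim = next(self._ids)
            self._claims[claim] = data
            return claim

    def read(self, claim: int) -> bytes:
        with self._lock:
            return self._claims[claim]


class FlowFileRepository:
    """Keeps information about the data in the flow (in-memory only here)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._records = {}
        self._ids = itertools.count(1)

    def create(self, content_claim: int) -> FlowFile:
        with self._lock:
            ff = FlowFile(next(self._ids), content_claim)
            self._records[ff.flowfile_id] = ff
            return ff

    def update(self, ff: FlowFile, content_claim=None, attributes=None):
        with self._lock:
            if content_claim is not None:
                ff.content_claim = content_claim
            if attributes:
                ff.attributes.update(attributes)


# Two hypothetical processors. Each one only touches data through the repos.
def uppercase(ff, ff_repo, content_repo):
    data = content_repo.read(ff.content_claim)
    ff_repo.update(ff, content_claim=content_repo.write(data.upper()))


def tag_length(ff, ff_repo, content_repo):
    size = len(content_repo.read(ff.content_claim))
    ff_repo.update(ff, attributes={"length": str(size)})


def run_flow(graph, processors, start, ff, ff_repo, content_repo, workers=2):
    """Walk the graph from `start`, giving each ready processor a pool thread."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        frontier = [start]
        while frontier:
            futures = [
                pool.submit(processors[name], ff, ff_repo, content_repo)
                for name in frontier
            ]
            for future in futures:
                future.result()  # wait, and surface any processor exception
            frontier = [nxt for name in frontier for nxt in graph.get(name, [])]


def demo():
    content_repo = ContentRepository()
    ff_repo = FlowFileRepository()
    ff = ff_repo.create(content_repo.write(b"hello nifi"))
    graph = {"uppercase": ["tag_length"], "tag_length": []}
    processors = {"uppercase": uppercase, "tag_length": tag_length}
    run_flow(graph, processors, "uppercase", ff, ff_repo, content_repo)
    return content_repo.read(ff.content_claim), ff.attributes


if __name__ == "__main__":
    print(demo())
```

The point of the split is that processors pass around small metadata records while the payload stays put in the content repository. A write-ahead-log or disk-backed repository, as mentioned in the message, would replace the dictionaries here without changing the processor code.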
