Hey Joe,

This is really helpful. In terms of examples of good architectural
descriptions, I think the Kafka overview is pretty great (I think a lot of
it came from the original academic paper). It's very helpful for
understanding the key concepts and design trade-offs. My personal feeling
is that diagrams are very helpful: my guess is that the single-node
processing layer is not all that complex, but where architecture gets
interesting (and where a lot of my curiosity lies) is once you get into the
distributed modes. How is fault tolerance handled? How do I specify which
processors operate on which nodes? How is cluster membership handled?

Also: what does NAR stand for?

Thanks!
Natty

Jonathan "Natty" Natkins
StreamSets | Customer Engagement Engineer
mobile: 609.577.1600 | linkedin <http://www.linkedin.com/in/nattyice>


On Tue, Dec 16, 2014 at 6:08 PM, Joe Witt <[email protected]> wrote:
>
> Natty
>
> There are very few existing resources as of yet, but we fully recognize
> that this is a problem.
>
> https://issues.apache.org/jira/browse/NIFI-162
>
> If there are specific examples of architectural descriptions that you think
> are well done I'd love to see them.
>
> The very brief version of how execution and scale work:
>
> Execution:
> NiFi runs within the JVM.  As data flows through a given NiFi instance
> there are two primary repositories that we keep which hold key information
> about the data.  One repository is known as the FlowFile repository and its
> job is to keep information about the data in the flow.  The other
> repository is the content repository and it keeps the actual data.  In NiFi
> you're composing directed graphs of processors.  Each processor is
> scheduled to run according to its configured scheduling style and is given
> time to run by a flow controller/thread-pool.  When a given processor runs
> it is given access to the FlowFile repository and content repository as
> necessary to be able to access and modify the data in a safe and efficient
> manner.
>
> Out of the box the FlowFile repo can be all in-memory or run off a
> write-ahead log based implementation with high reliability and throughput.
> The content repo likewise can be all in-memory or can use one or more
> disks in parallel, again yielding very high throughput with excellent
> durability.
>
> Scale:
> Vertical: Supports highly concurrent processing and can utilize multiple
> physical disks in parallel.
> Horizontal: Supports clustering whereby a cluster manager relays commands
> to nodes in the cluster and coordinates all their responses.  Nodes then
> operate as they would if they were standalone.
>
> Lots more coming here of course but if you have specific questions now
> please feel free to fire away.
>
> Thanks
> Joe
>
>
>
>
> On Tue, Dec 16, 2014 at 7:16 PM, Jonathan Natkins <[email protected]>
> wrote:
> >
> > Hi there,
> >
> > I was curious if there exist any resources that would be helpful in
> > understanding the NiFi architecture. I'm trying to understand how
> > dataflows are executed, or how I would scale the system. Are there any
> > architectural docs, or blog posts, or academic papers out there that
> > would be helpful?
> >
> > Alternatively, some pointers into the code base as to where the execution
> > layer code lives could be helpful.
> >
> > Thanks!
> > Natty
> >
> > Jonathan "Natty" Natkins
> > StreamSets | Customer Engagement Engineer
> > mobile: 609.577.1600 | linkedin <http://www.linkedin.com/in/nattyice>
> >
>
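
A minimal sketch of the execution model described in Joe's "Execution"
paragraph above, for illustration only. Everything named here (FlowSketch,
the two maps standing in for the FlowFile and content repositories, the
"upper" and "exclaim" processors, the 10 ms schedule) is hypothetical and is
not NiFi's actual API. It only shows the shape of the idea: processors form
a directed graph connected by queues, a thread pool gives each one time to
run, and each run reads and writes data through the two repositories.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.UnaryOperator;

public class FlowSketch {
    // Hypothetical content repository: content claim id -> raw bytes.
    static final Map<String, byte[]> CONTENT_REPO = new ConcurrentHashMap<>();
    // Hypothetical FlowFile repository: flowfile id -> current content claim id.
    static final Map<String, String> FLOWFILE_REPO = new ConcurrentHashMap<>();

    // One node of the directed graph: takes a flowfile id from `in`,
    // rewrites its content, and hands the id to `out`.
    static Runnable processor(String name, BlockingQueue<String> in,
                              BlockingQueue<String> out, UnaryOperator<String> fn) {
        return () -> {
            String id = in.poll();
            if (id == null) {
                return; // nothing queued; wait for the next scheduled run
            }
            String oldClaim = FLOWFILE_REPO.get(id);
            String text = new String(CONTENT_REPO.get(oldClaim), StandardCharsets.UTF_8);
            // This sketch stores the result under a new claim id and then
            // repoints the flowfile's metadata at it.
            String newClaim = id + "@" + name;
            CONTENT_REPO.put(newClaim, fn.apply(text).getBytes(StandardCharsets.UTF_8));
            FLOWFILE_REPO.put(id, newClaim);
            out.add(id);
        };
    }

    // Builds a two-processor flow, pushes one flowfile through it, and
    // returns the final content.
    static String runDemo() {
        BlockingQueue<String> q1 = new LinkedBlockingQueue<>();
        BlockingQueue<String> q2 = new LinkedBlockingQueue<>();
        BlockingQueue<String> done = new LinkedBlockingQueue<>();

        // Seed one flowfile: content in one repo, metadata in the other.
        CONTENT_REPO.put("ff-1@ingest", "hello nifi".getBytes(StandardCharsets.UTF_8));
        FLOWFILE_REPO.put("ff-1", "ff-1@ingest");
        q1.add("ff-1");

        // The thread pool plays the role of the flow controller: each
        // processor is scheduled to run every 10 ms (timer-driven style).
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(2);
        pool.scheduleWithFixedDelay(
                processor("upper", q1, q2, String::toUpperCase), 0, 10, TimeUnit.MILLISECONDS);
        pool.scheduleWithFixedDelay(
                processor("exclaim", q2, done, s -> s + "!"), 0, 10, TimeUnit.MILLISECONDS);

        try {
            String id = done.poll(5, TimeUnit.SECONDS);
            if (id == null) {
                throw new IllegalStateException("flow did not complete in time");
            }
            return new String(CONTENT_REPO.get(FLOWFILE_REPO.get(id)), StandardCharsets.UTF_8);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the flow", e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(runDemo()); // prints HELLO NIFI!
    }
}
```

The in-memory maps correspond to the "all in-memory" repository option in
Joe's note; a write-ahead-log or disk-backed variant would sit behind the
same two lookups. Clustering is not modeled here at all.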
