Dmitry,

Good questions and things to think about. You've gone right to the heart of the framework. There are three repositories to consider, which I'll talk about here, though they are probably pretty well covered in the existing guides. I know Joe Percivall has a 'day in the life of a flow file' document that he is nearly ready to send out - that should help a lot!
Key terminology item: "flow file". This is the metadata about a 'thing' in the flow plus the content of that 'thing'. A flow file can represent an 'event', a 'record', a 'tuple', a 'bag of bits'. The bottom line is that a flow file holds context about the thing and the data of the thing. People coming from different frames of reference call these things by various names ('data', 'event', 'record', 'file', etc.), so keep that in mind.

So, three repositories to understand. I'll be intermingling the concept and the default implementation a bit here, so keep that in mind too.

1) Flow File Repository - This is where the 'fact of' a flow file lives. It holds things like the identifier of a flow file, its name, entry date, and map of attributes. The typical/default implementation is a write-ahead log which keeps track of the persistent state of these flow file objects. The key thing to realize is that this does *not* include the content of a flow file.

2) Content Repository - This is where the 'bytes' of the thing live. Say you have a flow file that is a JSON document. The actual JSON bytes live here; the 'name' of that thing and the things we know or have learned about it live in the flow file repo. In the content repository the default implementation is to persist the content to disk. No, we do not persist it every time a flow file moves from processor to processor - more on that in a minute. It also never needs to be fully in memory - more on that in a minute too. Disks these days, and things like caching in Linux, are awesome, and the content repository is designed to take great advantage of that. The repository design also lets us take great advantage of copy-on-write (only ever make a new version of something once it is manipulated) and pass-by-reference (never clone the bytes; make and pass pointers instead). In this sense you can think of the content repository (at least the default implementation) as an immutable, versioned content store. Very powerful and very fast.

3) Provenance Repository - This is where events about the data live as it comes into, goes through, and leaves the flow. These events form the truth of what happened; you can think of it as an index of events about what happened in the flow. It has information that looks a lot like what you'd see in the flow file repository (no content) plus some nice relationship data, so you don't have to trudge through log files anymore to figure out what happened. What is cool too is that it has pointers to content. This is how we can let you click on content at any point of its life in the flow. Doing some complex transformation? Use provenance to click on the content before and after that transformation event to prove it works. Flow not quite right yet? Use provenance to hit replay after you tweak the settings, and keep watching it evolve until you are sure it works. You can even do all this live in the flow on a dev copy (thanks again to copy-on-write/pass-by-reference), then when you are ready merge it in to be the production feed. In this way it is a lot like the mentality a developer has when using Git, just now with a fun UI.

Ok, so we've talked through a lot and I probably skipped major details. Feel free to ask as much as you want. The last thing I'll mention for now is that all the interaction a developer has with these repositories occurs through the ProcessSession abstraction.
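To make that concrete, here is a rough sketch of my own (not from any guide) of what a processor's onTrigger can look like when it streams content through a transform - GZIP compression in this case. The class name and the REL_SUCCESS relationship are just illustrative stand-ins; the session calls are the real NiFi API.

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.zip.GZIPOutputStream;

    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.AbstractProcessor;
    import org.apache.nifi.processor.ProcessContext;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.Relationship;
    import org.apache.nifi.processor.exception.ProcessException;
    import org.apache.nifi.processor.io.StreamCallback;
    import org.apache.nifi.stream.io.StreamUtils;

    // Hypothetical processor, just to show the ProcessSession pattern.
    public class GzipExample extends AbstractProcessor {

        static final Relationship REL_SUCCESS = new Relationship.Builder()
                .name("success").build();

        @Override
        public void onTrigger(ProcessContext context, ProcessSession session)
                throws ProcessException {
            FlowFile flowFile = session.get();  // just the metadata; content stays put
            if (flowFile == null) {
                return;
            }
            // write() hands us streams over the content. Because we are actually
            // changing bytes, copy-on-write kicks in and a new version is created
            // in the content repository; the old one stays until unreferenced.
            flowFile = session.write(flowFile, new StreamCallback() {
                @Override
                public void process(InputStream in, OutputStream rawOut) throws IOException {
                    try (GZIPOutputStream out = new GZIPOutputStream(rawOut)) {
                        StreamUtils.copy(in, out);  // fixed-size buffer, so a multi-GB
                                                    // object looks like 1 KB on the heap
                    }
                }
            });
            // Attribute changes land in the flow file repository, not the content repo.
            flowFile = session.putAttribute(flowFile, "mime.type", "application/gzip");
            session.transfer(flowFile, REL_SUCCESS);
        }
    }

Note that the content repository is only engaged here because write() actually changes bytes; if this processor only added attributes, no new content version would be created. And when the session commits, the framework records the corresponding provenance events for you.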
This is how you can build processors that interact with these data objects as streams and thus never need to load the whole content into memory. We can do things like compress or encrypt massive multi-GB objects and it never looks any different on the heap than a 1 KB message. That is because of the design of the ProcessSession and these repositories. Only when you manipulate content is the content repository engaged to make new versions, and it doesn't have to get rid of the previous version until nothing else references it. This is similar in concept to the design of the heap and garbage collection in Java. It just becomes a really nice model, and since we control when we age things off we can let you do click-to-content in a natural/efficient way.

Hopefully this helps a bit.

Thanks
Joe

On Sun, Mar 20, 2016 at 11:48 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com> wrote:
> I apologize if this is spelled out somewhere in the documentation.
>
> There is a certain amount of fuzziness around the notion of a FlowFile. Is
> this really always a file? Or is it a "document" or an "item" which may have
> a link to an actual file / byte content, whether on disk or elsewhere? My
> noob-level understanding is that it's the latter - could someone confirm?
>
> Furthermore, when data is moving between Processors in a dataflow, how is
> that done? Is the data streamed in memory? Is there a spill-to-disk option
> to configure how disk spillage would be done? Or do FlowFiles always get
> written to disk prior to being sent to the next destination?
>
> I would think that persisting to disk after every step would be quite
> expensive. Is that simply not what NiFi does?
>
> Thanks.