Dmitry,

Good questions and things to think about. You've gone right to the heart of the framework. There are three repositories to consider, which I'll talk about here, though they are probably pretty well covered in the existing guides. I know Joe Percivall has a 'day in the life of a flow file' document that he is nearly ready to send out - that should help a lot!
Key terminology item: "flow file". This is the metadata about a 'thing' in the flow plus the content of that 'thing'. A flow file can represent an 'event', a 'record', a 'tuple', a 'bag of bits'. The bottom line is that a flow file holds context about the thing and the data of the thing. People coming from different frames of reference call these things by various names ('data', 'event', 'record', 'file', etc.), so keep that in mind.

So, three repositories to understand. I'll be intermingling the concept and the default implementation a bit here, so keep that in mind too.

1) Flow File Repository - This is where the 'fact of' a flow file lives. It holds things like the identifier of a flow file, its name, entry date, and map of attributes. The typical/default implementation is a write-ahead log which keeps track of the persistent state of these flow file objects. The key thing to realize is that this does *not* include the content of a flow file.

2) Content Repository - This is where the 'bytes' of the thing live. Say you have a flow file that is a JSON document. The actual JSON bytes live here; the 'name' of that thing and the things we know or have learned about it live in the flow file repo. In the content repository the default implementation is to persist the content to disk. No, we do not persist it every time a flow file moves from processor to processor - more on that in a minute. It also never needs to be fully in memory - more on that in a minute too. Disks these days, and things like caching in Linux, are awesome, and the content repository is designed to take great advantage of that. The repository design also lets us take great advantage of copy-on-write (only ever make a new version of something once it is manipulated) and pass-by-reference (never clone the bytes; make and pass pointers instead). In this sense you can think of the content repository (at least the default implementation) as an immutable, versioned content store. Very powerful and very fast.

3) Provenance Repository - This is where events about the data live as it comes into, goes through, and leaves the flow. These events form the truth of what happened; you can think of it as an index of events about what happened in the flow. It has information that looks a lot like what you'd see in the flow file repository (no content) plus some nice relationship data, so you don't have to trudge through log files anymore to figure out what happened. What is cool too is that it has pointers to content. This is how we can let you click on content at any point of its life in the flow. Doing some complex transformation? Use provenance to click on the content before and after that transformation event to prove it works. Flow not quite right yet? Use provenance to hit replay after you tweak the settings, and keep watching it evolve until you are sure it works. You can even do all this live in the flow on a dev copy (thanks again to copy-on-write/pass-by-reference), then when you are ready merge it in to be the production feed. In this way it is a lot like the mentality a developer has when using Git, just now with a fun UI.

Ok, so we've talked through a lot and I probably skipped major details. Feel free to ask as much as you want. The last thing I'll mention for now is that all the interaction a developer has with these repositories occurs through the ProcessSession abstraction.
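To make that concrete, here is a rough sketch of my own (not from any guide) of what a processor's onTrigger can look like when it streams content through a transform - GZIP compression in this case. The class name and the REL_SUCCESS relationship are just illustrative stand-ins; the session calls are the real NiFi API.

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.zip.GZIPOutputStream;

    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.AbstractProcessor;
    import org.apache.nifi.processor.ProcessContext;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.Relationship;
    import org.apache.nifi.processor.exception.ProcessException;
    import org.apache.nifi.processor.io.StreamCallback;
    import org.apache.nifi.stream.io.StreamUtils;

    // Hypothetical processor, just to show the ProcessSession pattern.
    public class GzipExample extends AbstractProcessor {

        static final Relationship REL_SUCCESS = new Relationship.Builder()
                .name("success").build();

        @Override
        public void onTrigger(ProcessContext context, ProcessSession session)
                throws ProcessException {
            FlowFile flowFile = session.get();  // just the metadata; content stays put
            if (flowFile == null) {
                return;
            }
            // write() hands us streams over the content. Because we are actually
            // changing bytes, copy-on-write kicks in and a new version is created
            // in the content repository; the old one stays until unreferenced.
            flowFile = session.write(flowFile, new StreamCallback() {
                @Override
                public void process(InputStream in, OutputStream rawOut) throws IOException {
                    try (GZIPOutputStream out = new GZIPOutputStream(rawOut)) {
                        StreamUtils.copy(in, out);  // fixed-size buffer, so a multi-GB
                                                    // object looks like 1 KB on the heap
                    }
                }
            });
            // Attribute changes land in the flow file repository, not the content repo.
            flowFile = session.putAttribute(flowFile, "mime.type", "application/gzip");
            session.transfer(flowFile, REL_SUCCESS);
        }
    }

Note that the content repository is only engaged here because write() actually changes bytes; if this processor only added attributes, no new content version would be created. And when the session commits, the framework records the corresponding provenance events for you.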
This is how you can build processors that interact with these data objects as streams and thus never need to load the whole content into memory. We can do things like compress or encrypt massive multi-GB objects and it never looks any different on the heap than a 1 KB message. That is because of the design of the ProcessSession and these repositories. Only when you manipulate content is the content repository engaged to make new versions, and it doesn't have to get rid of the previous version until nothing else references it. This is similar in concept to the design of the heap and garbage collection in Java. It just becomes a really nice model, and since we control when we age things off we can let you do click-to-content in a natural/efficient way.

Hopefully this helps a bit.

Thanks
Joe

On Sun, Mar 20, 2016 at 11:48 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com> wrote:
> I apologize if this is spelled out somewhere in the documentation.
>
> There is a certain amount of fuzziness around the notion of a FlowFile. Is
> this really always a file? Or is it a "document" or an "item" which may have
> a link to an actual file / byte content, whether on disk or elsewhere? My
> noob-level understanding is that it's the latter - could someone confirm?
>
> Furthermore, when data is moving between Processors in a dataflow, how is
> that done? Is the data streamed in memory? Is there a spill-to-disk option
> to configure how disk spillage would be done? Or do FlowFiles always get
> written to disk prior to being sent to the next destination?
>
> I would think that persisting to disk after every step would be quite
> expensive. Is that simply not what NiFi does?
>
> Thanks.