Paul Nahay on the wiki asked the following.  I've directed him here so
we can discuss as appropriate on the mailing lists.

1) What is the current state, and future plans, for NiFi transferring
large data sets?

2) How large can data attached to FlowFiles be, currently?

3) Whatever the answer to #2 above may be, are there any plans for
increasing this in the future?


Paul,

NiFi can handle very small and very large datasets very well today.
It has always been the case with NiFi that it reads data from its
content repository using input streams and writes content to its
content repository using output streams.  Because of this, developers
building processors can design components which pull in 1-byte objects
just as easily as they can pull in 1 GB objects (or much larger).  The
content never has to be held in memory in its entirety, except in
those cases where a given component is not coded to take advantage of
this.  Consider a processor which pulls data via SFTP, HTTP, or some
related protocol.  NiFi pulls data from such an endpoint by streaming
it over the network and writing it directly to disk.  We're never
holding that whole object in some byte[] in memory.  Similarly, when
large objects move from processor to processor we're just moving
pointers.  So all good there.  NiFi has been used this way for a very
long time.
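
To make that concrete, here is a minimal sketch of the streaming
callback pattern a processor author would use.  It assumes the
standard ProcessSession API from nifi-api; the class name and the
'success' relationship here are made up purely for illustration:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.Collections;
    import java.util.Set;

    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.AbstractProcessor;
    import org.apache.nifi.processor.ProcessContext;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.Relationship;
    import org.apache.nifi.processor.exception.ProcessException;
    import org.apache.nifi.processor.io.StreamCallback;

    // Illustrative processor: copies content from input stream to
    // output stream without ever buffering the whole object.
    public class StreamingCopyExample extends AbstractProcessor {

        static final Relationship REL_SUCCESS = new Relationship.Builder()
                .name("success")
                .build();

        @Override
        public Set<Relationship> getRelationships() {
            return Collections.singleton(REL_SUCCESS);
        }

        @Override
        public void onTrigger(final ProcessContext context,
                final ProcessSession session) throws ProcessException {
            FlowFile flowFile = session.get();
            if (flowFile == null) {
                return;
            }

            // The content repository hands us an InputStream and an
            // OutputStream.  We copy in small chunks, so memory use is
            // the same whether the content is 1 byte or many gigabytes.
            flowFile = session.write(flowFile, new StreamCallback() {
                @Override
                public void process(final InputStream in,
                        final OutputStream out) throws IOException {
                    final byte[] buffer = new byte[8192];
                    int len;
                    while ((len = in.read(buffer)) > -1) {
                        out.write(buffer, 0, len);
                    }
                }
            });

            session.transfer(flowFile, REL_SUCCESS);
        }
    }

The copy loop only ever holds an 8 KB buffer, regardless of how large
the FlowFile content is.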

In fact, NiFi even handles cases like merging content together very
well, for the same reason.  We can accumulate a bunch of source
content/flowfiles into a single massive output without ever holding
it all in memory at once.  This is a really nice property.
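
For illustration only, here is a hedged sketch of a merge done in that
same streaming style.  The real MergeContent processor is considerably
more capable (bins, demarcators, headers and footers); this fragment
just shows that the combined content is never buffered.  It assumes the
same imports as the sketch above, plus OutputStreamCallback and
InputStreamCallback from org.apache.nifi.processor.io and
java.util.List, and transfers to the same illustrative 'success'
relationship:

    // Fragment only: 'sources' is a list of FlowFiles already pulled
    // from the incoming queue on this session.
    private void mergeStreaming(final ProcessSession session,
            final List<FlowFile> sources) {
        FlowFile merged = session.create(sources);
        merged = session.write(merged, new OutputStreamCallback() {
            @Override
            public void process(final OutputStream out) throws IOException {
                for (final FlowFile source : sources) {
                    // Stream each source's content straight into the
                    // merged output; only a small copy buffer is held.
                    session.read(source, new InputStreamCallback() {
                        @Override
                        public void process(final InputStream in) throws IOException {
                            final byte[] buffer = new byte[8192];
                            int len;
                            while ((len = in.read(buffer)) > -1) {
                                out.write(buffer, 0, len);
                            }
                        }
                    });
                }
            }
        });
        session.transfer(merged, REL_SUCCESS);
        session.remove(sources);
    }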

Now, where can memory trouble happen?  It can happen when a large
number of flowfile objects (the metadata about them - not the content)
is held in memory at once.  These can accumulate substantially in
certain cases, splitting data for example.  Consider an input CSV file
with 1,000,000 lines.  One might want individual lines so they can run
specific per-event processing on each one.  This can be accomplished
in a couple of ways.

1) They can use SplitText and split directly into single lines.  We'll
almost surely run out of heap since we're making 1,000,000 flowfile
objects in a single session.  So avoid this option.

2) They can use SplitText first, splitting into, say, 1,000 lines at a
time.  This will produce 1,000 output flowfiles, each with 1,000 lines
in it.  They can then do another SplitText which splits to single
lines.  This way no single session will ever have more than 1,000
splits in it and things will work great.  This, combined with
backpressure and provenance, works extremely well.  (A rough sketch of
this flow follows the list.)

3) They can use the new record reader/writer processors, which support
really powerful and common patterns in a format- and schema-aware
manner and which make setup really easy and reusable.  In fact, with
this approach they often don't even need to split up the data, since
the record-oriented processors know how to demarcate events 'as they
exist in their source form'.  Huge performance gains and usability
improvements here.
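
As a rough sketch of option 2, the flow would look something like the
following (the 'Line Split Count' property name is from SplitText as I
recall it; verify against your NiFi version):

    source (1,000,000-line CSV)
      -> SplitText (Line Split Count = 1000)   -- 1,000 flowfiles of 1,000 lines each
      -> SplitText (Line Split Count = 1)      -- single lines, at most 1,000 splits per session
      -> per-event processing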

If you're finding challenges handling large datasets, let's talk
through the cases you're struggling with and see what can be done.

Thanks
