Paul Nahay asked the following on the wiki. I've directed him here so we can discuss on the mailing lists as appropriate.
1. What is the current state, and future plans, for NiFi transferring large data sets?
2. How large can data attached to FlowFiles be, currently?
3. Whatever the answer to #2 above may be, are there any plans for increasing this in the future?

Paul, NiFi handles both very small and very large datasets very well today. It has always been the case that NiFi reads data from its content repository using input streams and writes data to its content repository using output streams. Because of this, developers building processors can design components that pull in 1-byte objects just as easily as they can pull in 1 GB objects (or much larger). The content never has to be held in memory in whole, except in those cases where a given component is not coded to take advantage of this.

Consider a processor which pulls data via SFTP, HTTP, or some related protocol. When NiFi pulls data from such an endpoint, it streams the data over the network and writes it directly to disk; we never hold the whole object in some byte[] in memory. Similarly, when large objects move from processor to processor we're just moving pointers to the content. So all good there. NiFi has been used this way for a very long time. In fact, NiFi can even handle cases like merging content together very well for the same reason: we can accumulate a bunch of source content/flowfiles into a single massive output having never held it all in memory at once. This is a really nice property.

Now, where can memory trouble happen? It can happen when a large number of FlowFile objects (the metadata about them, not the content) are held in memory at once. These can accumulate substantially in certain cases, splitting data for example. Consider an input CSV file with 1,000,000 lines, where someone wants individual lines so they can run specific per-event processing on each. This can be accomplished in a few ways:

1) Use SplitText to split directly into single lines. We'll almost surely run out of heap since we're creating 1,000,000 FlowFile objects in a single session, so avoid this option.

2) Use SplitText first to split into, say, 1,000 lines at a time. This produces 1,000 output flowfiles, each with 1,000 lines in it. Then use a second SplitText which splits into single lines. This way no single session ever has more than 1,000 splits in it and things will work great. Combined with backpressure and provenance, this works extremely well.

3) Use the new record reader/writer processors, which support really powerful and common patterns in a format- and schema-aware manner and which make setup really easy and reusable. In many cases this approach means the data doesn't even need to be split up, since the record-oriented processors know how to demarcate events as they exist in their source form. There are huge performance and usability gains here.

If you're finding challenges handling large datasets, let's talk through the cases you're struggling with and see what can be done.

Thanks
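To make the streaming point above concrete, here is a rough sketch of how a processor body typically touches content through the ProcessSession. It is illustrative rather than a complete processor (the class name, relationship, and buffer size are just examples, and a real processor would also expose its relationships via getRelationships(), declare properties, etc.). The key idea is that the callback only ever sees streams backed by the content repository, so only a small buffer sits in heap regardless of how large the FlowFile content is.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;
import org.apache.nifi.processor.io.StreamCallback;

// Illustrative only: copies a FlowFile's content from input to output a small buffer at a time.
public class StreamingCopyExample extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success")
            .build();

    @Override
    public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }

        // Content is read and written as streams against the content repository,
        // so only this 8 KB buffer is ever resident in memory, whether the
        // FlowFile holds 1 byte or many gigabytes.
        flowFile = session.write(flowFile, new StreamCallback() {
            @Override
            public void process(final InputStream in, final OutputStream out) throws IOException {
                final byte[] buffer = new byte[8192];
                int len;
                while ((len = in.read(buffer)) != -1) {
                    out.write(buffer, 0, len);
                }
            }
        });

        session.transfer(flowFile, REL_SUCCESS);
    }
}

The same pattern is why the memory concern is about the number of FlowFile objects created per session (as in the split example above), not about the size of any individual FlowFile's content.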