[ 
https://issues.apache.org/jira/browse/NIFI-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14936853#comment-14936853
 ] 

Mark Payne commented on NIFI-1008:
----------------------------------

Removed Fix Version of 0.4.0. Looking into the implementation details, this 
could be extremely complex, because the FlowFile Repository must be updated 
atomically with all FlowFiles that are processed in a single session. To do 
this, we need all of the FlowFiles to be passed as a single Collection, which 
means holding them all in memory at commit time. Otherwise, if we restart in 
the middle of a session commit, some of the updates will have taken place but 
not all of them, leaving the repository with a partially applied session.

One possible solution is to modify the FlowFile Repository's definition to 
allow an Iterator to be passed instead of a Collection, and then we can 
implement an iterator that deserializes the objects as needed.
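
As a rough illustration of the iterator idea, here is a minimal Java sketch. 
It is not NiFi code: RecordSketch, LazyRecordIterator and the byte layout 
(a record count followed by id/queue pairs) are all hypothetical names and 
formats invented for this example. The point is only that each record is 
decoded when next() is called, so the caller never holds the full set in 
memory at once.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for a serialized FlowFile record (not NiFi's class). */
final class RecordSketch {
    final long id;
    final String queue;

    RecordSketch(long id, String queue) {
        this.id = id;
        this.queue = queue;
    }
}

/** Decodes one record per call to next() instead of loading them all up front. */
final class LazyRecordIterator implements Iterator<RecordSketch> {
    private final DataInputStream in;
    private int remaining;

    LazyRecordIterator(InputStream source) {
        this.in = new DataInputStream(source);
        try {
            // Assumed layout: the stream starts with the number of records.
            this.remaining = in.readInt();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean hasNext() {
        return remaining > 0;
    }

    @Override
    public RecordSketch next() {
        if (remaining <= 0) {
            throw new NoSuchElementException();
        }
        try {
            long id = in.readLong();
            String queue = in.readUTF();
            remaining--;
            return new RecordSketch(id, queue);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Helper for the example: writes records in the assumed layout. */
    static byte[] serialize(List<RecordSketch> records) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(records.size());
            for (RecordSketch r : records) {
                out.writeLong(r.id);
                out.writeUTF(r.queue);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

A repository method that accepted an Iterator<RecordSketch> could then 
consume the records one at a time while still treating the whole iteration 
as one update. How the real FlowFile Repository would keep that update 
atomic is the hard part and is not addressed by this sketch.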

> NiFi should swap out FlowFiles to disk even before the session is committed
> ---------------------------------------------------------------------------
>
>                 Key: NIFI-1008
>                 URL: https://issues.apache.org/jira/browse/NIFI-1008
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Core Framework
>            Reporter: Mark Payne
>
> Currently, NiFi will swap out FlowFiles if there are a large number in a 
> FlowFile Queue. This is done to avoid running out of JVM heap space. However, 
> if we have a simple flow like GetFile -> SplitText and GetFile pulls in a 
> large file, SplitText can quickly cause OutOfMemoryError. This is not because 
> it buffers the content of the FlowFile in memory but rather because it holds 
> the millions of FlowFile objects in memory. We can do better.
> When we call session.transfer for the FlowFiles, once we hit some 
> threshold (say 10,000), we should swap those FlowFiles to disk, and the 
> session should transfer them to the queue as "swapped out" FlowFiles, rather 
> than having to buffer all of these in memory and then swapping them out once 
> they land in the queue.
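
The threshold behavior described in the issue can be sketched in Java as 
follows. This is only an illustration under stated assumptions: 
SwappingTransferBuffer is a hypothetical class, not part of NiFi, FlowFiles 
are reduced to bare long ids, and the "swap file" is a plain temp file with 
one id per line. It deliberately ignores the atomic repository update 
discussed in the comment above.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical buffer that spills transferred ids to disk at a threshold. */
final class SwappingTransferBuffer {
    private final int threshold;
    private final List<Long> inMemory = new ArrayList<>();
    private final List<Path> swapFiles = new ArrayList<>();

    SwappingTransferBuffer(int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be >= 1");
        }
        this.threshold = threshold;
    }

    /** Records a transfer; spills the in-memory batch once it reaches the threshold. */
    void transfer(long flowFileId) {
        inMemory.add(flowFileId);
        if (inMemory.size() >= threshold) {
            swapOut();
        }
    }

    private void swapOut() {
        try {
            Path file = Files.createTempFile("swap-sketch-", ".txt");
            file.toFile().deleteOnExit();
            try (BufferedWriter writer = Files.newBufferedWriter(file)) {
                for (long id : inMemory) {
                    writer.write(Long.toString(id));
                    writer.newLine();
                }
            }
            swapFiles.add(file);
            inMemory.clear(); // only the file reference stays on the heap
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int inMemoryCount() {
        return inMemory.size();
    }

    List<Path> swapFiles() {
        return List.copyOf(swapFiles);
    }

    /** Reads back the ids stored in one swap file, for inspection. */
    List<Long> readSwapFile(int index) {
        try {
            List<Long> ids = new ArrayList<>();
            for (String line : Files.readAllLines(swapFiles.get(index))) {
                ids.add(Long.parseLong(line));
            }
            return ids;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a threshold of 3, transferring seven ids leaves two swap files on disk 
and a single id in memory, so heap usage is bounded by the threshold rather 
than by the number of FlowFiles the session produces.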



