[ 
https://issues.apache.org/jira/browse/NIFI-10887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Payne updated NIFI-10887:
------------------------------
    Description: 
When performing some tests with the ReplaceText processor, I found that it 
seemed to be quite a bit slower than I expected, especially when using a 
Replacement Strategy of "Literal Replace" and when using a lot of small 
FlowFiles.

As a result, I performed some profiling and identified a few areas that could 
use some improvement:
 * When using the Literal Replace strategy, we find matches using 
{{Pattern.compile(Pattern.quote(...));}} and then using 
{{Pattern.matcher(...).find()}}. This is quite inefficient compared to just 
using {{String.indexOf(...)}} and accounted for approximately 30% of the time 
spent in the processor.
 * A significant amount of time was spent flushing the write buffer, as it 
flushes to disk when finished writing to each individual FlowFile. Even when we 
set a Run Duration > 0 ms, we flush for each FlowFile. This flush() gets 
delegated all the way down to the FileOutputStream. However, when using 
ProcessSession.append(), we intercept this with a NonFlushableOutputStream. We 
should do this when calling ProcessSession.write() as well. While it makes 
sense to flush data from the Processor layer's buffer, there's no need to flush 
past the session layer until the session is committed.
 * A decent bit of time was spent in the session's get() method calling 
{{final Set<FlowFileRecord> set = 
unacknowledgedFlowFiles.computeIfAbsent(connection.getFlowFileQueue(), k -> new 
HashSet<>());}}. The time here was spent in StandardFlowFileQueue's 
hashCode() method, which is the JVM default. We can easily implement hashCode() 
to just return the hashCode of the identifier, which is a String. A String 
caches its hash code after the first computation, so later calls cost little 
more than the method call itself, which eliminates the expense here.
 * When using a Run Duration > 0 ms, we can hold InputStreams open by 
processing multiple FlowFiles in a given Session. This can also significantly 
improve performance. As such, we should make the default run duration 25 ms 
instead of 0 ms.
 * A common pattern with ReplaceText is to prepend text to the beginning of a 
FlowFile or line, and then use another ReplaceText to append text to the end 
of the FlowFile or line. We should have a "Surround" strategy that allows us 
to both prepend and append text in a single processor. This should roughly 
double the performance for this use case.
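
To illustrate the first point, here is a minimal sketch comparing the two
ways of finding literal matches. The class and method names are hypothetical
and this is not the actual ReplaceText code; it only assumes that the search
term is a non-empty literal and that the content fits in a String.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not the actual ReplaceText code.
final class LiteralReplaceSketch {

    // Regex-based approach described above: quote the search term, compile
    // it, and walk the matches with find().
    static String replaceWithRegex(String content, String search, String replacement) {
        Matcher matcher = Pattern.compile(Pattern.quote(search)).matcher(content);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (matcher.find()) {
            out.append(content, last, matcher.start()).append(replacement);
            last = matcher.end();
        }
        return out.append(content, last, content.length()).toString();
    }

    // indexOf-based approach: same result for non-empty literal search terms,
    // without compiling or running a regex.
    static String replaceWithIndexOf(String content, String search, String replacement) {
        if (search.isEmpty()) {
            return content; // assumption: an empty search term means "no change"
        }
        StringBuilder out = new StringBuilder(content.length());
        int last = 0;
        int idx;
        while ((idx = content.indexOf(search, last)) >= 0) {
            out.append(content, last, idx).append(replacement);
            last = idx + search.length();
        }
        return out.append(content, last, content.length()).toString();
    }

    public static void main(String[] args) {
        String content = "a.b.c";
        System.out.println(replaceWithRegex(content, ".", "-"));   // prints a-b-c
        System.out.println(replaceWithIndexOf(content, ".", "-")); // prints a-b-c
    }
}
```

Both methods insert the replacement literally, so characters such as {{$}}
in the replacement text are not treated as group references.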

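The hashCode() point can be sketched in the same spirit. QueueSketch is a
hypothetical stand-in, not the real StandardFlowFileQueue, and the map of
Strings only mimics the shape of the unacknowledgedFlowFiles lookup quoted
above.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a queue class; not the actual StandardFlowFileQueue.
final class QueueSketch {
    private final String identifier;

    QueueSketch(String identifier) {
        this.identifier = identifier;
    }

    // Delegate to the identifier's hash code. String caches its hash after the
    // first computation, so repeated map lookups avoid the default
    // Object.hashCode() path. equals() is left as the default identity
    // comparison, which remains consistent with this hashCode().
    @Override
    public int hashCode() {
        return identifier.hashCode();
    }

    public static void main(String[] args) {
        Map<QueueSketch, Set<String>> unacknowledged = new HashMap<>();
        QueueSketch queue = new QueueSketch("queue-1");
        unacknowledged.computeIfAbsent(queue, k -> new HashSet<>()).add("flowfile-1");
        unacknowledged.computeIfAbsent(queue, k -> new HashSet<>()).add("flowfile-2");
        System.out.println(unacknowledged.get(queue).size()); // prints 2
    }
}
```
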
  was:
When performing some tests with the ReplaceText processor, I found that it 
seemed to be quite a bit slower than I expected, especially when using a 
Replacement Strategy of "Literal Replace" and when using a lot of small 
FlowFiles.

As a result, I performed some profiling and identified a few areas that could 
use some improvement:
 * When using the Literal Replace strategy, we find matches using 
{{Pattern.compile(Pattern.quote(...));}} and then using 
{{Pattern.matcher(...).find()}}. This is quite inefficient compared to just 
using {{String.indexOf(...)}} and accounted for approximately 30% of the time 
spent in the processor.
 * A significant amount of time was spent flushing the write buffer, as it 
flushes to disk when finished writing to each individual FlowFile. Even when we 
set a Run Duration > 0 ms, we flush for each FlowFile. This flush() gets 
delegated all the way down to the FileOutputStream. However, when using 
ProcessSession.append(), we intercept this with a NonFlushableOutputStream. We 
should do this when calling ProcessSession.write() as well. While it makes 
sense to flush data from the Processor layer's buffer, there's no need to flush 
past the session layer until the session is committed.
 * A decent bit of time was spent in the session's get() method calling 
{{final Set<FlowFileRecord> set = 
unacknowledgedFlowFiles.computeIfAbsent(connection.getFlowFileQueue(), k -> new 
HashSet<>());}}. The time here was spent in StandardFlowFileQueue's 
hashCode() method, which is the JVM default. We can easily implement hashCode() 
to just return the hashCode of the identifier, which is a String. This is a 
pre-computed hashcode so provides constant time of 0 ms (with the exception of 
the method call itself) so eliminates the expense here.
 * When using a Run Duration > 0 ms, we can hold InputStreams open by 
processing multiple FlowFiles in a given Session. This can also significantly 
improve performance. As such, we should make the default run duration 25 ms 
instead of 0 ms.


> Improve Performance of ReplaceText processor
> --------------------------------------------
>
>                 Key: NIFI-10887
>                 URL: https://issues.apache.org/jira/browse/NIFI-10887
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Mark Payne
>            Assignee: Mark Payne
>            Priority: Major
>              Labels: performance



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
