[jira] [Updated] (NIFI-10887) Improve Performance of ReplaceText processor

Mark Payne (Jira) Mon, 28 Nov 2022 10:45:32 -0800


     [ 
https://issues.apache.org/jira/browse/NIFI-10887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Mark Payne updated NIFI-10887:
------------------------------
    Attachment: InferSchema-AfterChanges.png
                InferSchema-BeforeChanges.png

> Improve Performance of ReplaceText processor
> --------------------------------------------
>
>                 Key: NIFI-10887
>                 URL: https://issues.apache.org/jira/browse/NIFI-10887
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Mark Payne
>            Assignee: Mark Payne
>            Priority: Major
>              Labels: performance
>             Fix For: 1.20.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> When performing some tests with the ReplaceText processor, I found that it 
> seemed to be quite a bit slower than I expected, especially when using a 
> Replacement Strategy of "Literal Replace" and when using a lot of small 
> FlowFiles.
> As a result, I performed some profiling and identified a few areas that could 
> use some improvement:
>  * When using the Literal Replace strategy, we find matches using 
> {{Pattern.compile(Pattern.quote(...))}} and then 
> {{Pattern.matcher(...).find()}}. This is quite inefficient compared to 
> just using {{String.indexOf(...)}} and accounted for approximately 30% of the 
> time spent in the processor.
>  * A significant amount of time was spent flushing the write buffer, as it 
> flushes to disk when finished writing to each individual FlowFile. Even when 
> we set a Run Duration > 0 ms, we flush for each FlowFile. This flush() gets 
> delegated all the way down to the FileOutputStream. However, when using 
> ProcessSession.append(), we intercept this with a NonFlushableOutputStream. 
> We should do this when calling ProcessSession.write() as well. While it makes 
> sense to flush data from the Processor layer's buffer, there's no need to 
> flush past the session layer until the session is committed.
>  * A decent bit of time was spent in the session's get() method calling 
> {{final Set<FlowFileRecord> set = 
> unacknowledgedFlowFiles.computeIfAbsent(connection.getFlowFileQueue(), k -> 
> new HashSet<>());}}. The time here was spent in StandardFlowFileQueue's 
> hashCode() method, which is the JVM default. We can easily implement 
> hashCode() to just return the hashCode of the identifier, which is a String. 
> Because String caches its hash code after the first computation, this is 
> effectively free (aside from the method call itself) and eliminates the 
> expense here.
>  * When using a Run Duration > 0 ms, we can hold InputStreams open by 
> processing multiple FlowFiles in a given Session. This can also significantly 
> improve performance. As such, we should make the default run duration 25 ms 
> instead of 0 ms.
>  * A common pattern with ReplaceText is to prepend text to the beginning of a 
> FlowFile or line, and then use another ReplaceText to append text to the end 
> of a FlowFile or line. We should have a "Surround" strategy that allows 
> us to both prepend and append text in a single processor. This will result in 
> double the performance for this use case.
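
As a rough illustration of the Literal Replace point above, the sketch below puts an indexOf-based replacement next to the quoted-Pattern approach. The class and method names are hypothetical and are not taken from the NiFi code base; it only shows that both approaches produce the same output for a literal search value, with the indexOf variant avoiding the regex engine entirely.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not NiFi's actual implementation.
final class LiteralReplaceSketch {

    private LiteralReplaceSketch() {
    }

    // Replaces every occurrence of a literal search value by scanning with String.indexOf.
    static String replaceWithIndexOf(final String content, final String search, final String replacement) {
        if (search.isEmpty()) {
            return content; // assumption: an empty search value is treated as "no match"
        }

        int matchIndex = content.indexOf(search);
        if (matchIndex < 0) {
            return content; // no match, so no copy is needed
        }

        final StringBuilder sb = new StringBuilder(content.length());
        int copyFrom = 0;
        while (matchIndex >= 0) {
            // Copy the text before the match, then the replacement.
            sb.append(content, copyFrom, matchIndex).append(replacement);
            copyFrom = matchIndex + search.length();
            matchIndex = content.indexOf(search, copyFrom);
        }
        sb.append(content, copyFrom, content.length());
        return sb.toString();
    }

    // The regex-based equivalent described in the issue, kept for comparison.
    static String replaceWithQuotedPattern(final String content, final String search, final String replacement) {
        final Pattern pattern = Pattern.compile(Pattern.quote(search));
        final Matcher matcher = pattern.matcher(content);

        final StringBuilder sb = new StringBuilder(content.length());
        int copyFrom = 0;
        while (matcher.find()) {
            sb.append(content, copyFrom, matcher.start()).append(replacement);
            copyFrom = matcher.end();
        }
        sb.append(content, copyFrom, content.length());
        return sb.toString();
    }
}
```

Both methods treat the search value literally, so a search for "a.b" does not match "axb". The sketch operates on a String already in memory; how the processor buffers FlowFile content is outside its scope.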
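
The hashCode() suggestion above can be sketched in a similar way. QueueSketch below is a hypothetical stand-in for a queue class keyed by a String identifier; the real StandardFlowFileQueue carries far more state, and the track() helper is invented purely to mirror the computeIfAbsent call quoted in the issue.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a queue with a String identifier; not the real NiFi class.
final class QueueSketch {
    private final String identifier;

    QueueSketch(final String identifier) {
        this.identifier = identifier;
    }

    String getIdentifier() {
        return identifier;
    }

    // Delegates to the identifier's hash code. String caches its hash after the
    // first computation, so repeated calls do not rescan the characters.
    @Override
    public int hashCode() {
        return identifier.hashCode();
    }

    // Hypothetical helper mirroring the computeIfAbsent call in the issue:
    // records a FlowFile id against the queue it was pulled from.
    static Set<String> track(final Map<QueueSketch, Set<String>> unacknowledged,
                             final QueueSketch queue, final String flowFileId) {
        final Set<String> ids = unacknowledged.computeIfAbsent(queue, k -> new HashSet<>());
        ids.add(flowFileId);
        return ids;
    }
}
```

equals() is deliberately left as the default identity comparison in this sketch. That still satisfies the hashCode contract, since an object that equals itself always reports the same hash.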



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
