[jira] [Commented] (NIFI-10887) Improve Performance of ReplaceText processor

ASF subversion and git services (Jira) Tue, 17 Jan 2023 12:04:15 -0800


    [ 
https://issues.apache.org/jira/browse/NIFI-10887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17677941#comment-17677941
 ]


ASF subversion and git services commented on NIFI-10887:
--------------------------------------------------------

Commit d5c79fdcd1c806f6376c101f392e333c4a86b805 in nifi's branch 
refs/heads/dependabot/npm_and_yarn/nifi-registry/nifi-registry-core/nifi-registry-web-ui/src/main/decode-uri-component-0.2.2
 from Mark Payne
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=d5c79fdcd1 ]

NIFI-10887: Addressed performance concerns. Use String.indexOf() instead of 
Pattern.matcher() when using Literal Replace. Use a NonFlushableOutputStream 
when ProcessSession.write() is called. Implemented hashCode() on 
AbstractConnection. Updated default Run Schedule on ReplaceText from 0 ms to 25 
ms. Added a Surround Replacement strategy that allows both prepending and 
appending text. Updated unit tests to account for this.

Signed-off-by: Matthew Burgess <mattyb...@apache.org>

This closes #6724


> Improve Performance of ReplaceText processor
> --------------------------------------------
>
>                 Key: NIFI-10887
>                 URL: https://issues.apache.org/jira/browse/NIFI-10887
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Mark Payne
>            Assignee: Mark Payne
>            Priority: Major
>              Labels: performance
>             Fix For: 1.20.0
>
>         Attachments: ReplaceText-LiteralReplace-0msRunDuration.png, 
> ReplaceText-LiteralReplace-25msRunDuration.png, 
> ReplaceText-LiteralReplace-AfterChanges.png, 
> ReplaceText-LiteralReplace-BeforeChanges.png, 
> ReplaceText-RegexReplace-AfterChanges.png, 
> ReplaceText-RegexReplace-BeforeChanges.png
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> When performing some tests with the ReplaceText processor, I found that it 
> seemed to be quite a bit slower than I expected, especially when using a 
> Replacement Strategy of "Literal Replace" and when using a lot of small 
> FlowFiles.
> As a result, I performed some profiling and identified a few areas that could 
> use some improvement:
>  * When using the Literal Replace strategy, we find matches using 
> {{Pattern.compile(Pattern.quote(...));}} and then using 
> {{Pattern.matcher(...).find()}}. This is quite inefficient compared to 
> just using {{String.indexOf(...)}} and accounted for approximately 30% of the 
> time spent in the processor.
>  * A significant amount of time was spent flushing the write buffer, as it 
> flushes to disk when finished writing to each individual FlowFile. Even when 
> we set a Run Duration > 0 ms, we flush for each FlowFile. This flush() gets 
> delegated all the way down to the FileOutputStream. However, when using 
> ProcessSession.append(), we intercept this with a NonFlushableOutputStream. 
> We should do this when calling ProcessSession.write() as well. While it makes 
> sense to flush data from the Processor layer's buffer, there's no need to 
> flush past the session layer until the session is committed.
>  * A decent bit of time was spent in the session's get() method calling 
> {{final Set<FlowFileRecord> set = 
> unacknowledgedFlowFiles.computeIfAbsent(connection.getFlowFileQueue(), k -> 
> new HashSet<>());}}. The time here was spent in StandardFlowFileQueue's 
> hashCode() method, which is the JVM default. We can easily implement 
> hashCode() to just return the hashCode of the identifier, which is a String. 
> A String caches its hash code once computed, so this costs essentially 
> nothing beyond the method call itself and eliminates the expense here.
>  * When using a Run Duration > 0 ms, we can hold InputStreams open by 
> processing multiple FlowFiles in a given Session. This can also significantly 
> improve performance. As such, we should make the default run duration 25 ms 
> instead of 0 ms.
>  * A common pattern with ReplaceText is to prepend text to the beginning of a 
> FlowFile, or line, and then use another ReplaceText to append text to the end 
> of a FlowFile, or line. We should have a "Surround" strategy that allows 
> us to both Prepend text and Append text in a single processor. This should 
> roughly double the performance for this use case.
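
For readers who want to see the Literal Replace point concretely, here is a minimal sketch of the two approaches side by side. It is not the code from the commit: the class LiteralReplaceSketch and both method names are hypothetical, and treating an empty search value as "no change" is an assumption made only to keep the two methods equivalent.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not NiFi's ReplaceText implementation.
final class LiteralReplaceSketch {

    // Regex-based approach described in the issue: quote the search value, then scan with a Matcher.
    static String replaceWithPattern(String input, String search, String replacement) {
        if (search.isEmpty()) {
            return input; // assumption: an empty search value leaves the input unchanged
        }
        Pattern pattern = Pattern.compile(Pattern.quote(search));
        Matcher matcher = pattern.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (matcher.find()) {
            out.append(input, last, matcher.start()).append(replacement);
            last = matcher.end();
        }
        return out.append(input, last, input.length()).toString();
    }

    // indexOf-based approach: the same result without going through the regex engine.
    static String replaceWithIndexOf(String input, String search, String replacement) {
        if (search.isEmpty()) {
            return input; // same assumption as above
        }
        StringBuilder out = new StringBuilder();
        int last = 0;
        int idx;
        while ((idx = input.indexOf(search, last)) >= 0) {
            out.append(input, last, idx).append(replacement);
            last = idx + search.length();
        }
        return out.append(input, last, input.length()).toString();
    }
}
```

Both methods should return the same string for any non-empty search value; the sketch says nothing about the size of the speedup, which depends on the content and the JVM.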

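The flush interception described in the issue can be sketched roughly as follows. NoFlushOutputStreamSketch is a hypothetical stand-in, and NiFi's own NonFlushableOutputStream may differ in detail; FlushCountingStream exists only so the effect can be observed.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical wrapper for illustration; not NiFi's NonFlushableOutputStream.
final class NoFlushOutputStreamSketch extends FilterOutputStream {

    NoFlushOutputStreamSketch(OutputStream out) {
        super(out);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len); // pass bulk writes straight through instead of byte by byte
    }

    @Override
    public void flush() {
        // Intentionally a no-op: whoever owns the underlying stream decides when to flush.
    }
}

// Hypothetical helper that counts how many flush() calls reach the underlying stream.
final class FlushCountingStream extends ByteArrayOutputStream {
    int flushCount = 0;

    @Override
    public void flush() throws IOException {
        flushCount++;
        super.flush();
    }
}
```

Writing through the wrapper and then calling flush() on it leaves flushCount at zero, while the written bytes still arrive in the underlying stream. In the issue's terms, the session layer would flush the real stream once at commit time rather than once per FlowFile.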

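The hashCode() suggestion could look something like the sketch below. IdentifiedQueueSketch is a hypothetical class, not StandardFlowFileQueue or AbstractConnection, and the identifier-based equals() is an assumption added so that equals() and hashCode() stay consistent; String stands in for FlowFileRecord.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Hypothetical class for illustration; not NiFi's StandardFlowFileQueue or AbstractConnection.
final class IdentifiedQueueSketch {
    private final String identifier;

    IdentifiedQueueSketch(String identifier) {
        this.identifier = Objects.requireNonNull(identifier);
    }

    @Override
    public boolean equals(Object other) {
        // Assumption: two queues are equal when their identifiers are equal.
        return other instanceof IdentifiedQueueSketch
                && identifier.equals(((IdentifiedQueueSketch) other).identifier);
    }

    @Override
    public int hashCode() {
        return identifier.hashCode(); // String caches its hash after the first computation
    }

    // Mirrors the computeIfAbsent lookup quoted in the issue, with String standing in for FlowFileRecord.
    static Set<String> lookup(Map<IdentifiedQueueSketch, Set<String>> map, IdentifiedQueueSketch key) {
        return map.computeIfAbsent(key, k -> new HashSet<>());
    }
}
```

With this in place, every lookup for the same identifier lands on the same Set, and the hash computation reduces to reading a cached int from the String.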

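As an illustration of why a "Surround" strategy saves work, the sketch below prepends and appends in a single pass over the content. SurroundSketch and its method signatures are hypothetical and say nothing about how ReplaceText implements the strategy; UTF-8 is assumed for the text values.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not NiFi's ReplaceText implementation.
final class SurroundSketch {

    // One pass: write the prefix, copy the content through, write the suffix.
    static void surround(InputStream in, OutputStream out, String prepend, String append) throws IOException {
        out.write(prepend.getBytes(StandardCharsets.UTF_8)); // assumption: UTF-8 text
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        out.write(append.getBytes(StandardCharsets.UTF_8));
    }

    // Convenience wrapper for in-memory text.
    static String surround(String content, String prepend, String append) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            surround(new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8)), out, prepend, append);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Chaining a prepend step and an append step would read and write the content twice; here it is read and written once.
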
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
