[ https://issues.apache.org/jira/browse/NIFI-10887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17677941#comment-17677941 ]
ASF subversion and git services commented on NIFI-10887:
--------------------------------------------------------

Commit d5c79fdcd1c806f6376c101f392e333c4a86b805 in nifi's branch refs/heads/dependabot/npm_and_yarn/nifi-registry/nifi-registry-core/nifi-registry-web-ui/src/main/decode-uri-component-0.2.2 from Mark Payne
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=d5c79fdcd1 ]

NIFI-10887: Addressed performance concerned. Use String.indexOf() instead of Pattern.matcher() when using Literal Replace. Use a NonFlushableOutputStream when ProcessSession.write() is called. Implemented hashCode() on AbstractConnection. Updated default Run Schedule on ReplaceText from 0 ms to 25 ms. Added a Surround Replacement strategy that allows both prepending and appending text. Updated unit tests to account for this.

Signed-off-by: Matthew Burgess <mattyb...@apache.org>

This closes #6724


> Improve Performance of ReplaceText processor
> --------------------------------------------
>
>                 Key: NIFI-10887
>                 URL: https://issues.apache.org/jira/browse/NIFI-10887
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Mark Payne
>            Assignee: Mark Payne
>            Priority: Major
>              Labels: performance
>             Fix For: 1.20.0
>
>         Attachments: ReplaceText-LiteralReplace-0msRunDuration.png,
> ReplaceText-LiteralReplace-25msRunDuration.png,
> ReplaceText-LiteralReplace-AfterChanges.png,
> ReplaceText-LiteralReplace-BeforeChanges.png,
> ReplaceText-RegexReplace-AfterChanges.png,
> ReplaceText-RegexReplace-BeforeChanges.png
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> When performing some tests with the ReplaceText processor, I found that it
> seemed to be quite a bit slower than I expected, especially when using a
> Replacement Strategy of "Literal Replace" and when using a lot of small
> FlowFiles.
> As a result, I performed some profiling and identified a few areas that
> could use some improvement:
> * When using the Literal Replace strategy, we find matches using
> {{Pattern.compile(Pattern.quote(...));}} and then using
> {{Pattern.matcher(...).find()}}. This is quite inefficient compared to just
> using {{String.indexOf(...)}} and accounted for approximately 30% of the
> time spent in the processor.
> * A significant amount of time was spent flushing the write buffer, as it
> flushes to disk when finished writing to each individual FlowFile. Even
> when we set a Run Duration > 0 ms, we flush for each FlowFile. This flush()
> gets delegated all the way down to the FileOutputStream. However, when
> using ProcessSession.append(), we intercept this with a
> NonFlushableOutputStream. We should do this when calling
> ProcessSession.write() as well. While it makes sense to flush data from the
> Processor layer's buffer, there's no need to flush past the session layer
> until the session is committed.
> * A fair amount of time was spent in the session's get() method calling
> {{final Set<FlowFileRecord> set =
> unacknowledgedFlowFiles.computeIfAbsent(connection.getFlowFileQueue(), k ->
> new HashSet<>());}}. The time here was spent in StandardFlowFileQueue's
> hashCode() method, which is the JVM default. We can easily implement
> hashCode() to just return the hashCode of the identifier, which is a
> String. A String caches its hash code after first computing it, so
> subsequent calls take effectively constant time (apart from the method call
> itself), which eliminates the expense here.
> * When using a Run Duration > 0 ms, we can hold InputStreams open by
> processing multiple FlowFiles in a given Session. This can also
> significantly improve performance. As such, we should make the default run
> duration 25 ms instead of 0 ms.
> * A common pattern with ReplaceText is to prepend text to the beginning of
> a FlowFile or line, and then use another ReplaceText to append text to the
> end of a FlowFile or line. We should have a strategy for "Surround" that
> allows us to both Prepend text and Append text. This will result in double
> the performance for this use case.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
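
For illustration, the first point in the issue description (finding literal matches with String.indexOf() rather than a quoted Pattern) can be sketched roughly as below. This is a minimal standalone sketch, not the code from the commit: the class name LiteralReplaceSketch and both method names are hypothetical, and it works on in-memory Strings, whereas the real processor operates on FlowFile content.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only; not part of NiFi.
final class LiteralReplaceSketch {

    // Regex-based approach described in the issue: quote the search text
    // and scan the input with Matcher.find().
    static String replaceWithPattern(String text, String search, String replacement) {
        Matcher matcher = Pattern.compile(Pattern.quote(search)).matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (matcher.find()) {
            out.append(text, last, matcher.start()).append(replacement);
            last = matcher.end();
        }
        return out.append(text, last, text.length()).toString();
    }

    // indexOf-based approach: same result for a non-empty search string,
    // without compiling or running a regular expression.
    static String replaceWithIndexOf(String text, String search, String replacement) {
        if (search.isEmpty()) {
            return text; // assumption: an empty search value means "no change"
        }
        int idx = text.indexOf(search);
        if (idx < 0) {
            return text; // no match, so avoid allocating a new buffer
        }
        StringBuilder out = new StringBuilder(text.length());
        int last = 0;
        while (idx >= 0) {
            out.append(text, last, idx).append(replacement);
            last = idx + search.length();
            idx = text.indexOf(search, last);
        }
        return out.append(text, last, text.length()).toString();
    }
}
```

Both methods scan left to right and skip past each match before searching again, so they should agree for any non-empty search value. The empty-search case is handled differently here on purpose (the regex version would match at every position), which is an assumption of this sketch rather than documented processor behavior.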