[ https://issues.apache.org/jira/browse/NIFI-10887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mark Payne updated NIFI-10887:
------------------------------
    Attachment: InferSchema-AfterChanges.png
                InferSchema-BeforeChanges.png

> Improve Performance of ReplaceText processor
> --------------------------------------------
>
>                 Key: NIFI-10887
>                 URL: https://issues.apache.org/jira/browse/NIFI-10887
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Mark Payne
>            Assignee: Mark Payne
>            Priority: Major
>              Labels: performance
>             Fix For: 1.20.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> When performing some tests with the ReplaceText processor, I found that it
> was quite a bit slower than I expected, especially when using a Replacement
> Strategy of "Literal Replace" with a lot of small FlowFiles.
> As a result, I performed some profiling and identified a few areas that
> could use some improvement:
> * When using the Literal Replace strategy, we find matches using
> {{Pattern.compile(Pattern.quote(...));}} and then
> {{Pattern.matcher(...).find()}}. This is quite inefficient compared to just
> using {{String.indexOf(...)}} and accounted for approximately 30% of the
> time spent in the processor.
> * A significant amount of time was spent flushing the write buffer, as it
> flushes to disk when finished writing to each individual FlowFile. Even
> when we set a Run Duration > 0 ms, we flush for each FlowFile. This flush()
> gets delegated all the way down to the FileOutputStream. However, when
> using ProcessSession.append(), we intercept this with a
> NonFlushableOutputStream. We should do the same when calling
> ProcessSession.write(). While it makes sense to flush data from the
> Processor layer's buffer, there's no need to flush past the session layer
> until the session is committed.
> * A decent bit of time was spent in the session's get() method calling
> {{final Set<FlowFileRecord> set =
> unacknowledgedFlowFiles.computeIfAbsent(connection.getFlowFileQueue(), k ->
> new HashSet<>());}}. The time here was spent in StandardFlowFileQueue's
> hashCode() method, which is the JVM default. We can easily implement
> hashCode() to just return the hashCode of the identifier, which is a
> String. A String caches its hash code once computed, so the call costs
> essentially nothing beyond the method invocation itself, which eliminates
> the expense here.
> * When using a Run Duration > 0 ms, we can hold InputStreams open by
> processing multiple FlowFiles in a given Session. This can also
> significantly improve performance. As such, we should make the default Run
> Duration 25 ms instead of 0 ms.
> * A common pattern with ReplaceText is to prepend text to the beginning of
> a FlowFile or line, and then use another ReplaceText to append text to the
> end of the FlowFile or line. We should have a "Surround" strategy that
> allows us to both prepend and append text. This will result in double the
> performance for this use case.
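
To make the first bullet concrete, here is a minimal sketch of the two ways of finding literal matches that the issue compares. It is not the ReplaceText implementation: the class name `LiteralReplaceSketch` and both method names are hypothetical, and the sketch works on an in-memory String rather than on FlowFile content. Only `Pattern.compile`, `Pattern.quote`, `Matcher.find` and `String.indexOf` come from the issue text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; this is NOT NiFi's ReplaceText code.
class LiteralReplaceSketch {

    // Regex-based search as described in the issue: quote the literal, then scan with find().
    static String replaceWithRegex(String content, String search, String replacement) {
        if (search.isEmpty()) {
            return content; // assumption: an empty search value is treated as "nothing to replace"
        }
        Matcher matcher = Pattern.compile(Pattern.quote(search)).matcher(content);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (matcher.find()) {
            // Copy the text between the previous match and this one, then the replacement.
            out.append(content, last, matcher.start()).append(replacement);
            last = matcher.end();
        }
        return out.append(content, last, content.length()).toString();
    }

    // Plain substring search: same result for literal values, without the regex engine.
    static String replaceWithIndexOf(String content, String search, String replacement) {
        if (search.isEmpty()) {
            return content; // also prevents an endless loop, since indexOf("") always matches
        }
        StringBuilder out = new StringBuilder();
        int last = 0;
        int idx;
        while ((idx = content.indexOf(search, last)) >= 0) {
            out.append(content, last, idx).append(replacement);
            last = idx + search.length();
        }
        return out.append(content, last, content.length()).toString();
    }

    public static void main(String[] args) {
        String input = "a.b.a.c";
        // Both calls treat "a." literally, so the '.' is not a regex wildcard in either case.
        System.out.println(replaceWithRegex(input, "a.", "X"));
        System.out.println(replaceWithIndexOf(input, "a.", "X"));
    }
}
```

Both methods should produce the same output for any non-empty literal search value. The difference is only in how the match positions are found, which is the cost the profiling above points at.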