markap14 commented on issue #3850: NIFI-6398 Added the 'replace first' and 'replace all' strategy to ReplaceText
URL: https://github.com/apache/nifi/pull/3850#issuecomment-547131105

@HorizonNet can you explain the difference between "Replace All" and "Regex"? I believe they are intended to accomplish the same thing, but the existing Regex strategy lets the user switch between Line-by-Line and Entire Text evaluation. Line-by-Line mode is preferred when the regex does not span multiple lines, because it uses dramatically less heap. It also allows for back references, etc. Does the new "Replace All" provide a capability that isn't currently supported and that I'm just missing?

The Replace First is a nice addition. We should ensure, though, that any time we create a `String` from a `byte[]` we pass the character set to the constructor. It looks like the character set is used when serializing the `String` back to a `byte[]`, but not when creating the `String` to begin with.

Finally, rather than using `IOUtils.toByteArray()`, I would recommend allocating a `byte[]` of the FlowFile's size and then using `StreamUtils.fillBuffer`, as is done in the `RegexReplace` strategy. Because we already know the size of the FlowFile, this is far more efficient: it doesn't have to keep filling a buffer, creating a new one, and copying bytes over. It can also cut the amount of heap used to store the buffered data by up to 50%, because the `ByteArrayOutputStream` doubles its buffer every time it runs out of space. For example, if the buffer is already 512 KB and we need one more byte, it creates a 1 MB buffer just to hold 512 KB + 1 byte, whereas a direct allocation takes exactly the right number of bytes.
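
To make the last two points of the comment concrete, here is a minimal sketch in plain Java. It is not the code from the PR or from NiFi: the class `ExactBufferRead`, the method `readContent`, and the local `fillBuffer` helper are hypothetical stand-ins written against the JDK only, and the content size is assumed to be known up front (as it is for a FlowFile). The sketch allocates one exactly sized `byte[]`, fills it from the stream, and decodes it with an explicit character set, which is then reused when encoding back to bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExactBufferRead {

    // Hypothetical stand-in for a buffer-filling utility: reads exactly
    // buffer.length bytes from the stream, or fails if the stream ends early.
    static void fillBuffer(InputStream in, byte[] buffer) throws IOException {
        int offset = 0;
        while (offset < buffer.length) {
            int n = in.read(buffer, offset, buffer.length - offset);
            if (n < 0) {
                throw new EOFException("Stream ended after " + offset
                        + " of " + buffer.length + " bytes");
            }
            offset += n;
        }
    }

    // Reads content of a known size into a single, exactly sized array
    // (no growing intermediate buffer) and decodes it with the given charset.
    static String readContent(InputStream in, long size, Charset charset) throws IOException {
        if (size < 0 || size > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("Unsupported content size: " + size);
        }
        byte[] buffer = new byte[(int) size];
        fillBuffer(in, buffer);
        // Pass the charset explicitly; new String(buffer) would silently
        // fall back to the platform default encoding.
        return new String(buffer, charset);
    }

    public static void main(String[] args) throws IOException {
        Charset charset = StandardCharsets.ISO_8859_1;
        byte[] original = "caf\u00e9 au lait".getBytes(charset);

        String text = readContent(new ByteArrayInputStream(original), original.length, charset);
        String updated = text.replaceFirst("caf\u00e9", "th\u00e9");

        // Encode with the same charset that was used for decoding.
        byte[] result = updated.getBytes(charset);
        System.out.println(updated + " (" + result.length + " bytes)");
    }
}
```

The same charset is used on both sides of the round trip, so content in a single-byte encoding such as ISO-8859-1 is not corrupted on a platform whose default is UTF-8. The size check is there only because a Java array cannot exceed `Integer.MAX_VALUE` elements; how the real processor handles very large content is outside this sketch.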