[ https://issues.apache.org/jira/browse/NIFI-8107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eric Secules reassigned NIFI-8107:
----------------------------------

    Assignee:     (was: Eric Secules)

> ExtractText Should Search Entire FlowFile Using Streaming
> ---------------------------------------------------------
>
>                 Key: NIFI-8107
>                 URL: https://issues.apache.org/jira/browse/NIFI-8107
>             Project: Apache NiFi
>          Issue Type: New Feature
>            Reporter: Eric Secules
>            Priority: Major
>
> There should be an improvement to ExtractText so that the entire content of
> the flowfile is scanned for matches in chunks of MAX_BUFFER_SIZE which
> overlap by MAX_CAPTURE_GROUP_LENGTH. That way we can do pattern extraction
> over arbitrary size files while keeping memory consumption limited.
>
> Consider the use case where I am looking to extract a small pattern of maybe
> 100 bytes from files that could be 1MB or 500MB. Looking at the ExtractText
> source code, it always allocates a byte array of the maximum size, so it
> probably wouldn't be appropriate to set that parameter too high. It's
> essential to have the chunks overlap by the maximum length of the capture
> group because the match may straddle two chunks. For the same reason it's not
> advisable to split the flowfile into chunks of MAX_BUFFER_SIZE using existing
> processors.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
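
The overlapping-chunk scan proposed in the issue could look roughly like the
sketch below. It is not NiFi code and does not use the ExtractText API: the
class ChunkedRegexScan, the method findFirst, and the parameters bufferSize and
overlap are hypothetical stand-ins for MAX_BUFFER_SIZE and
MAX_CAPTURE_GROUP_LENGTH. It also assumes a single-byte charset (ISO-8859-1) so
that a chunk boundary can never split a character, and that no match is longer
than overlap.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Apache NiFi.
class ChunkedRegexScan {

    /**
     * Scans the stream in chunks of at most bufferSize bytes, carrying the
     * last `overlap` bytes of each full chunk into the next one, and returns
     * the first match found. Only one bufferSize array is allocated.
     */
    static Optional<String> findFirst(InputStream in, Pattern pattern,
                                      int bufferSize, int overlap) throws IOException {
        if (overlap < 0 || overlap >= bufferSize) {
            throw new IllegalArgumentException("require 0 <= overlap < bufferSize");
        }
        byte[] buf = new byte[bufferSize];
        int carried = 0; // bytes kept from the tail of the previous chunk
        while (true) {
            int filled = carried;
            boolean eof = false;
            // Top the buffer up; read() may return fewer bytes than requested.
            while (filled < bufferSize) {
                int n = in.read(buf, filled, bufferSize - filled);
                if (n == -1) {
                    eof = true;
                    break;
                }
                filled += n;
            }
            // Assumption: one byte per character, so decoding a chunk is safe.
            String chunk = new String(buf, 0, filled, StandardCharsets.ISO_8859_1);
            Matcher m = pattern.matcher(chunk);
            if (m.find()) {
                return Optional.of(m.group());
            }
            if (eof) {
                return Optional.empty();
            }
            // Not at EOF, so the buffer is full: move its tail to the front.
            carried = overlap;
            System.arraycopy(buf, filled - carried, buf, 0, carried);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = ("xxxxxxxxxxxx" + "id=12345;" + "yyyyyyyyyyyyyyyyyyyy")
                .getBytes(StandardCharsets.ISO_8859_1);
        Pattern p = Pattern.compile("id=\\d{5};");
        // The 9-byte match straddles the first 16-byte chunk boundary.
        System.out.println(findFirst(new ByteArrayInputStream(data), p, 16, 10));
    }
}
```

In this example the match begins in the first 16-byte chunk and ends in the
second, so it is only found because 10 bytes are carried over; with an overlap
of 0 the same scan finds nothing. One caveat of this sketch: an open-ended
pattern such as \d+ can match a truncated prefix at the end of a chunk, so a
fuller version would probably need to inspect Matcher.hitEnd() before accepting
a match that touches the end of a non-final chunk.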