[ 
https://issues.apache.org/jira/browse/NIFI-8107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Secules reassigned NIFI-8107:
----------------------------------

    Assignee:     (was: Eric Secules)

> ExtractText Should Search Entire FlowFile Using Streaming
> ---------------------------------------------------------
>
>                 Key: NIFI-8107
>                 URL: https://issues.apache.org/jira/browse/NIFI-8107
>             Project: Apache NiFi
>          Issue Type: New Feature
>            Reporter: Eric Secules
>            Priority: Major
>
> ExtractText should be improved so that it scans the entire content of the
> flowfile for matches, reading it in chunks of MAX_BUFFER_SIZE that overlap
> by MAX_CAPTURE_GROUP_LENGTH. That way, patterns can be extracted from files
> of arbitrary size while memory consumption stays bounded.
> Consider the use case of extracting a small pattern of maybe 100 bytes from
> files that could be 1MB or 500MB. Looking at the ExtractText source code, it
> always allocates a byte array of the maximum size, so it probably wouldn't
> be appropriate to set that parameter too high. The chunks must overlap by
> the maximum length of the capture group because a match may straddle two
> chunks. For the same reason, it's not advisable to split the flowfile into
> chunks of MAX_BUFFER_SIZE using existing processors.
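
The overlapping-chunk scan described above could look roughly like the
following. This is a minimal sketch in Python, not ExtractText's actual Java
code or any NiFi API: the function name scan_stream and its parameters are
hypothetical, with max_buffer_size and max_capture_length standing in for
MAX_BUFFER_SIZE and MAX_CAPTURE_GROUP_LENGTH. It assumes no match is longer
than max_capture_length; longer matches may be truncated or missed.

```python
import io
import re
from typing import BinaryIO, Iterator, Tuple


def scan_stream(stream: BinaryIO, pattern: bytes, max_buffer_size: int,
                max_capture_length: int) -> Iterator[Tuple[int, bytes]]:
    """Yield (absolute_offset, matched_bytes) for each match in the stream.

    Hypothetical helper: windows of up to max_buffer_size bytes overlap by
    max_capture_length bytes, so a match straddling two reads is still seen
    whole in the following window.
    """
    if not 0 < max_capture_length < max_buffer_size:
        raise ValueError("need 0 < max_capture_length < max_buffer_size")
    regex = re.compile(pattern)
    step = max_buffer_size - max_capture_length  # new bytes per window
    buffer = stream.read(max_buffer_size)
    base = 0        # absolute offset of buffer[0]
    resume_at = 0   # absolute offset just past the last reported match
    while buffer:
        nxt = stream.read(step)  # read ahead to learn whether this is the end
        at_eof = not nxt
        # Matches that start inside the trailing overlap might be cut off,
        # so they are deferred to the next window (unless this is the last).
        keep_from = max(0, len(buffer) - max_capture_length)
        limit = len(buffer) if at_eof else keep_from
        pos = max(0, resume_at - base)  # do not re-enter a reported match
        for m in regex.finditer(buffer, pos):
            if m.start() >= limit:
                break
            yield base + m.start(), m.group(0)
            resume_at = base + m.end()
        if at_eof:
            break
        # Slide the window: keep only the overlap, then append the new bytes.
        buffer = buffer[keep_from:] + nxt
        base += keep_from


if __name__ == "__main__":
    # The 8-byte match starts at offset 10 and crosses the 16-byte boundary.
    data = b"x" * 10 + b"ID=12345" + b"y" * 10
    print(list(scan_stream(io.BytesIO(data), rb"ID=\d+", 16, 8)))
    # prints [(10, b'ID=12345')]
```

Memory stays bounded by roughly two windows regardless of file size, which
is the point of the proposal. Splitting the content into disjoint chunks
would instead see only "ID=123" in the first chunk and "45" in the second.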



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
