RE: [EXT] Re: Duplicate flow files without their content

Peter Wicks (pwicks) Wed, 31 Jul 2019 10:54:08 -0700

Lars,

If you are worried about it, using ReplaceText will have the same effect as 
your custom solution. When ReplaceText has its `Replacement Strategy` set to 
`Always Replace`, it doesn't read the contents of the FlowFile and simply writes 
out the replacement value, which in your case could be an empty string.

Thanks,
  Peter

From: Lars Winderling <lars.winderl...@posteo.de>
Sent: Wednesday, July 31, 2019 11:02 AM
To: dev@nifi.apache.org
Subject: [EXT] Re: Duplicate flow files *without* their content

Hi Edward,

Thank you for your input. I didn't know about the copy-on-write semantics; 
that's really useful. I'll check out the in-depth guide for sure!
In my case, the content of the flow file does change heavily from one processor 
to the next one, so I doubt copy-on-write would help here.

Best,
Lars

On Wed, 2019-07-31 at 12:13 +0100, Edward Armes wrote:

Hi Lars,

In short, depending on how a FlowFile is duplicated, the content
shouldn't be duplicated as well.

In general, content is only duplicated when it is deemed to have
changed (copy-on-write semantics). For the most part (unless a FlowFile has
a large number of attributes) a FlowFile is actually quite small and
therefore the waste is minimal, which is why they can be held in memory and
passed through a flow.

The best way to branch/clone a FlowFile is to add another output from the
processor you want to log the output from, and the framework that surrounds
a processor will handle the rest. This does create a duplicate FlowFile but
doesn't create a copy of the content. In the provenance repository this is
marked as a CLONE event for the original FlowFile, and the new FlowFile gets
treated as its own unique FlowFile with a reference to the original
content.
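
To make the clone behaviour described above concrete, here is a toy model in Python. It is not NiFi code: the names `ContentClaim`, `FlowFile`, `clone` and `write` are made up for illustration, and the real framework does considerably more bookkeeping. It only shows that a clone can share the content reference until one side writes new content.

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)  # hypothetical id generator, for illustration only


@dataclass(frozen=True)
class ContentClaim:
    """Immutable stand-in for content stored once in a repository."""
    data: bytes


@dataclass
class FlowFile:
    attributes: dict
    claim: ContentClaim
    id: int = field(default_factory=lambda: next(_ids))


def clone(ff: FlowFile) -> FlowFile:
    """Copy the attributes but share the content claim; no bytes are copied."""
    return FlowFile(dict(ff.attributes), ff.claim)


def write(ff: FlowFile, data: bytes) -> None:
    """Copy-on-write: point this FlowFile at a new claim, leaving others alone."""
    ff.claim = ContentClaim(data)


original = FlowFile({"filename": "big.gz"}, ContentClaim(b"x" * 1024))
copy = clone(original)
shares_content = copy.claim is original.claim  # same object before any write

write(copy, b"")  # only the clone now refers to different (empty) content
```

After the `write` call the original still refers to its 1024 bytes, while the clone refers to a new, empty claim.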

This is quite a short explanation; a better and more in-depth
explanation can be found here, and I think it covers all the scenarios
you're thinking about:
https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html

Edward

On Wed, Jul 31, 2019 at 11:47 AM Lars Winderling <lars.winderl...@posteo.de> wrote:

Dear NiFi community,

I often face the use case where I import flow files with content on the
order of 1 GB or 10 GB, already compressed.

Let's say I need to branch off of a flow where the actual flow file should
be processed further, and on some side branch I just want to do some kind
of logging or whatever without accessing the flow file's contents. Thus
it's clearly wasteful to duplicate the flow file including content.

For this case I wrote a processor defining 2 relationships: "original" and
"attributes only", so the flow file attributes can be accessed separately
from the content.
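
A rough sketch of the idea, written in Python rather than against NiFi's Java API. The function name `fork` and the `(attributes, content)` tuple layout are illustrative assumptions only, not the actual processor:

```python
def fork(attributes: dict, content: bytes) -> dict:
    """Route one flow file to two relationships.

    "original" keeps attributes and content; "attributes only" gets a copy
    of the attributes with empty content, so the payload is never touched.
    """
    return {
        "original": (attributes, content),
        "attributes only": (dict(attributes), b""),
    }


payload = b"\x1f\x8b compressed payload"  # stand-in for a large compressed file
routed = fork({"filename": "dump.gz"}, payload)
side_attrs, side_content = routed["attributes only"]
```

The side branch can then log `side_attrs` without ever reading `payload`.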

I will gladly prepare a PR if anyone finds that worth incorporating into
NiFi.

The only remaining question for me would be: use an individual processor to
that end, or add it to e.g. the DuplicateFlowFile processor? The former
seems cleaner to me. A proposed name would be something like ForkProcessor
(no better idea yet).

Thanks in advance!

Best,
Lars
