Re: Duplicate flow files without their content

Edward Armes Wed, 31 Jul 2019 04:14:31 -0700

Hi Lars,

In short, depending on how a FlowFile is duplicated, the content
shouldn't be duplicated as well.

In general, content is only duplicated when it has been deemed to have been
changed (copy-on-write semantics). For the most part (unless a FlowFile has
a large number of attributes) a FlowFile is actually quite small, so the
waste is minimal, which is why FlowFiles can be held in memory and passed
through a flow.

The best way to branch/clone a FlowFile is to add another output from the
processor you want to log the output from; the framework that surrounds
a Processor will handle the rest. This does create a duplicate FlowFile but
doesn't create a copy of the content. In the provenance repository this is
marked as a CLONE event for the original FlowFile, and the new FlowFile gets
treated as its own unique FlowFile with a reference to the original
content.
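
To make the idea concrete, here is a toy sketch of the behaviour described
above. It is not NiFi's actual implementation: ContentClaim, ToyFlowFile and
CloneSketch are hypothetical names invented for illustration, and the real
framework is considerably more involved. It only shows that a clone gets a
new identity and its own attributes while pointing at the same content, and
that a new piece of content is allocated only when something writes.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical stand-in for a piece of data held in the content repository.
final class ContentClaim {
    final byte[] bytes;
    ContentClaim(byte[] bytes) { this.bytes = bytes; }
}

// Hypothetical stand-in for a FlowFile: attributes plus a pointer to content.
final class ToyFlowFile {
    final String uuid;
    final Map<String, String> attributes;
    final ContentClaim claim; // a reference to the content, not the content itself

    ToyFlowFile(Map<String, String> attributes, ContentClaim claim) {
        this(UUID.randomUUID().toString(), attributes, claim);
    }

    private ToyFlowFile(String uuid, Map<String, String> attributes, ContentClaim claim) {
        this.uuid = uuid;
        this.attributes = new HashMap<>(attributes); // attributes are copied
        this.claim = claim;
    }

    // Clone: new UUID, copied attributes, SAME content claim (no bytes copied).
    ToyFlowFile cloneFlowFile() {
        return new ToyFlowFile(attributes, claim);
    }

    // Write: the content changes, so a new claim is allocated (copy-on-write).
    // The original claim is left untouched for anyone still pointing at it.
    ToyFlowFile write(byte[] newContent) {
        return new ToyFlowFile(uuid, attributes, new ContentClaim(newContent));
    }
}

final class CloneSketch {
    public static void main(String[] args) {
        ToyFlowFile original = new ToyFlowFile(
                Collections.singletonMap("filename", "big.gz"),
                new ContentClaim(new byte[1024]));

        ToyFlowFile copy = original.cloneFlowFile();
        System.out.println("clone shares content: " + (copy.claim == original.claim));

        ToyFlowFile modified = copy.write(new byte[] {1, 2, 3});
        System.out.println("modified shares content: " + (modified.claim == original.claim));
    }
}
```

Inside a custom processor, the corresponding explicit call would be
ProcessSession's clone(FlowFile) method, but for a simple side branch you
normally don't need to write any code at all.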

This is quite a short explanation; a better and more in-depth explanation
can be found here, and I think it covers all the scenarios you're thinking
about:
https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html.

Edward

On Wed, Jul 31, 2019 at 11:47 AM Lars Winderling <lars.winderl...@posteo.de>
wrote:

> Dear NiFi community,
>
> I often face the use-case where I import flow files with content of order
> O(1gb) or O(10gb) – already compressed.
> Let's say I need to branch off of a flow where the actual flow file should
> be processed further, and on some side branch I want just to do some kind
> of logging or whatever without accessing the flow file's contents. Thus
> it's clearly wasteful to duplicate the flow file including content.
> For this case I wrote a processor defining 2 relationships: "original" and
> "attributes only", so the flow file attributes can be accessed separately
> from the content.
> I will gladly prepare a PR if anyone finds that worth incorporating into
> NiFi.
> Only remaining question for me would be: use an individual processor to
> that end, or add it to e.g. the DuplicateFlowFile processor. The former
> seems cleaner to me. Proposed names would be something like ForkProcessor
> (no better idea yet).
>
> Thanks in advance!
> Best,
> Lars
>
