Eric

I believe the most appropriate option available for now is to figure out
what the maximum number of outstanding in-process buckets could be and set
that.  Hopefully there is enough memory in the system to handle that worst
case in terms of the flowfile attributes being held in memory.
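
For example, the relevant MergeContent properties would be along these
lines (the values are only placeholders you'd have to size for your own
worst-case load):

    Merge Strategy           = Defragment
    Maximum number of Bins   = <worst-case number of fragment groups
                                in flight at once>
    Max Bin Age              = 10 min   <- safety net only

With Defragment, each bin is keyed by the fragment.identifier attribute
and completes once fragment.count entries have arrived, so the bin count
is what has to cover your peak concurrency.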

What should then be considered is how to really support what you want here:
a reasonable cap on the number of buckets, with the processor behaving
differently so that it only kicks buckets out when they expire.
MergeContent supports some pretty powerful/complex cases and I think this
is one that potentially needs to be included.  The 'harder' part
historically has been that we could keep reading off the incoming flowfile
queue to see whether any bucket in the list could take a flowfile, find
that nothing fits, and put it right back in the queue, likely right back on
top, so we'd end up doing nothing.  However, I think we could revisit that
now and, for instance, use a FIFO queue to overcome it (see the sketch
below).  It just needs a JIRA and some analysis to get over the real hump
here.
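
To make the FIFO idea a bit more concrete, here is a rough standalone
sketch in plain Java (not actual NiFi/MergeContent code; the Item/Bin
types, MAX_BINS, and MAX_BIN_AGE_MS are illustrative assumptions).  The
point is simply that a flowfile which fits no open bucket goes back on the
tail of the queue rather than the head, and buckets are only forced out
once they expire:

import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Rough standalone illustration only -- not NiFi code.
class FifoBinningSketch {
    record Item(String fragmentId) {}
    static class Bin { final long createdAt; int count; Bin(long t) { createdAt = t; } }

    static final int MAX_BINS = 100;            // assumed cap on open bins
    static final long MAX_BIN_AGE_MS = 60_000;  // bins only leave when they expire

    final ArrayDeque<Item> incoming = new ArrayDeque<>();  // FIFO incoming queue
    final Map<String, Bin> bins = new HashMap<>();

    void onTrigger(long now) {
        final Item item = incoming.pollFirst();
        if (item != null) {
            final Bin bin = bins.get(item.fragmentId());
            if (bin == null && bins.size() >= MAX_BINS) {
                // No room for a new bin: do NOT force the oldest bin out.
                // Re-queue at the tail so we don't spin on the same head item.
                incoming.addLast(item);
            } else {
                bins.computeIfAbsent(item.fragmentId(), k -> new Bin(now)).count++;
            }
        }
        // Buckets are only kicked out once they pass their max age
        // (completed bins would also merge and leave; omitted here).
        for (Iterator<Bin> it = bins.values().iterator(); it.hasNext(); ) {
            if (now - it.next().createdAt > MAX_BIN_AGE_MS) {
                it.remove();   // real MergeContent would route these to 'failure'
            }
        }
    }
}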

Thanks


On Fri, Jul 31, 2020 at 8:16 AM Eric Secules <esecu...@gmail.com> wrote:

> Is it possible to surround the merge content processor with a wait/notify
> to enforce that only <numBuckets> unique fragment identifiers are allowed
> into the merge process at one time? I'd rather have the merge processor
> only force buckets out based on time. If there's contention for buckets I'd
> rather the incoming flowfiles wait and then only expire existing buckets
> after a timeout.
>
> ---------- Forwarded message ---------
> From: Eric Secules <esecu...@gmail.com>
> Date: Thu., Jul. 30, 2020, 6:13 p.m.
> Subject: NiFi Merge Content Processor Use Case
> To: <d...@nifi.apache.org>
> Cc: <ggbus...@gmail.com>
>
>
> Hello,
>
> I have a use case for the MergeContent processor where I have split the
> flow into two branches (the original flowfile and a PDF; one branch may or
> may not take longer than the other), and I want to rejoin those branches
> using the Defragment strategy, keyed on the UUID of the flowfile before
> the split, to determine whether both branches have successfully completed.
> I noticed that as I increased the number of flowfiles generated into the
> system, I got more merge failures because bins were forced to the failure
> relationship before they were able to fully defragment. I can increase the
> number of buckets, but this is just a workaround because it doesn't solve
> the main problem. Is there a design pattern for accurately merging diverged
> branches back together that holds up under load and doesn't require me to
> guess a magic number for the number of bins?
>
> Thanks,
> Eric
>
