I backported the patch from the master branch and it applies without
changing much at all.  Workflow processing looks fine to my eye, but I do
see quite a few provenance warnings logged.  I haven't dug into how that
repository is behaving yet, but I just pushed a few million flowfiles
through my flows and sent probably 25GB out to a remote processor without
anything falling over!
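
For anyone wanting to try the same thing, a rough sketch of that kind of
graft (the commit hash is a placeholder for the actual NIFI-2395 fix on
master, and the exact release tag name may differ in your checkout):

git clone https://github.com/apache/nifi.git
cd nifi
git checkout -b nifi-0.7.0-patched rel/nifi-0.7.0
git cherry-pick <hash-of-the-NIFI-2395-fix-on-master>
mvn clean install -DskipTests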

-Aaron

On Mon, Aug 1, 2016 at 4:08 PM, Joe Witt <joe.w...@gmail.com> wrote:

> Aaron,
>
> Ok so from a production point of view I'd recommend a small patched
> version of the 0.7 release you were working with.  It might be the
> case that grafting the master line patch for that JIRA into an 0.x
> patch is pretty straightforward.  You could take a look at that as a
> short term option.  We probably should start 0.7.1 and 1.0-M1 type
> release motions soon anyway so this could be a helpful catalyst.
>
> Thanks
> Joe
>
> On Mon, Aug 1, 2016 at 4:03 PM, Aaron Longfield <alongfi...@gmail.com>
> wrote:
> > Joe,
> >
> > Sure, I can give that a go.  Are there any serious bugs I might run
> > across with that branch that should make me wary of running it on a
> > production flow?
> >
> > -Aaron
> >
> > On Mon, Aug 1, 2016 at 4:01 PM, Joe Witt <joe.w...@gmail.com> wrote:
> >>
> >> Aaron,
> >>
> >> It doesn't look like the 0.x version of that patch has been created
> >> yet.  Any chance you could build master (slated for the upcoming 1.x
> >> release) and try that?
> >>
> >> Thanks
> >> Joe
> >>
> >> On Mon, Aug 1, 2016 at 3:30 PM, Aaron Longfield <alongfi...@gmail.com>
> >> wrote:
> >> > Great, glad there's already a fix filed for it!  Is there anything I
> >> > can try to work around it for now, or at least get longer processing
> >> > times between restarts?
> >> >
> >> > -Aaron
> >> >
> >> > On Mon, Aug 1, 2016 at 11:54 AM, Mark Payne <marka...@hotmail.com>
> >> > wrote:
> >> >>
> >> >> Aaron,
> >> >>
> >> >> Thanks for getting that to us quickly! It is extremely useful.
> >> >>
> >> >> Joe,
> >> >>
> >> >> I do indeed believe this is the same thing. I was in the middle of
> >> >> typing
> >> >> a response, but you beat me to it!
> >> >>
> >> >> Thanks
> >> >> -Mark
> >> >>
> >> >>
> >> >> > On Aug 1, 2016, at 11:49 AM, Joe Witt <joe.w...@gmail.com> wrote:
> >> >> >
> >> >> > Aaron, Mark,
> >> >> >
> >> >> > In looking at the thread-dump provided it looks to me like this is
> >> >> > the
> >> >> > same as what was reported and addressed in
> >> >> > https://issues.apache.org/jira/browse/NIFI-2395
> >> >> >
> >> >> > The fix for this has not yet been released but is slated to end up
> >> >> > on an 0.x and 1.0 release line.
> >> >> >
> >> >> > Mark, looking at the logs, do you agree it is the same thing?
> >> >> >
> >> >> > Thanks
> >> >> > Joe
> >> >> >
> >> >> > On Mon, Aug 1, 2016 at 11:39 AM, Aaron Longfield
> >> >> > <alongfi...@gmail.com>
> >> >> > wrote:
> >> >> >> Alright, here you go for one of the nodes!
> >> >> >>
> >> >> >> On Mon, Aug 1, 2016 at 10:33 AM, Mark Payne <marka...@hotmail.com>
> >> >> >> wrote:
> >> >> >>>
> >> >> >>> Aaron,
> >> >> >>>
> >> >> >>> Any time that you find NiFi has stopped performing its work, the
> >> >> >>> best thing to do is to take a thread dump and send it to the
> >> >> >>> mailing list. This allows us to determine exactly what is
> >> >> >>> happening, so we know what action is being performed that
> >> >> >>> prevents any other progress.
> >> >> >>>
> >> >> >>> To do this, you can go to the NiFi node that is not performing
> >> >> >>> and run the command:
> >> >> >>>
> >> >> >>> bin/nifi.sh dump thread-dump.txt
> >> >> >>>
> >> >> >>> This will generate a file named thread-dump.txt that you can
> >> >> >>> send to us.
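> >> >> >>>
> >> >> >>> If you can, a couple of dumps taken a minute or so apart are even
> >> >> >>> more useful, since they show whether the same threads stay stuck.
> >> >> >>> A minimal sketch (the file names are just a suggestion):
> >> >> >>>
> >> >> >>> for i in 1 2 3; do
> >> >> >>>   bin/nifi.sh dump thread-dump-$i.txt
> >> >> >>>   sleep 60
> >> >> >>> done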
> >> >> >>>
> >> >> >>> Thanks!
> >> >> >>> -Mark
> >> >> >>>
> >> >> >>>
> >> >> >>> On Aug 1, 2016, at 10:19 AM, Aaron Longfield <alongfi...@gmail.com>
> >> >> >>> wrote:
> >> >> >>>
> >> >> >>> I've been trying different things to fix my NiFi freeze problems,
> >> >> >>> and it seems the most frequent reason my cluster gets stuck and
> >> >> >>> stops processing has to do with network-related processors.  My
> >> >> >>> data enters the environment from Kafka and leaves via a
> >> >> >>> site-to-site output port.  After some time processing (sometimes
> >> >> >>> a few minutes, sometimes a few hours), one of those will start
> >> >> >>> logging connection errors, and then that node will stop
> >> >> >>> processing any flowfiles across all processors.
> >> >> >>>
> >> >> >>> So far, this has followed me from 0.6.1 to 0.7.0 and from Amazon
> >> >> >>> Linux to RHEL7 (although RHEL seems to be happier).  I've tried
> >> >> >>> restricting threads to fewer than the number of available cores
> >> >> >>> on each node, different heap sizes, and different garbage
> >> >> >>> collectors.  So far none of that has prevented the problem,
> >> >> >>> unfortunately.
> >> >> >>>
> >> >> >>> I'm not quite ready to build all custom processors for my flow
> >> >> >>> logic... most of it is straightforward attribute routing, text
> >> >> >>> replacement, and flowfile merging.
> >> >> >>>
> >> >> >>> What other things could I try, or what might I be doing wrong
> >> >> >>> that could lead to this?  I'm happy to keep trying suggestions
> >> >> >>> and changes; I really want this to work!
> >> >> >>>
> >> >> >>> Thanks,
> >> >> >>> -Aaron
> >> >> >>>
> >> >> >>> On Fri, Jul 15, 2016 at 12:07 PM, Lee Laim <lee.l...@gmail.com>
> >> >> >>> wrote:
> >> >> >>>>
> >> >> >>>> Aaron,
> >> >> >>>>
> >> >> >>>> I ran into an issue where the Execute Stream Command (ESC)
> >> >> >>>> processor with many threads would run a legacy script that would
> >> >> >>>> hang if the incoming file was 'inconsistent'.  It appeared that
> >> >> >>>> ESC slowly collected stuck threads as malformed data randomly
> >> >> >>>> streamed through it.  Eventually I ran out of threads, as the
> >> >> >>>> system was just waiting for a thread to become available.
> >> >> >>>>
> >> >> >>>> It was apparent in the processor statistics where the
> >> >> >>>> flowfiles-out
> >> >> >>>> statistic would eventually step down to zero as threads became
> >> >> >>>> stuck.
> >> >> >>>>
> >> >> >>>> It might be worth trying InvokeScriptedProcessor or building
> >> >> >>>> custom processors, as they provide a means to handle these
> >> >> >>>> inconsistencies more gracefully.
> >> >> >>>>
> >> >> >>>> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.script.InvokeScriptedProcessor/index.html
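> >> >> >>>>
> >> >> >>>> As a stopgap, if the legacy script has to stay, wrapping it so a
> >> >> >>>> hung invocation can't hold a thread forever may also help.  A
> >> >> >>>> rough sketch using coreutils timeout (the script path is just a
> >> >> >>>> placeholder; point ESC at the wrapper instead of the script):
> >> >> >>>>
> >> >> >>>> #!/bin/sh
> >> >> >>>> # kill the legacy script if it runs longer than 5 minutes;
> >> >> >>>> # timeout exits with status 124 so the hang shows up as a failure
> >> >> >>>> exec timeout 300 /path/to/legacy-script.sh "$@"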
> >> >> >>>>
> >> >> >>>> Thanks,
> >> >> >>>> Lee
> >> >> >>>>
> >> >> >>>>
> >> >> >>>>
> >> >> >>>>
> >> >> >>>>
> >> >> >>>> On Fri, Jul 15, 2016 at 6:50 AM, Aaron Longfield
> >> >> >>>> <alongfi...@gmail.com>
> >> >> >>>> wrote:
> >> >> >>>>>
> >> >> >>>>> Hi Mark,
> >> >> >>>>>
> >> >> >>>>> I've been using the G1 garbage collector.  I brought the nodes
> >> >> >>>>> down to an 8GB heap and let it run overnight, but processing
> >> >> >>>>> still got stuck and required NiFi to be restarted on all nodes.
> >> >> >>>>> It took longer to happen, but they went down after a few hours.
> >> >> >>>>> Are there any other things I can look into?
> >> >> >>>>>
> >> >> >>>>> Thanks!
> >> >> >>>>>
> >> >> >>>>> -Aaron
> >> >> >>>>>
> >> >> >>>>> On Thu, Jul 14, 2016 at 2:33 PM, Mark Payne
> >> >> >>>>> <marka...@hotmail.com>
> >> >> >>>>> wrote:
> >> >> >>>>>>
> >> >> >>>>>> Aaron,
> >> >> >>>>>>
> >> >> >>>>>> My guess would be that you are hitting a Full Garbage
> >> >> >>>>>> Collection.  With such a huge Java heap, that will cause a
> >> >> >>>>>> "stop the world" pause for quite a long time.
> >> >> >>>>>> Which garbage collector are you using? Have you tried reducing
> >> >> >>>>>> the heap from 48 GB to, say, 4 or 8 GB?
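> >> >> >>>>>>
> >> >> >>>>>> One way to check is to watch the GC counters on a node while it
> >> >> >>>>>> runs, for example (the pid is a placeholder for the NiFi Java
> >> >> >>>>>> process; 5000 is the sampling interval in milliseconds):
> >> >> >>>>>>
> >> >> >>>>>> jstat -gcutil <nifi-jvm-pid> 5000
> >> >> >>>>>>
> >> >> >>>>>> If the FGC/FGCT columns climb right when the flow stalls, full
> >> >> >>>>>> collections are the likely culprit.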
> >> >> >>>>>>
> >> >> >>>>>> Thanks
> >> >> >>>>>> -Mark
> >> >> >>>>>>
> >> >> >>>>>>
> >> >> >>>>>>> On Jul 14, 2016, at 11:14 AM, Aaron Longfield
> >> >> >>>>>>> <alongfi...@gmail.com>
> >> >> >>>>>>> wrote:
> >> >> >>>>>>>
> >> >> >>>>>>> Hi,
> >> >> >>>>>>>
> >> >> >>>>>>> I'm having an issue with a small (two node) NiFi cluster
> >> >> >>>>>>> where the nodes will stop processing any queued flowfiles.  I
> >> >> >>>>>>> haven't seen any error messages logged related to it, and
> >> >> >>>>>>> when attempting to restart the service, NiFi doesn't respond
> >> >> >>>>>>> and the script forcibly kills it.  This causes multiple
> >> >> >>>>>>> flowfile versions to hang around, and generally makes me feel
> >> >> >>>>>>> like it might be causing data loss.
> >> >> >>>>>>>
> >> >> >>>>>>> I'm running the web UI on a different box, and when things
> >> >> >>>>>>> stop working, it stops showing changes to counts in any
> >> >> >>>>>>> queues, and the thread count never changes.  It still thinks
> >> >> >>>>>>> the nodes are connecting and responding, though.
> >> >> >>>>>>>
> >> >> >>>>>>> My environment is two 8-CPU systems with 60GB of memory, 48GB
> >> >> >>>>>>> of which is given to the NiFi JVM in bootstrap.conf.  I have
> >> >> >>>>>>> timer threads limited to 12, and event threads to 4.  The
> >> >> >>>>>>> install is on the current Amazon Linux AMI, using OpenJDK
> >> >> >>>>>>> 1.8.0.91 x64.
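> >> >> >>>>>>> (For reference, the heap is set through the java.arg entries
> >> >> >>>>>>> in conf/bootstrap.conf, roughly like the excerpt below; the
> >> >> >>>>>>> arg indexes may differ from the stock file.)
> >> >> >>>>>>>
> >> >> >>>>>>> # JVM memory settings
> >> >> >>>>>>> java.arg.2=-Xms48g
> >> >> >>>>>>> java.arg.3=-Xmx48g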
> >> >> >>>>>>>
> >> >> >>>>>>> Any ideas, other debug steps, or changes that I can try?  I'm
> >> >> >>>>>>> running 0.7.0, having upgraded from 0.6.1, but this has been
> >> >> >>>>>>> occurring with both versions.  The higher the flowfile volume
> >> >> >>>>>>> I push through, the faster this happens.
> >> >> >>>>>>>
> >> >> >>>>>>> Thanks for any help there is to give!
> >> >> >>>>>>>
> >> >> >>>>>>> -Aaron Longfield
> >> >> >>>>>>
> >> >> >>>>>
> >> >> >>>>
> >> >> >>>
> >> >> >>>
> >> >> >>
> >> >>
> >> >
> >
> >
>
