Aaron,

Excellent! Glad that you're seeing better results. Sorry about that. Let us know if you run into any other strangeness!
Thanks
-Mark

On Aug 3, 2016, at 6:18 PM, Aaron Longfield <alongfi...@gmail.com> wrote:

I backported the patch from the master branch and it applies without changing much at all. Workflow processing works fine by my eye, but I do see quite a few provenance warnings logged. I haven't checked yet to see how that repository is working, but I just pushed a few million flowfiles through my flows, and output probably 25 GB to a remote processor without anything falling over!

-Aaron

On Mon, Aug 1, 2016 at 4:08 PM, Joe Witt <joe.w...@gmail.com> wrote:

Aaron,

OK, so from a production point of view I'd recommend a small patched version of the 0.7 release you were working with. It might be the case that grafting the master-line patch for that JIRA onto an 0.x patch is pretty straightforward. You could take a look at that as a short-term option. We probably should start 0.7.1 and 1.0-M1 type release motions soon anyway, so this could be a helpful catalyst.

Thanks
Joe

On Mon, Aug 1, 2016 at 4:03 PM, Aaron Longfield <alongfi...@gmail.com> wrote:

Joe,

Sure, I can give that a go. Are there any serious bugs I might run across with that branch that should make me worried about running it on a production flow?

-Aaron

On Mon, Aug 1, 2016 at 4:01 PM, Joe Witt <joe.w...@gmail.com> wrote:

Aaron,

It doesn't look like the 0.x version of that patch has been created yet. Any chance you could build master (slated for the upcoming 1.x release) and try that?

Thanks
Joe

On Mon, Aug 1, 2016 at 3:30 PM, Aaron Longfield <alongfi...@gmail.com> wrote:

Great, glad there's already a fixed bug for it!
Is there anything I can try to work around it for now, or at least to get longer processing times between restarts?

-Aaron

On Mon, Aug 1, 2016 at 11:54 AM, Mark Payne <marka...@hotmail.com> wrote:

Aaron,

Thanks for getting that to us quickly! It is extremely useful.

Joe,

I do indeed believe this is the same thing. I was in the middle of typing a response, but you beat me to it!

Thanks
-Mark

On Aug 1, 2016, at 11:49 AM, Joe Witt <joe.w...@gmail.com> wrote:

Aaron, Mark,

Looking at the thread dump provided, it looks to me like this is the same as what was reported and addressed in https://issues.apache.org/jira/browse/NIFI-2395

The fix for this has not yet been released, but it is slated to end up on an 0.x and a 1.0 release line.

Mark, do you agree it is the same thing, looking at the logs?

Thanks
Joe

On Mon, Aug 1, 2016 at 11:39 AM, Aaron Longfield <alongfi...@gmail.com> wrote:

Alright, here you go for one of the nodes!

On Mon, Aug 1, 2016 at 10:33 AM, Mark Payne <marka...@hotmail.com> wrote:

Aaron,

Any time that you find NiFi has stopped performing its work, the best thing to do is to take a thread dump and send it to the mailing list.
This allows us to determine exactly what is happening, so we know what action is being performed that prevents any other progress.

To do this, you can go to the NiFi node that is not performing and run the command:

bin/nifi.sh dump thread-dump.txt

This will generate a file named thread-dump.txt that you can send to us.

Thanks!
-Mark

On Aug 1, 2016, at 10:19 AM, Aaron Longfield <alongfi...@gmail.com> wrote:

I've been trying different things to fix my NiFi freeze problems, and it seems the most frequent reason my cluster gets stuck and stops processing has to do with network-related processors. My data enters the environment from Kafka and leaves via a site-to-site output port. After some time processing (sometimes a few minutes, sometimes a few hours), one of those will start logging connection errors, and then that node will stop processing any flowfiles across all processors.

So far, this has followed me from 0.6.1 to 0.7.0, and from Amazon Linux to RHEL7 (although RHEL seems to be happier). I've tried restricting threads to less than the number of available cores on each node, different heap sizes, and different garbage collectors. So far none of that has prevented the problem, unfortunately.

I'm not quite ready to build all custom processors for my flow logic...
most of it is straightforward attribute routing, text replacement, and flowfile merging.

What else could I try, or what could I be doing wrong that could lead to this? I'm happy to keep trying suggestions and changes; I really want this to work!

Thanks,
-Aaron

On Fri, Jul 15, 2016 at 12:07 PM, Lee Laim <lee.l...@gmail.com> wrote:

Aaron,

I ran into an issue where the ExecuteStreamCommand (ESC) processor with many threads would run a legacy script that would hang if the incoming file was 'inconsistent'. It appeared that ESC slowly collected stuck threads as malformed data randomly streamed through it. Eventually I ran out of threads, as the system was just waiting for a thread to become available.

It was apparent in the processor statistics, where the flowfiles-out statistic would eventually step down to zero as threads became stuck.

It might be worth trying InvokeScriptedProcessor or building custom processors, as they provide a means to handle these inconsistencies more gracefully.
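One general way to keep a hung legacy command from pinning a worker thread forever is to enforce a timeout on the child process. A minimal sketch in plain Python, independent of the NiFi scripting API (the commands and timeout values are illustrative, not from the thread above):

```python
import subprocess
import sys

def run_with_timeout(cmd, timeout_s):
    """Run an external command, killing it if it exceeds timeout_s seconds,
    so a hung legacy script cannot hold a worker thread forever."""
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
        return ("success", result.returncode)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child on timeout; the caller can then
        # route the offending flowfile to a failure relationship.
        return ("timeout", None)

# A command that hangs longer than the allowed budget:
print(run_with_timeout([sys.executable, "-c", "import time; time.sleep(30)"], 0.5))
# -> ('timeout', None)

# A command that finishes in time:
print(run_with_timeout([sys.executable, "-c", "pass"], 5))
# -> ('success', 0)
```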
> >> >> >>>> > >> >> >>>> > >> >> >>>> > >> >> >>>> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.script.InvokeScriptedProcessor/index.html > >> >> >>>> > >> >> >>>> <https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.script.InvokeScriptedProcessor/index.html> > >> >> >>>> > >> >> >>>> Thanks, > >> >> >>>> Lee > >> >> >>>> > >> >> >>>> > >> >> >>>> > >> >> >>>> > >> >> >>>> > >> >> >>>> On Fri, Jul 15, 2016 at 6:50 AM, Aaron Longfield > >> >> >>>> <alongfi...@gmail.com <mailto:alongfi...@gmail.com>> > >> >> >>>> wrote: > >> >> >>>>> > >> >> >>>>> Hi Mark, > >> >> >>>>> > >> >> >>>>> I've been using the G1 garbage collector. I brought the nodes > >> >> >>>>> down > >> >> >>>>> to > >> >> >>>>> 8GB heap and let it run overnight, but processing still got stuck > >> >> >>>>> and > >> >> >>>>> requiring NiFi to be restarted on all nodes. It took longer to > >> >> >>>>> happen, but > >> >> >>>>> they went down after a few hours. Are there any other things I > >> >> >>>>> can > >> >> >>>>> look > >> >> >>>>> into? > >> >> >>>>> > >> >> >>>>> Thanks! > >> >> >>>>> > >> >> >>>>> -Aaron > >> >> >>>>> > >> >> >>>>> On Thu, Jul 14, 2016 at 2:33 PM, Mark Payne > >> >> >>>>> <marka...@hotmail.com <mailto:marka...@hotmail.com>> > >> >> >>>>> wrote: > >> >> >>>>>> > >> >> >>>>>> Aaron, > >> >> >>>>>> > >> >> >>>>>> My guess would be that you are hitting a Full Garbage > >> >> >>>>>> Collection. > >> >> >>>>>> With > >> >> >>>>>> such a huge Java heap, that will cause a "stop the world" pause > >> >> >>>>>> for > >> >> >>>>>> quite a > >> >> >>>>>> long time. > >> >> >>>>>> Which garbage collector are you using? Have you tried reducing > >> >> >>>>>> the > >> >> >>>>>> heap > >> >> >>>>>> from 48 GB to say 4 or 8 GB? 
> >> >> >>>>>> > >> >> >>>>>> Thanks > >> >> >>>>>> -Mark > >> >> >>>>>> > >> >> >>>>>> > >> >> >>>>>>> On Jul 14, 2016, at 11:14 AM, Aaron Longfield > >> >> >>>>>>> <alongfi...@gmail.com <mailto:alongfi...@gmail.com>> > >> >> >>>>>>> wrote: > >> >> >>>>>>> > >> >> >>>>>>> Hi, > >> >> >>>>>>> > >> >> >>>>>>> I'm having an issue with a small (two node) NiFi cluster where > >> >> >>>>>>> the > >> >> >>>>>>> nodes will stop processing any queued flowfiles. I haven't > >> >> >>>>>>> seen > >> >> >>>>>>> any error > >> >> >>>>>>> messages logged related to it, and when attempting to restart > >> >> >>>>>>> the > >> >> >>>>>>> service, > >> >> >>>>>>> NiFi doesn't respond and the script forcibly kills it. This > >> >> >>>>>>> causes multiple > >> >> >>>>>>> flowfile version to hang around, and generally makes me feel > >> >> >>>>>>> like > >> >> >>>>>>> it might > >> >> >>>>>>> be causing data loss. > >> >> >>>>>>> > >> >> >>>>>>> I'm running the web UI on a different box, and when things stop > >> >> >>>>>>> working, it stops showing changes to counts in any queues, and > >> >> >>>>>>> the > >> >> >>>>>>> thread > >> >> >>>>>>> count never changes. It still thinks the nodes are connecting > >> >> >>>>>>> and > >> >> >>>>>>> responding, though. > >> >> >>>>>>> > >> >> >>>>>>> My environment is two 8 cpu systems w/ 60GB memory with 48GB > >> >> >>>>>>> given > >> >> >>>>>>> to > >> >> >>>>>>> the NiFi JVM in bootstrap.conf. I have timer threads limited > >> >> >>>>>>> to > >> >> >>>>>>> 12, and > >> >> >>>>>>> event threads to 4. Install is on the current Amazon Linux AMI > >> >> >>>>>>> and using > >> >> >>>>>>> OpenJDK 1.8.0.91 x64. > >> >> >>>>>>> > >> >> >>>>>>> Any idea, other debug steps, or changes that I can try? I'm > >> >> >>>>>>> running > >> >> >>>>>>> 0.7.0, having upgraded from 0.6.1, but this has been occurring > >> >> >>>>>>> with both > >> >> >>>>>>> versions. 
The higher the flowfile volume I push through, the faster this happens.

Thanks for any help there is to give!

-Aaron Longfield
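As a postscript to Mark's thread-dump suggestion earlier in the thread: a quick way to triage a dump before mailing it is to count thread states, since many BLOCKED threads hint at the kind of stall discussed above. A minimal sketch, assuming a standard HotSpot-style dump format (the excerpt below is invented for illustration, not real NiFi output):

```python
import re
from collections import Counter

def thread_states(dump_text):
    """Count java.lang.Thread.State occurrences in a HotSpot-style thread dump."""
    states = re.findall(r"java\.lang\.Thread\.State: (\w+)", dump_text)
    return Counter(states)

# Tiny invented excerpt of a thread dump, for illustration only:
sample = '''
"Timer-Driven Process Thread-1" #42 prio=5
   java.lang.Thread.State: BLOCKED (on object monitor)

"Timer-Driven Process Thread-2" #43 prio=5
   java.lang.Thread.State: RUNNABLE

"Provenance Maintenance Thread-1" #44 prio=5
   java.lang.Thread.State: BLOCKED (on object monitor)
'''

counts = thread_states(sample)
print(counts)  # a high BLOCKED count suggests threads stuck on a shared lock
```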