I've converted over our flows based on your recommendation, and will monitor and report back if I see any issues.
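In case it's useful to anyone else making the same change: the load-balance setting can also be flipped through the REST API instead of the UI. A rough, untested sketch - the host, connection ID, token, and revision version below are all placeholders:

    # fetch the connection first to read its current revision version
    curl -sk -H "Authorization: Bearer $TOKEN" \
      "https://nifi1-1:9443/nifi-api/connections/<connection-id>"

    # then turn load balancing off on the group-to-group connection
    curl -sk -X PUT \
      -H "Authorization: Bearer $TOKEN" \
      -H "Content-Type: application/json" \
      -d '{"revision":{"version":3},"component":{"id":"<connection-id>","loadBalanceStrategy":"DO_NOT_LOAD_BALANCE"}}' \
      "https://nifi1-1:9443/nifi-api/connections/<connection-id>"

The port-to-processor connection inside the group can then be set to round robin in the UI, per Mark's suggestion below.
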
On Fri, Dec 28, 2018 at 8:43 AM Mark Payne <marka...@hotmail.com> wrote:

> Dan, et al,
>
> Great news! I was finally able to replicate this issue, by creating a Load-Balanced connection between two Process Groups/Ports instead of between two processors. The fact that it's between two Ports does not, in and of itself, matter. But there is a race condition, and Ports do no actual processing of the FlowFile (they simply pull it from one queue and transfer it to another). Because that is extremely fast, it is more likely to trigger the race condition.
>
> So I created a JIRA [1] and have submitted a PR for it.
>
> Interestingly, while there is no fool-proof workaround until this fix is in and released, you could update your flow so that the connection between Process Groups is not load balanced, and instead the connection between the Input Port and the first Processor is. Again, this is not fool-proof, because the race condition can affect a Load-Balanced Connection even if it is connected to a Processor, but it is less likely to do so, so you would likely see the issue occur far less often.
>
> Thank you all so much for sticking with us as we diagnosed this and figured it all out - we would not have been able to do it without you spending the time to debug the issue!
>
> Thanks
> -Mark
>
> [1] https://issues.apache.org/jira/browse/NIFI-5919
>
>
> On Dec 26, 2018, at 10:31 PM, dan young <danoyo...@gmail.com> wrote:
>
> Hello Mark,
>
> I just stopped the destination processor, and then disconnected the node in question (nifi1-1). Once I disconnected the node, the flow file in the load-balanced connection disappeared from the queue. After that, I reconnected the node (with the downstream processor still stopped), and once the node successfully rejoined the cluster, the flowfile showed up in the queue again. After this, I started the connected downstream processor, but the flowfile stays in the queue. The only way to clear the queue is to actually restart the node: if I disconnect the node and then restart it, the flowfile is no longer present in the queue.
>
> Regards,
>
> Dano
>
>
> On Wed, Dec 26, 2018 at 6:13 PM Mark Payne <marka...@hotmail.com> wrote:
>
>> Ok, I just wanted to confirm that when you said "once it rejoins the cluster that flow file is gone," you meant "the flowfile did not exist on the system" and NOT "the queue size was 0 by the time that I looked at the UI." I.e., is it possible that the FlowFile did exist, was restored, and then was processed before you looked at the UI? Or did the FlowFile definitely not exist after the node was restarted? That's why I was suggesting that you restart with the connection's source and destination stopped - just to make sure that the FlowFile didn't simply get processed quickly on restart.
>>
>> Sent from my iPhone
>>
>> On Dec 26, 2018, at 7:55 PM, dan young <danoyo...@gmail.com> wrote:
>>
>> Heya Mark,
>>
>> If we restart the node, that "stuck" flowfile will disappear. This is the only way so far to clear out the flowfile: I disconnect the node, restart nifi once it's disconnected, and then once it rejoins the cluster that flow file is gone. If we try to empty the queue, it just says that there are no flow files in the queue.
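>>
>> For what it's worth, the per-node queue counts can also be pulled straight from the REST API rather than eyeballing the cluster view in the UI. A rough sketch - the host, connection ID, and token are placeholders (see the token discussion further down the thread), and the nodewise flag is what breaks the status out per node:
>>
>>     curl -sk -H "Authorization: Bearer $TOKEN" \
>>       "https://nifi1-1:9443/nifi-api/flow/connections/<connection-id>/status?nodewise=true"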
>>
>> On Wed, Dec 26, 2018, 5:22 PM Mark Payne <marka...@hotmail.com> wrote:
>>
>>> Hey Dan,
>>>
>>> Thanks, this is super useful! So, the following section is the damning part of the JSON:
>>>
>>> {
>>>   "totalFlowFileCount": 1,
>>>   "totalByteCount": 975890,
>>>   "nodeIdentifier": "nifi1-1:9443",
>>>   "localQueuePartition": {
>>>     "totalFlowFileCount": 0,
>>>     "totalByteCount": 0,
>>>     "activeQueueFlowFileCount": 0,
>>>     "activeQueueByteCount": 0,
>>>     "swapFlowFileCount": 0,
>>>     "swapByteCount": 0,
>>>     "swapFiles": 0,
>>>     "inFlightFlowFileCount": 0,
>>>     "inFlightByteCount": 0,
>>>     "allActiveQueueFlowFilesPenalized": false,
>>>     "anyActiveQueueFlowFilesPenalized": false
>>>   },
>>>   "remoteQueuePartitions": [
>>>     {
>>>       "totalFlowFileCount": 0,
>>>       "totalByteCount": 0,
>>>       "activeQueueFlowFileCount": 0,
>>>       "activeQueueByteCount": 0,
>>>       "swapFlowFileCount": 0,
>>>       "swapByteCount": 0,
>>>       "swapFiles": 0,
>>>       "inFlightFlowFileCount": 0,
>>>       "inFlightByteCount": 0,
>>>       "nodeIdentifier": "nifi2-1:9443"
>>>     },
>>>     {
>>>       "totalFlowFileCount": 0,
>>>       "totalByteCount": 0,
>>>       "activeQueueFlowFileCount": 0,
>>>       "activeQueueByteCount": 0,
>>>       "swapFlowFileCount": 0,
>>>       "swapByteCount": 0,
>>>       "swapFiles": 0,
>>>       "inFlightFlowFileCount": 0,
>>>       "inFlightByteCount": 0,
>>>       "nodeIdentifier": "nifi3-1:9443"
>>>     }
>>>   ]
>>> }
>>>
>>> It indicates that node nifi1-1 is showing a queue size of 1 FlowFile, 975890 bytes. But it also shows that the FlowFile is not in the "local partition" or either of the two "remote partitions." So that leaves us with two possibilities:
>>>
>>> 1) The Queue's count is wrong, because it somehow did not get decremented (perhaps a threading bug?)
>>>
>>> Or
>>>
>>> 2) The count is correct and the FlowFile exists, but somehow the reference to the FlowFile was lost by the FlowFile Queue (again, perhaps a threading bug?)
>>>
>>> If possible, I would like for you to stop both the source and destination of that connection and then restart node nifi1-1. Once it has restarted, check whether the FlowFile is still in the connection. That will tell us which of the two scenarios above is taking place. If the FlowFile exists upon restart, then the Queue somehow lost the handle to it. If the FlowFile does not exist in the connection upon restart (I'm guessing this will be the case), then it indicates that somehow the count is incorrect.
>>>
>>> Many thanks
>>> -Mark
>>>
>>> ------------------------------
>>> *From:* dan young <danoyo...@gmail.com>
>>> *Sent:* Wednesday, December 26, 2018 9:18 AM
>>> *To:* NiFi Mailing List
>>> *Subject:* Re: flowfiles stuck in load balanced queue; nifi 1.8
>>>
>>> Heya Mark,
>>>
>>> So I added a LogAttribute processor and routed the connection that had the "stuck" flowfile to it. I ran a GET on the diagnostics for the LogAttribute processor before I started it, and then ran another after I started it. The flowfile stayed in the load-balanced connection/queue. I've attached both files. Please LMK if this helps.
>>>
>>> Regards,
>>>
>>> Dano
>>>
>>>
>>> On Mon, Dec 24, 2018 at 10:35 AM Mark Payne <marka...@hotmail.com> wrote:
>>>
>>> Dan,
>>>
>>> You would want to get diagnostics for the processor that is the source/destination of the connection - not the FlowFile. But if your connection is connecting 2 process groups, then both its source and destination are Ports, not Processors. So the easiest thing to do would be to drop a "dummy processor" into the flow between the 2 groups, drag the Connection to that processor, get diagnostics for the processor, and then drag it back to where it was. Does that make sense? Sorry for the hassle.
>>>
>>> Thanks
>>> -Mark
>>>
>>> Sent from my iPhone
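>>>
>>> As an aside, the count mismatch Mark describes can be spotted mechanically. A sketch, assuming jq is installed and the per-node fragment from the diagnostics JSON above is saved as node-queue.json:
>>>
>>>     # reported queue count vs. the sum of local + remote partition counts
>>>     jq '(.localQueuePartition.totalFlowFileCount
>>>          + ([.remoteQueuePartitions[].totalFlowFileCount] | add // 0)) as $counted
>>>         | {reported: .totalFlowFileCount,
>>>            counted: $counted,
>>>            phantom: (.totalFlowFileCount - $counted)}' node-queue.json
>>>
>>> Against the fragment above, this should report reported=1, counted=0, phantom=1 - i.e., one phantom FlowFile that no partition accounts for.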
>>> On Dec 24, 2018, at 11:40 AM, dan young <danoyo...@gmail.com> wrote:
>>>
>>> Hello Bryan,
>>>
>>> Thank you, that was the ticket!
>>>
>>> Mark, I was able to run the diagnostics for a processor that's downstream from the connection where the flowfile appears to be "stuck". I'm not sure which processor is the source of this particular "stuck" flowfile, since we have a number of upstream process groups (PGs) that feed into a funnel. This funnel is then connected to a downstream PG, and it is this connection between the funnel and the downstream PG where the flowfile is stuck. I might reduce the upstream load-balanced connections between the various PGs to just one so I can narrow down where we need to run diagnostics.... If this isn't the correct processor to be gathering diagnostics on, please LMK where else I should look or what other diagnostics to run...
>>>
>>> I've also attached the output of the GET (nifi-api/connections/{id}) for the connection where the flowfile appears to be "stuck".
>>>
>>> On Sun, Dec 23, 2018 at 8:36 PM Bryan Bende <bbe...@gmail.com> wrote:
>>>
>>> You'll need to get the token that was obtained when you logged in to the SSO and submit it on the curl requests the same way the UI does on all requests.
>>>
>>> You should be able to open Chrome dev tools while in the UI, look at one of the requests/responses, and copy the value of the 'Authorization' header, which should be in the form 'Bearer <token>'.
>>>
>>> Then send this on the curl command by specifying a header of -H 'Authorization: Bearer <token>'
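>>>
>>> For example, something along these lines should work (a sketch - the host/port and processor ID are placeholders, and -k is only needed if the cluster uses self-signed certs):
>>>
>>>     # paste everything after "Bearer " from the Authorization header in dev tools
>>>     TOKEN='eyJhbGciOi...'
>>>     curl -sk -H "Authorization: Bearer $TOKEN" \
>>>       "https://nifi1-1:9443/nifi-api/processors/<processor-id>/diagnostics"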
>>>
>>> On Sun, Dec 23, 2018 at 6:28 PM dan young <danoyo...@gmail.com> wrote:
>>>
>>> I forgot to mention that we're using the OpenId Connect SSO. Is there a way to run these commands via curl when we have the cluster configured this way? If so, would anyone be able to provide some insight/examples?
>>>
>>> Happy Holidays!
>>>
>>> Regards,
>>>
>>> Dano
>>>
>>> On Sun, Dec 23, 2018 at 3:53 PM dan young <danoyo...@gmail.com> wrote:
>>>
>>> This is what I'm seeing in the logs when I try to access nifi-api/flow/about, for example...
>>>
>>> 2018-12-23 22:51:45,579 INFO [NiFi Web Server-24201] o.a.n.w.s.NiFiAuthenticationFilter Authentication success for d...@looker.com
>>> 2018-12-23 22:52:01,375 INFO [NiFi Web Server-24136] o.a.n.w.a.c.AccessDeniedExceptionMapper identity[anonymous], groups[none] does not have permission to access the requested resource. Unknown user with identity 'anonymous'. Returning Unauthorized response.
>>>
>>> On Sun, Dec 23, 2018 at 3:50 PM dan young <danoyo...@gmail.com> wrote:
>>>
>>> Hello Mark,
>>>
>>> I have a queue with a "stuck/phantom" flowfile again. When I try to call nifi-api/processors/<processor-id>/diagnostics against a processor in the UI after I authenticate, I get "Unknown user with identity 'anonymous'. Contact the system administrator." We're running a secure 3x node cluster. I tried this via the browser and also via the command line with curl on one of the nodes. One clarification point: which processor id should I be trying to gather the diagnostics on? The queue is in between two process groups.
>>>
>>> Maybe the issue with the Unknown User has to do with some policy I don't have set correctly?
>>>
>>> Happy Holidays!
>>>
>>> Regards,
>>> Dano
>>>
>>>
>>> On Wed, Dec 19, 2018 at 6:51 AM Mark Payne <marka...@hotmail.com> wrote:
>>>
>>> Hey Josef, Dano,
>>>
>>> Firstly, let me assure you that while I may be the only one from the NiFi side who's been engaging on debugging this, I am far from the only one who cares about it! :) This is a pretty big new feature that was added to the latest release, so understandably there are probably not yet a lot of people who understand the code well enough to debug it. I have tried replicating the issue, but have not been successful. I have a 3-node cluster that ran for well over a month without a restart, and I've also tried restarting it every few hours for a couple of days. It has about 8 different load-balanced connections, with varying data sizes and volumes. Unfortunately, though, I've not been able to get into this situation.
>>>
>>> But yes, I think we've seen this issue arise from each of the two of you and one other on the mailing list, so it is certainly something that we need to nail down ASAP. Unfortunately, an issue that involves communication between multiple nodes is often difficult to fully understand, so it may not be a trivial task to debug.
>>>
>>> Dano, if you are able to get the diagnostics, as Josef mentioned, that is likely to be pretty helpful. Off the top of my head, a few possibilities come to mind as to what kind of bug could cause such behavior:
>>>
>>> 1) Perhaps there really is no flowfile in the queue, but we somehow miscalculated the size of the queue. The diagnostics info would tell us whether or not this is the case: it will look into the queues themselves to determine how many FlowFiles are destined for each node in the cluster, rather than just returning the pre-calculated count. Failing that, you could also stop the source and destination of the queue, restart the node, and then see if the FlowFile is entirely gone from the queue on restart, or if it remains in the queue. If it is gone, that likely indicates that the pre-computed count is somehow off.
>>>
>>> 2) We are having trouble communicating with the node that we are trying to send the data to. I would expect some sort of ERROR log messages in this case.
>>>
>>> 3) The node is properly sending the FlowFile to where it needs to go, but for some reason the receiving node is then re-distributing it to another node in the cluster, which then re-distributes it again, so that it never ends up at the correct destination. I think this is unlikely, and it would be easy to verify by looking at the "Summary" table [1], opening the "Cluster view," and refreshing constantly for a few seconds to see whether the queue changes on any node in the cluster.
>>>
>>> 4) For some entirely unknown reason, there exists a bug that causes the node to simply see the FlowFile and just skip over it entirely.
>>>
>>> For additional logging, we can enable DEBUG logging on org.apache.nifi.controller.queue.clustered.client.async.nio.NioAsyncLoadBalanceClientTask:
>>>
>>>     <logger name="org.apache.nifi.controller.queue.clustered.client.async.nio.NioAsyncLoadBalanceClientTask" level="DEBUG" />
>>>
>>> With that DEBUG logging turned on, it may or may not generate a lot of DEBUG logs. If it does not, then that in and of itself tells us something. If it does generate a lot of DEBUG logs, then it would be good to see what it's dumping out in the logs.
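>>>
>>> An easy way to watch just that logger's output - assuming the stock logs/ directory under the NiFi install - is something like:
>>>
>>>     tail -f logs/nifi-app.log | grep NioAsyncLoadBalanceClientTask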
>>>
>>> And a big Thank You to you guys for staying engaged on this and your willingness to dig in!
>>>
>>> Thanks!
>>> -Mark
>>>
>>> [1] https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#Summary_Page
>>>
>>>
>>> On Dec 19, 2018, at 2:18 AM, <josef.zahn...@swisscom.com> <josef.zahn...@swisscom.com> wrote:
>>>
>>> Hi Dano
>>>
>>> It seems the problem has been seen by a few people, but until now nobody from the NiFi team has really cared about it – except Mark Payne. He mentioned the part below about the diagnostics; however, in my case this doesn't even work (I tried it on a standalone unsecured cluster as well as on a secured cluster)! Can you get the diagnostics on your cluster?
>>>
>>> I guess in the end we'll have to open a Jira ticket to narrow it down.
>>>
>>> Cheers Josef
>>>
>>>
>>> One thing that I would recommend, to get more information, is to go to the REST endpoint (in your browser is fine):
>>>
>>>     /nifi-api/processors/<processor id>/diagnostics
>>>
>>> where <processor id> is the UUID of either the source or the destination of the Connection in question. This gives us a lot of information about the internals of the Connection. The easiest way to get that Processor ID is to just click on the processor on the canvas and look at the Operate palette on the left-hand side. You can copy & paste it from there. If you then send the diagnostics information to us, we can analyze it to help understand what's happening.
>>>
>>>
>>> *From:* dan young <danoyo...@gmail.com>
>>> *Reply-To:* "users@nifi.apache.org" <users@nifi.apache.org>
>>> *Date:* Wednesday, 19 December 2018 at 05:28
>>> *To:* NiFi Mailing List <users@nifi.apache.org>
>>> *Subject:* flowfiles stuck in load balanced queue; nifi 1.8
>>>
>>> We're seeing this more frequently, where flowfiles seem to be stuck in a load-balanced queue. The only resolution is to disconnect the node and then restart it; after this, the flowfile disappears from the queue. Any ideas on what might be going on here, or what additional information I might be able to provide to debug this?
>>>
>>> I've attached another thread dump and some screenshots....
>>>
>>> Regards,
>>>
>>> Dano
>>>
>>>
>>> --
>>> Sent from Gmail Mobile
>>>
>>> <Screen Shot 2018-12-24 at 9.12.31 AM.png>
>>>
>>> <diag.json>
>>>
>>> <conn.json>