Ok great, thanks for the info! This at least tells me where to be investigating. Thanks!
-Mark

On Dec 26, 2018, at 10:31 PM, dan young <danoyo...@gmail.com> wrote:

Hello Mark,

I just stopped the destination processor, and then disconnected the node in question (nifi1-1). Once I disconnected the node, the flowfile in the load balance connection disappeared from the queue. After that, I reconnected the node (with the downstream processor disconnected) and once the node successfully rejoined the cluster, the flowfile showed up in the queue again. After this, I started the connected downstream processor, but the flowfile stays in the queue. The only way to clear the queue is if I actually restart the node. If I disconnect the node, and then restart that node, the flowfile is no longer present in the queue.

Regards,
Dano

On Wed, Dec 26, 2018 at 6:13 PM Mark Payne <marka...@hotmail.com> wrote:

Ok, I just wanted to confirm that when you said “once it rejoins the cluster that flow file is gone” that you mean “the flowfile did not exist on the system” and NOT “the queue size was 0 by the time that I looked at the UI.” I.e., is it possible that the FlowFile did exist, was restored, and then was processed before you looked at the UI? Or did the FlowFile definitely not exist after the node was restarted? That’s why I was suggesting that you restart with the connection’s source and destination stopped. Just to make sure that the FlowFile didn’t just get processed quickly on restart.

On Dec 26, 2018, at 7:55 PM, dan young <danoyo...@gmail.com> wrote:

Heya Mark,

If we restart the node, that "stuck" flowfile will disappear. This is the only way so far to clear out the flowfile. I usually disconnect the node, then once it's disconnected I restart nifi, and then once it rejoins the cluster that flow file is gone. If we try to empty the queue, it will just say that there are no flow files in the queue.

On Wed, Dec 26, 2018, 5:22 PM Mark Payne <marka...@hotmail.com> wrote:

Hey Dan,

Thanks, this is super useful! So, the following section is the damning part of the JSON:

{
  "totalFlowFileCount": 1,
  "totalByteCount": 975890,
  "nodeIdentifier": "nifi1-1:9443",
  "localQueuePartition": {
    "totalFlowFileCount": 0,
    "totalByteCount": 0,
    "activeQueueFlowFileCount": 0,
    "activeQueueByteCount": 0,
    "swapFlowFileCount": 0,
    "swapByteCount": 0,
    "swapFiles": 0,
    "inFlightFlowFileCount": 0,
    "inFlightByteCount": 0,
    "allActiveQueueFlowFilesPenalized": false,
    "anyActiveQueueFlowFilesPenalized": false
  },
  "remoteQueuePartitions": [
    {
      "totalFlowFileCount": 0,
      "totalByteCount": 0,
      "activeQueueFlowFileCount": 0,
      "activeQueueByteCount": 0,
      "swapFlowFileCount": 0,
      "swapByteCount": 0,
      "swapFiles": 0,
      "inFlightFlowFileCount": 0,
      "inFlightByteCount": 0,
      "nodeIdentifier": "nifi2-1:9443"
    },
    {
      "totalFlowFileCount": 0,
      "totalByteCount": 0,
      "activeQueueFlowFileCount": 0,
      "activeQueueByteCount": 0,
      "swapFlowFileCount": 0,
      "swapByteCount": 0,
      "swapFiles": 0,
      "inFlightFlowFileCount": 0,
      "inFlightByteCount": 0,
      "nodeIdentifier": "nifi3-1:9443"
    }
  ]
}

It indicates that node nifi1-1 is showing a queue size of 1 FlowFile, 975890 bytes. But it also shows that the FlowFile is not in the "local partition" or either of the two "remote partitions." So that leaves us with two possibilities:

1) The Queue's Count is wrong, because it somehow did not get decremented (perhaps a threading bug?)

Or

2) The Count is correct and the FlowFile exists, but somehow the reference to the FlowFile was lost by the FlowFile Queue (again, perhaps a threading bug?)

If possible, I would like you to stop both the source and destination of that connection and then restart node nifi1-1. Once it has restarted, check if the FlowFile is still in the connection. That will tell us which of the two above scenarios is taking place. If the FlowFile exists upon restart, then the Queue somehow lost the handle to it.
If the FlowFile does not exist in the connection upon restart (I'm guessing this will be the case), then it indicates that somehow the count is incorrect.

Many thanks
-Mark

________________________________
From: dan young <danoyo...@gmail.com>
Sent: Wednesday, December 26, 2018 9:18 AM
To: NiFi Mailing List
Subject: Re: flowfiles stuck in load balanced queue; nifi 1.8

Heya Mark,

So I added a Log Attribute Processor and routed the connection that had the "stuck" flowfile to it. I ran a get diagnostics against the Log Attribute processor before I started it, and then ran another diagnostics after I started it. The flowfile stayed in the load balanced connection/queue. I've attached both files. Please LMK if this helps.

Regards,
Dano

On Mon, Dec 24, 2018 at 10:35 AM Mark Payne <marka...@hotmail.com> wrote:

Dan,

You would want to get diagnostics for the processor that is the source/destination of the connection - not the FlowFile. But if your connection is connecting 2 process groups then both its source and destination are Ports, not Processors. So the easiest thing to do would be to drop a “dummy processor” into the flow between the 2 groups, drag the Connection to that processor, get diagnostics for the processor, and then drag it back to where it was. Does that make sense? Sorry for the hassle.

Thanks
-Mark

On Dec 24, 2018, at 11:40 AM, dan young <danoyo...@gmail.com> wrote:

Hello Bryan,

Thank you, that was the ticket!

Mark, I was able to run the diagnostics for a processor that's downstream from the connection where the flowfile appears to be "stuck". I'm not sure what processor is the source of this particular "stuck" flowfile since we have a number of upstream process groups (PGs) that feed into a funnel. This funnel is then connected to a downstream PG. It is this connection between the funnel and a downstream PG where the flowfile is stuck.
I might reduce the upstream "load balanced connections" between the various PGs to just one so I can narrow where we need to run diagnostics.... If this isn't the correct processor to be gathering diagnostics, please LMK where else I should look or other diagnostics to run...

I've also attached the output (nifi-api/connections/{id}) of the get for that connection where the flowfile appears to be "stuck".

On Sun, Dec 23, 2018 at 8:36 PM Bryan Bende <bbe...@gmail.com> wrote:

You'll need to get the token that was obtained when you logged in to the SSO and submit it on the curl requests the same way the UI is doing on all requests.

You should be able to open Chrome dev tools while in the UI, look at one of the requests/responses, and copy the value of the 'Authorization' header, which should be in the form 'Bearer <token>'.

Then send this on the curl command by specifying a header of -H 'Authorization: Bearer <token>'

On Sun, Dec 23, 2018 at 6:28 PM dan young <danoyo...@gmail.com> wrote:

I forgot to mention that we're using the OpenID Connect SSO. Is there a way to run these commands via curl when we have the cluster configured this way? If so, would anyone be able to provide some insight/examples?

Happy Holidays!

Regards,
Dano

On Sun, Dec 23, 2018 at 3:53 PM dan young <danoyo...@gmail.com> wrote:

This is what I'm seeing in the logs when I try to access the nifi-api/flow/about for example...

2018-12-23 22:51:45,579 INFO [NiFi Web Server-24201] o.a.n.w.s.NiFiAuthenticationFilter Authentication success for d...@looker.com
2018-12-23 22:52:01,375 INFO [NiFi Web Server-24136] o.a.n.w.a.c.AccessDeniedExceptionMapper identity[anonymous], groups[none] does not have permission to access the requested resource. Unknown user with identity 'anonymous'. Returning Unauthorized response.

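Bryan's suggestion (send the same 'Authorization: Bearer <token>' header the UI sends) can also be sketched in Python instead of curl. This is only a minimal illustration, not an official client: the function name build_diagnostics_request is hypothetical, the base URL https://nifi1-1:9443 is an assumption derived from the node identifier quoted elsewhere in this thread, and <processor-id> and <token> are placeholders you must fill in yourself. The request is built but deliberately not sent.

```python
import urllib.request


def build_diagnostics_request(base_url, processor_id, token):
    """Build (but do not send) a GET for a processor's diagnostics.

    base_url, processor_id and token are supplied by the caller; the token
    is the value copied from the UI's 'Authorization: Bearer <token>' header.
    """
    url = f"{base_url.rstrip('/')}/nifi-api/processors/{processor_id}/diagnostics"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


if __name__ == "__main__":
    # Placeholder values only; the host and scheme are assumptions.
    req = build_diagnostics_request("https://nifi1-1:9443", "<processor-id>", "<token>")
    print(req.get_method(), req.full_url)
```

To actually fetch the diagnostics you would pass the request to urllib.request.urlopen and parse the body as JSON; on a secured cluster you will probably also need an SSL context that trusts the cluster's certificate. Without the header, the request is treated as 'anonymous', which matches the log lines above.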
On Sun, Dec 23, 2018 at 3:50 PM dan young <danoyo...@gmail.com> wrote:

Hello Mark,

I have a queue with a "stuck/phantom" flowfile again. When I try to call the nifi-api/processors/<processor-id>/diagnostics against a processor, in the UI after I authenticate, I get an "Unknown user with identity 'anonymous'. Contact the system administrator." We're running a secure 3-node cluster. I tried this via the browser and also via the command line with curl on one of the nodes. One clarification point: what processor id should I be trying to gather the diagnostics on? The queue is in between two process groups. Maybe the issue with the Unknown User has to do with some policy I don't have set correctly?

Happy Holidays!

Regards,
Dano

On Wed, Dec 19, 2018 at 6:51 AM Mark Payne <marka...@hotmail.com> wrote:

Hey Josef, Dano,

Firstly, let me assure you that while I may be the only one from the NiFi side who's been engaging on debugging this, I am far from the only one who cares about it! :) This is a pretty big new feature that was added to the latest release, so understandably there are probably not yet a lot of people who understand the code well enough to debug. I have tried replicating the issue, but have not been successful. I have a 3-node cluster that ran for well over a month without a restart, and I've also tried restarting it every few hours for a couple of days. It has about 8 different load-balanced connections, with varying data sizes and volumes. I've not been able to get into this situation, though, unfortunately.

But yes, I think that we've seen this issue arise from each of the two of you and one other on the mailing list, so it is certainly something that we need to nail down ASAP. Unfortunately, debugging an issue that involves communication between multiple nodes is often difficult to fully understand, so it may not be a trivial task to debug.

Dano, if you are able to get to the diagnostics, as Josef mentioned, that is likely to be pretty helpful. Off the top of my head, there are a few possibilities that are coming to mind, as to what kind of bug could cause such behavior:

1) Perhaps there really is no flowfile in the queue, but we somehow miscalculated the size of the queue. The diagnostics info would tell us whether or not this is the case. It will look into the queues themselves to determine how many FlowFiles are destined for each node in the cluster, rather than just returning the pre-calculated count. Failing that, you could also stop the source and destination of the queue, restart the node, and then see if the FlowFile is entirely gone from the queue on restart, or if it remains in the queue. If it is gone, then that likely indicates that the pre-computed count is somehow off.

2) We are having trouble communicating with the node that we are trying to send the data to. I would expect some sort of ERROR log messages in this case.

3) The node is properly sending the FlowFile to where it needs to go, but for some reason the receiving node is then re-distributing it to another node in the cluster, which then re-distributes it again, so that it never ends in the correct destination. I think this is unlikely and would be easy to verify by looking at the "Summary" table [1] and doing the "Cluster view" and constantly refreshing for a few seconds to see if the queue changes on any node in the cluster.

4) For some entirely unknown reason, there exists a bug that causes the node to simply see the FlowFile and just skip over it entirely.

For additional logging, we can enable DEBUG logging on org.apache.nifi.controller.queue.clustered.client.async.nio.NioAsyncLoadBalanceClientTask:

<logger name="org.apache.nifi.controller.queue.clustered.client.async.nio.NioAsyncLoadBalanceClientTask" level="DEBUG" />

With that DEBUG logging turned on, it may or may not generate a lot of DEBUG logs.
If it does not, then that in and of itself tells us something. If it does generate a lot of DEBUG logs, then it would be good to see what it's dumping out in the logs.

And a big Thank You to you guys for staying engaged on this and your willingness to dig in!

Thanks!
-Mark

[1] https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#Summary_Page

On Dec 19, 2018, at 2:18 AM, <josef.zahn...@swisscom.com> wrote:

Hi Dano

It seems that the problem has been seen by a few people, but until now nobody from the NiFi team really cared about it, except Mark Payne. He mentioned the part below with the diagnostics; however, in my case this doesn't even work (tried it on a standalone unsecured cluster as well as on a secured cluster)! Can you get the diagnostics on your cluster?

I guess in the end we have to open a Jira ticket to narrow it down.

Cheers
Josef

One thing that I would recommend, to get more information, is to go to the REST endpoint (in your browser is fine)

/nifi-api/processors/<processor id>/diagnostics

Where <processor id> is the UUID of either the source or the destination of the Connection in question. This gives us a lot of information about the internals of the Connection. The easiest way to get that Processor ID is to just click on the processor on the canvas and look at the Operate palette on the left-hand side. You can copy & paste from there. If you then send the diagnostics information to us, we can analyze that to help understand what's happening.

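As a rough illustration of the kind of analysis meant here, the check Mark applies to the diagnostics JSON in his Dec 26 reply (queue total versus what the partitions actually hold) can be sketched in Python. The function name unaccounted_flowfiles is hypothetical, the field names are taken from the JSON fragment quoted in this thread, and the sample data is that fragment trimmed to its count fields. Where the fragment sits inside the full diagnostics response is not assumed; the caller passes the fragment itself.

```python
def unaccounted_flowfiles(queue_size):
    """Return the queue's reported total minus the FlowFiles found in its partitions.

    A positive result means the queue reports FlowFiles that no local or
    remote partition holds.
    """
    local = queue_size.get("localQueuePartition", {}).get("totalFlowFileCount", 0)
    remote = sum(
        partition.get("totalFlowFileCount", 0)
        for partition in queue_size.get("remoteQueuePartitions", [])
    )
    return queue_size["totalFlowFileCount"] - (local + remote)


# The fragment quoted in the thread, reduced to the fields used above.
sample = {
    "totalFlowFileCount": 1,
    "totalByteCount": 975890,
    "nodeIdentifier": "nifi1-1:9443",
    "localQueuePartition": {"totalFlowFileCount": 0},
    "remoteQueuePartitions": [
        {"totalFlowFileCount": 0, "nodeIdentifier": "nifi2-1:9443"},
        {"totalFlowFileCount": 0, "nodeIdentifier": "nifi3-1:9443"},
    ],
}

if __name__ == "__main__":
    missing = unaccounted_flowfiles(sample)
    print(f"{sample['nodeIdentifier']}: {missing} FlowFile(s) counted but not in any partition")
```

A non-zero result only flags the mismatch. It cannot tell apart the two explanations Mark lists (a stale count versus a lost reference to a FlowFile that still exists); that still requires restarting the node with the connection's source and destination stopped.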
From: dan young <danoyo...@gmail.com>
Reply-To: "users@nifi.apache.org" <users@nifi.apache.org>
Date: Wednesday, 19 December 2018 at 05:28
To: NiFi Mailing List <users@nifi.apache.org>
Subject: flowfiles stuck in load balanced queue; nifi 1.8

We're seeing this more frequently where flowfiles seem to be stuck in a load balanced queue. The only resolution is to disconnect the node and then restart that node. After this, the flowfile disappears from the queue. Any ideas on what might be going on here or what additional information I might be able to provide to debug this?

I've attached another thread dump and some screen shots....

Regards,
Dano

<Screen Shot 2018-12-24 at 9.12.31 AM.png> <diag.json> <conn.json>