I've converted over our flows based on your recommendation, and will monitor and report back if I see any issues.
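In case it's useful to anyone else making the same change: the load-balance setting can also be flipped through the REST API instead of the UI. A rough, untested sketch - the host, connection ID, token, and revision version below are all placeholders:

    # fetch the connection first to read its current revision version
    curl -sk -H "Authorization: Bearer $TOKEN" \
      "https://nifi1-1:9443/nifi-api/connections/<connection-id>"

    # then turn load balancing off on the group-to-group connection
    curl -sk -X PUT \
      -H "Authorization: Bearer $TOKEN" \
      -H "Content-Type: application/json" \
      -d '{"revision":{"version":3},"component":{"id":"<connection-id>","loadBalanceStrategy":"DO_NOT_LOAD_BALANCE"}}' \
      "https://nifi1-1:9443/nifi-api/connections/<connection-id>"

The port-to-processor connection inside the group can then be set to round robin in the UI, per Mark's suggestion below.
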
On Fri, Dec 28, 2018 at 8:43 AM Mark Payne <marka...@hotmail.com> wrote:

> Dan, et al,
>
> Great news! I was finally able to replicate this issue, by creating a Load-Balanced connection between two Process Groups/Ports instead of between two processors. The fact that it's between two Ports does not, in and of itself, matter. But there is a race condition, and Ports do no actual processing of the FlowFile (they simply pull it from one queue and transfer it to another). Because that is extremely fast, it is more likely to trigger the race condition.
>
> So I created a JIRA [1] and have submitted a PR for it.
>
> Interestingly, while there is no fool-proof workaround until this fix is in and released, you could update your flow so that the connection between Process Groups is not load balanced, and instead the connection between the Input Port and the first Processor is. Again, this is not fool-proof, because the race condition can affect a Load-Balanced Connection even if it is connected to a Processor, but it is less likely to do so, so you would likely see the issue occur far less often.
>
> Thank you all so much for sticking with us as we diagnosed this and figured it all out - we would not have been able to do it without you spending the time to debug the issue!
>
> Thanks
> -Mark
>
> [1] https://issues.apache.org/jira/browse/NIFI-5919
>
>
> On Dec 26, 2018, at 10:31 PM, dan young <danoyo...@gmail.com> wrote:
>
> Hello Mark,
>
> I just stopped the destination processor, and then disconnected the node in question (nifi1-1). Once I disconnected the node, the flow file in the load-balanced connection disappeared from the queue. After that, I reconnected the node (with the downstream processor still stopped), and once the node successfully rejoined the cluster, the flowfile showed up in the queue again. After this, I started the connected downstream processor, but the flowfile stays in the queue. The only way to clear the queue is to actually restart the node: if I disconnect the node and then restart it, the flowfile is no longer present in the queue.
>
> Regards,
>
> Dano
>
>
> On Wed, Dec 26, 2018 at 6:13 PM Mark Payne <marka...@hotmail.com> wrote:
>
>> Ok, I just wanted to confirm that when you said "once it rejoins the cluster that flow file is gone," you meant "the flowfile did not exist on the system" and NOT "the queue size was 0 by the time that I looked at the UI." I.e., is it possible that the FlowFile did exist, was restored, and then was processed before you looked at the UI? Or did the FlowFile definitely not exist after the node was restarted? That's why I was suggesting that you restart with the connection's source and destination stopped - just to make sure that the FlowFile didn't simply get processed quickly on restart.
>>
>> Sent from my iPhone
>>
>> On Dec 26, 2018, at 7:55 PM, dan young <danoyo...@gmail.com> wrote:
>>
>> Heya Mark,
>>
>> If we restart the node, that "stuck" flowfile will disappear. This is the only way so far to clear out the flowfile: I disconnect the node, restart nifi once it's disconnected, and then once it rejoins the cluster that flow file is gone. If we try to empty the queue, it just says that there are no flow files in the queue.
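>>
>> For what it's worth, the per-node queue counts can also be pulled straight from the REST API rather than eyeballing the cluster view in the UI. A rough sketch - the host, connection ID, and token are placeholders (see the token discussion further down the thread), and the nodewise flag is what breaks the status out per node:
>>
>>     curl -sk -H "Authorization: Bearer $TOKEN" \
>>       "https://nifi1-1:9443/nifi-api/flow/connections/<connection-id>/status?nodewise=true"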
>>
>> On Wed, Dec 26, 2018, 5:22 PM Mark Payne <marka...@hotmail.com> wrote:
>>
>>> Hey Dan,
>>>
>>> Thanks, this is super useful! So, the following section is the damning part of the JSON:
>>>
>>> {
>>>   "totalFlowFileCount": 1,
>>>   "totalByteCount": 975890,
>>>   "nodeIdentifier": "nifi1-1:9443",
>>>   "localQueuePartition": {
>>>     "totalFlowFileCount": 0,
>>>     "totalByteCount": 0,
>>>     "activeQueueFlowFileCount": 0,
>>>     "activeQueueByteCount": 0,
>>>     "swapFlowFileCount": 0,
>>>     "swapByteCount": 0,
>>>     "swapFiles": 0,
>>>     "inFlightFlowFileCount": 0,
>>>     "inFlightByteCount": 0,
>>>     "allActiveQueueFlowFilesPenalized": false,
>>>     "anyActiveQueueFlowFilesPenalized": false
>>>   },
>>>   "remoteQueuePartitions": [
>>>     {
>>>       "totalFlowFileCount": 0,
>>>       "totalByteCount": 0,
>>>       "activeQueueFlowFileCount": 0,
>>>       "activeQueueByteCount": 0,
>>>       "swapFlowFileCount": 0,
>>>       "swapByteCount": 0,
>>>       "swapFiles": 0,
>>>       "inFlightFlowFileCount": 0,
>>>       "inFlightByteCount": 0,
>>>       "nodeIdentifier": "nifi2-1:9443"
>>>     },
>>>     {
>>>       "totalFlowFileCount": 0,
>>>       "totalByteCount": 0,
>>>       "activeQueueFlowFileCount": 0,
>>>       "activeQueueByteCount": 0,
>>>       "swapFlowFileCount": 0,
>>>       "swapByteCount": 0,
>>>       "swapFiles": 0,
>>>       "inFlightFlowFileCount": 0,
>>>       "inFlightByteCount": 0,
>>>       "nodeIdentifier": "nifi3-1:9443"
>>>     }
>>>   ]
>>> }
>>>
>>> It indicates that node nifi1-1 is showing a queue size of 1 FlowFile, 975890 bytes. But it also shows that the FlowFile is not in the "local partition" or either of the two "remote partitions." So that leaves us with two possibilities:
>>>
>>> 1) The Queue's count is wrong, because it somehow did not get decremented (perhaps a threading bug?)
>>>
>>> Or
>>>
>>> 2) The count is correct and the FlowFile exists, but somehow the reference to the FlowFile was lost by the FlowFile Queue (again, perhaps a threading bug?)
>>>
>>> If possible, I would like for you to stop both the source and destination of that connection and then restart node nifi1-1. Once it has restarted, check whether the FlowFile is still in the connection. That will tell us which of the two scenarios above is taking place. If the FlowFile exists upon restart, then the Queue somehow lost the handle to it. If the FlowFile does not exist in the connection upon restart (I'm guessing this will be the case), then it indicates that somehow the count is incorrect.
>>>
>>> Many thanks
>>> -Mark
>>>
>>> ------------------------------
>>> *From:* dan young <danoyo...@gmail.com>
>>> *Sent:* Wednesday, December 26, 2018 9:18 AM
>>> *To:* NiFi Mailing List
>>> *Subject:* Re: flowfiles stuck in load balanced queue; nifi 1.8
>>>
>>> Heya Mark,
>>>
>>> So I added a LogAttribute processor and routed the connection that had the "stuck" flowfile to it. I ran a GET on the diagnostics for the LogAttribute processor before I started it, and then ran another after I started it. The flowfile stayed in the load-balanced connection/queue. I've attached both files. Please LMK if this helps.
>>>
>>> Regards,
>>>
>>> Dano
>>>
>>>
>>> On Mon, Dec 24, 2018 at 10:35 AM Mark Payne <marka...@hotmail.com> wrote:
>>>
>>> Dan,
>>>
>>> You would want to get diagnostics for the processor that is the source/destination of the connection - not the FlowFile. But if your connection is connecting 2 process groups, then both its source and destination are Ports, not Processors. So the easiest thing to do would be to drop a "dummy processor" into the flow between the 2 groups, drag the Connection to that processor, get diagnostics for the processor, and then drag it back to where it was. Does that make sense? Sorry for the hassle.
>>>
>>> Thanks
>>> -Mark
>>>
>>> Sent from my iPhone
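>>>
>>> As an aside, the count mismatch Mark describes can be spotted mechanically. A sketch, assuming jq is installed and the per-node fragment from the diagnostics JSON above is saved as node-queue.json:
>>>
>>>     # reported queue count vs. the sum of local + remote partition counts
>>>     jq '(.localQueuePartition.totalFlowFileCount
>>>          + ([.remoteQueuePartitions[].totalFlowFileCount] | add // 0)) as $counted
>>>         | {reported: .totalFlowFileCount,
>>>            counted: $counted,
>>>            phantom: (.totalFlowFileCount - $counted)}' node-queue.json
>>>
>>> Against the fragment above, this should report reported=1, counted=0, phantom=1 - i.e., one phantom FlowFile that no partition accounts for.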
>>> On Dec 24, 2018, at 11:40 AM, dan young <danoyo...@gmail.com> wrote:
>>>
>>> Hello Bryan,
>>>
>>> Thank you, that was the ticket!
>>>
>>> Mark, I was able to run the diagnostics for a processor that's downstream from the connection where the flowfile appears to be "stuck". I'm not sure which processor is the source of this particular "stuck" flowfile, since we have a number of upstream process groups (PGs) that feed into a funnel. This funnel is then connected to a downstream PG, and it is this connection between the funnel and the downstream PG where the flowfile is stuck. I might reduce the upstream load-balanced connections between the various PGs to just one so I can narrow down where we need to run diagnostics.... If this isn't the correct processor to be gathering diagnostics on, please LMK where else I should look or what other diagnostics to run...
>>>
>>> I've also attached the output of the GET (nifi-api/connections/{id}) for the connection where the flowfile appears to be "stuck".
>>>
>>> On Sun, Dec 23, 2018 at 8:36 PM Bryan Bende <bbe...@gmail.com> wrote:
>>>
>>> You'll need to get the token that was obtained when you logged in to the SSO and submit it on the curl requests the same way the UI does on all requests.
>>>
>>> You should be able to open Chrome dev tools while in the UI, look at one of the requests/responses, and copy the value of the 'Authorization' header, which should be in the form 'Bearer <token>'.
>>>
>>> Then send this on the curl command by specifying a header of -H 'Authorization: Bearer <token>'
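>>>
>>> For example, something along these lines should work (a sketch - the host/port and processor ID are placeholders, and -k is only needed if the cluster uses self-signed certs):
>>>
>>>     # paste everything after "Bearer " from the Authorization header in dev tools
>>>     TOKEN='eyJhbGciOi...'
>>>     curl -sk -H "Authorization: Bearer $TOKEN" \
>>>       "https://nifi1-1:9443/nifi-api/processors/<processor-id>/diagnostics"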
>>>
>>> On Sun, Dec 23, 2018 at 6:28 PM dan young <danoyo...@gmail.com> wrote:
>>>
>>> I forgot to mention that we're using the OpenId Connect SSO. Is there a way to run these commands via curl when we have the cluster configured this way? If so, would anyone be able to provide some insight/examples?
>>>
>>> Happy Holidays!
>>>
>>> Regards,
>>>
>>> Dano
>>>
>>> On Sun, Dec 23, 2018 at 3:53 PM dan young <danoyo...@gmail.com> wrote:
>>>
>>> This is what I'm seeing in the logs when I try to access nifi-api/flow/about, for example...
>>>
>>> 2018-12-23 22:51:45,579 INFO [NiFi Web Server-24201] o.a.n.w.s.NiFiAuthenticationFilter Authentication success for d...@looker.com
>>> 2018-12-23 22:52:01,375 INFO [NiFi Web Server-24136] o.a.n.w.a.c.AccessDeniedExceptionMapper identity[anonymous], groups[none] does not have permission to access the requested resource. Unknown user with identity 'anonymous'. Returning Unauthorized response.
>>>
>>> On Sun, Dec 23, 2018 at 3:50 PM dan young <danoyo...@gmail.com> wrote:
>>>
>>> Hello Mark,
>>>
>>> I have a queue with a "stuck/phantom" flowfile again. When I try to call nifi-api/processors/<processor-id>/diagnostics against a processor in the UI after I authenticate, I get "Unknown user with identity 'anonymous'. Contact the system administrator." We're running a secure 3x node cluster. I tried this via the browser and also via the command line with curl on one of the nodes. One clarification point: which processor id should I be trying to gather the diagnostics on? The queue is in between two process groups.
>>>
>>> Maybe the issue with the Unknown User has to do with some policy I don't have set correctly?
>>>
>>> Happy Holidays!
>>>
>>> Regards,
>>> Dano
>>>
>>>
>>> On Wed, Dec 19, 2018 at 6:51 AM Mark Payne <marka...@hotmail.com> wrote:
>>>
>>> Hey Josef, Dano,
>>>
>>> Firstly, let me assure you that while I may be the only one from the NiFi side who's been engaging on debugging this, I am far from the only one who cares about it! :) This is a pretty big new feature that was added to the latest release, so understandably there are probably not yet a lot of people who understand the code well enough to debug it. I have tried replicating the issue, but have not been successful. I have a 3-node cluster that ran for well over a month without a restart, and I've also tried restarting it every few hours for a couple of days. It has about 8 different load-balanced connections, with varying data sizes and volumes. Unfortunately, though, I've not been able to get into this situation.
>>>
>>> But yes, I think we've seen this issue arise from each of the two of you and one other on the mailing list, so it is certainly something that we need to nail down ASAP. Unfortunately, an issue that involves communication between multiple nodes is often difficult to fully understand, so it may not be a trivial task to debug.
>>>
>>> Dano, if you are able to get the diagnostics, as Josef mentioned, that is likely to be pretty helpful. Off the top of my head, a few possibilities come to mind as to what kind of bug could cause such behavior:
>>>
>>> 1) Perhaps there really is no flowfile in the queue, but we somehow miscalculated the size of the queue. The diagnostics info would tell us whether or not this is the case: it will look into the queues themselves to determine how many FlowFiles are destined for each node in the cluster, rather than just returning the pre-calculated count. Failing that, you could also stop the source and destination of the queue, restart the node, and then see if the FlowFile is entirely gone from the queue on restart, or if it remains in the queue. If it is gone, that likely indicates that the pre-computed count is somehow off.
>>>
>>> 2) We are having trouble communicating with the node that we are trying to send the data to. I would expect some sort of ERROR log messages in this case.
>>>
>>> 3) The node is properly sending the FlowFile to where it needs to go, but for some reason the receiving node is then re-distributing it to another node in the cluster, which then re-distributes it again, so that it never ends up at the correct destination. I think this is unlikely, and it would be easy to verify by looking at the "Summary" table [1], opening the "Cluster view," and refreshing constantly for a few seconds to see whether the queue changes on any node in the cluster.
>>>
>>> 4) For some entirely unknown reason, there exists a bug that causes the node to simply see the FlowFile and just skip over it entirely.
>>>
>>> For additional logging, we can enable DEBUG logging on org.apache.nifi.controller.queue.clustered.client.async.nio.NioAsyncLoadBalanceClientTask:
>>>
>>>     <logger name="org.apache.nifi.controller.queue.clustered.client.async.nio.NioAsyncLoadBalanceClientTask" level="DEBUG" />
>>>
>>> With that DEBUG logging turned on, it may or may not generate a lot of DEBUG logs. If it does not, then that in and of itself tells us something. If it does generate a lot of DEBUG logs, then it would be good to see what it's dumping out in the logs.
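>>>
>>> An easy way to watch just that logger's output - assuming the stock logs/ directory under the NiFi install - is something like:
>>>
>>>     tail -f logs/nifi-app.log | grep NioAsyncLoadBalanceClientTask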
>>>
>>> And a big Thank You to you guys for staying engaged on this and your willingness to dig in!
>>>
>>> Thanks!
>>> -Mark
>>>
>>> [1] https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#Summary_Page
>>>
>>>
>>> On Dec 19, 2018, at 2:18 AM, <josef.zahn...@swisscom.com> <josef.zahn...@swisscom.com> wrote:
>>>
>>> Hi Dano
>>>
>>> It seems the problem has been seen by a few people, but until now nobody from the NiFi team has really cared about it – except Mark Payne. He mentioned the part below about the diagnostics; however, in my case this doesn't even work (I tried it on a standalone unsecured cluster as well as on a secured cluster)! Can you get the diagnostics on your cluster?
>>>
>>> I guess in the end we'll have to open a Jira ticket to narrow it down.
>>>
>>> Cheers Josef
>>>
>>>
>>> One thing that I would recommend, to get more information, is to go to the REST endpoint (in your browser is fine):
>>>
>>>     /nifi-api/processors/<processor id>/diagnostics
>>>
>>> where <processor id> is the UUID of either the source or the destination of the Connection in question. This gives us a lot of information about the internals of the Connection. The easiest way to get that Processor ID is to just click on the processor on the canvas and look at the Operate palette on the left-hand side. You can copy & paste it from there. If you then send the diagnostics information to us, we can analyze it to help understand what's happening.
>>>
>>>
>>> *From:* dan young <danoyo...@gmail.com>
>>> *Reply-To:* "users@nifi.apache.org" <users@nifi.apache.org>
>>> *Date:* Wednesday, 19 December 2018 at 05:28
>>> *To:* NiFi Mailing List <users@nifi.apache.org>
>>> *Subject:* flowfiles stuck in load balanced queue; nifi 1.8
>>>
>>> We're seeing this more frequently, where flowfiles seem to be stuck in a load-balanced queue. The only resolution is to disconnect the node and then restart it; after this, the flowfile disappears from the queue. Any ideas on what might be going on here, or what additional information I might be able to provide to debug this?
>>>
>>> I've attached another thread dump and some screenshots....
>>>
>>> Regards,
>>>
>>> Dano
>>>
>>>
>>> --
>>> Sent from Gmail Mobile
>>>
>>> <Screen Shot 2018-12-24 at 9.12.31 AM.png>
>>>
>>> <diag.json>
>>>
>>> <conn.json>