Re: flowfiles stuck in load balanced queue; nifi 1.8

Mark Payne Wed, 26 Dec 2018 19:36:50 -0800

Ok great, thanks for the info! This at least tells me where to be 
investigating. Thanks!


-Mark

Sent from my iPhone

On Dec 26, 2018, at 10:31 PM, dan young 
<danoyo...@gmail.com<mailto:danoyo...@gmail.com>> wrote:

Hello Mark,

I just stopped the destination processor, and then disconnected the node in 
question (nifi1-1). Once I disconnected the node, the flow file in the load 
balance connection disappeared from the queue.  After that, I reconnected the 
node (with the downstream processor disconnected) and once the node 
successfully rejoined the cluster, the flowfile showed up in the queue again. 
After this, I started the connected downstream processor, but the flowfile 
stays in the queue. The only way to clear the queue is if I actually restart 
the node.  If I disconnect the node, and then restart that node, the flowfile 
is no longer present in the queue.

Regards,

Dano


On Wed, Dec 26, 2018 at 6:13 PM Mark Payne 
<marka...@hotmail.com<mailto:marka...@hotmail.com>> wrote:
Ok, I just wanted to confirm that when you said “once it rejoins the cluster 
that flow file is gone” that you mean “the flowfile did not exist on the 
system” and NOT “the queue size was 0 by the time that I looked at the UI.” 
I.e., is it possible that the FlowFile did exist, was restored, and then was 
processed before you looked at the UI? Or the FlowFile definitely did not exist 
after the node was restarted? That’s why I was suggesting that you restart with 
the connection’s source and destination stopped. Just to make sure that the 
FlowFile didn’t just get processed quickly on restart.

Sent from my iPhone

On Dec 26, 2018, at 7:55 PM, dan young 
<danoyo...@gmail.com<mailto:danoyo...@gmail.com>> wrote:

Heya Mark,

If we restart the node, that "stuck" flowfile will disappear. This is the only 
way so far to clear out the flowfile. I usually disconnect the node, then once 
it's disconnected I restart nifi, and then once it rejoins the cluster that 
flow file is gone. If we try to empty the queue, it will just say that there
are no flow files in the queue.


On Wed, Dec 26, 2018, 5:22 PM Mark Payne 
<marka...@hotmail.com<mailto:marka...@hotmail.com>> wrote:
Hey Dan,

Thanks, this is super useful! So, the following section is the damning part of 
the JSON:

          {
            "totalFlowFileCount": 1,
            "totalByteCount": 975890,
            "nodeIdentifier": "nifi1-1:9443",
            "localQueuePartition": {
              "totalFlowFileCount": 0,
              "totalByteCount": 0,
              "activeQueueFlowFileCount": 0,
              "activeQueueByteCount": 0,
              "swapFlowFileCount": 0,
              "swapByteCount": 0,
              "swapFiles": 0,
              "inFlightFlowFileCount": 0,
              "inFlightByteCount": 0,
              "allActiveQueueFlowFilesPenalized": false,
              "anyActiveQueueFlowFilesPenalized": false
            },
            "remoteQueuePartitions": [
              {
                "totalFlowFileCount": 0,
                "totalByteCount": 0,
                "activeQueueFlowFileCount": 0,
                "activeQueueByteCount": 0,
                "swapFlowFileCount": 0,
                "swapByteCount": 0,
                "swapFiles": 0,
                "inFlightFlowFileCount": 0,
                "inFlightByteCount": 0,
                "nodeIdentifier": "nifi2-1:9443"
              },
              {
                "totalFlowFileCount": 0,
                "totalByteCount": 0,
                "activeQueueFlowFileCount": 0,
                "activeQueueByteCount": 0,
                "swapFlowFileCount": 0,
                "swapByteCount": 0,
                "swapFiles": 0,
                "inFlightFlowFileCount": 0,
                "inFlightByteCount": 0,
                "nodeIdentifier": "nifi3-1:9443"
              }
            ]
          }

It indicates that node nifi1-1 is showing a queue size of 1 FlowFile, 975890 
bytes. But it also shows that the FlowFile is not in the "local partition" or 
either of the two "remote partitions." So that leaves us with two possibilities:

1) The Queue's Count is wrong, because it somehow did not get decremented 
(perhaps a threading bug?)

Or

2) The Count is correct and the FlowFile exists, but somehow the reference to 
the FlowFile was lost by the FlowFile Queue (again, perhaps a threading bug?)

If possible, I would like you to stop both the source and destination of that 
connection and then restart node nifi1-1. Once it has restarted, check if the 
FlowFile is still in the connection. That will tell us which of the two above 
scenarios is taking place. If the FlowFile exists upon restart, then the Queue 
somehow lost the handle to it. If the FlowFile does not exist in the connection 
upon restart (I'm guessing this will be the case), then it indicates that 
somehow the count is incorrect.
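
The comparison described above (the queue's reported count versus what its partitions actually hold) can be sketched in Python. This is only an illustration: the field names are taken from the JSON excerpt in this message, while the function name `partition_mismatch` and the trimmed-down sample dictionary are assumptions made for the example.

```python
def partition_mismatch(queue_diag):
    """Return (reported, accounted) FlowFile counts for one node's queue entry.

    `reported` is the queue-level totalFlowFileCount; `accounted` is the sum of
    the local partition and all remote partitions.
    """
    reported = queue_diag["totalFlowFileCount"]
    accounted = queue_diag["localQueuePartition"]["totalFlowFileCount"] + sum(
        p["totalFlowFileCount"] for p in queue_diag.get("remoteQueuePartitions", [])
    )
    return reported, accounted


# Trimmed-down copy of the excerpt above (only the fields the check needs).
sample = {
    "totalFlowFileCount": 1,
    "totalByteCount": 975890,
    "nodeIdentifier": "nifi1-1:9443",
    "localQueuePartition": {"totalFlowFileCount": 0},
    "remoteQueuePartitions": [
        {"totalFlowFileCount": 0, "nodeIdentifier": "nifi2-1:9443"},
        {"totalFlowFileCount": 0, "nodeIdentifier": "nifi3-1:9443"},
    ],
}

reported, accounted = partition_mismatch(sample)
if reported != accounted:
    print(f"{sample['nodeIdentifier']}: queue reports {reported}, "
          f"partitions hold {accounted}")
```

A mismatch only shows that the count and the partitions disagree; it does not by itself say which of the two scenarios is the cause, which is why the restart test is still needed.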

Many thanks
-Mark

________________________________
From: dan young <danoyo...@gmail.com<mailto:danoyo...@gmail.com>>
Sent: Wednesday, December 26, 2018 9:18 AM
To: NiFi Mailing List
Subject: Re: flowfiles stuck in load balanced queue; nifi 1.8

Heya Mark,

So I added a Log Attribute Processor and routed the connection that had the 
"stuck" flowfile to it.   I ran a get diagnostics to the Log Attribute 
processor before I started it, and then ran another diagnostics after I started 
it.  The flowfile stayed in the load balanced connection/queue.  I've attached 
both files.  Please LMK if this helps.

Regards,

Dano


On Mon, Dec 24, 2018 at 10:35 AM Mark Payne 
<marka...@hotmail.com<mailto:marka...@hotmail.com>> wrote:
Dan,

You would want to get diagnostics for the processor that is the 
source/destination of the connection - not the FlowFile. But if your connection 
is connecting 2 process groups then both its source and destination are Ports, 
not Processors. So the easiest thing to do would be to drop a “dummy processor” 
into the flow between the 2 groups, drag the Connection to that processor, get 
diagnostics for the processor, and then drag it back to where it was. Does that 
make sense? Sorry for the hassle.

Thanks
-Mark

Sent from my iPhone

On Dec 24, 2018, at 11:40 AM, dan young 
<danoyo...@gmail.com<mailto:danoyo...@gmail.com>> wrote:

Hello Bryan,

Thank you, that was the ticket!

Mark, I was able to run the diagnostics for a processor that's downstream from 
the connection where the flowfile appears to be "stuck". I'm not sure what 
processor is the source of this particular "stuck" flowfile since we have a 
number of upstream processor groups (PG) that feed into a funnel.  This funnel 
is then connected to a downstream PG. It is this connection between the funnel 
and a downstream PG where the flowfile is stuck. I might reduce the upstream 
"load balanced connections" between the various PGs to just one so I can narrow 
where we need to run diagnostics....  If this isn't the correct processor to be 
gathering diagnostics, please LMK where else I should look or other diagnostics 
to run...

I've also attached the output (nifi-api/connections/{id}) of the get for that 
connection where the flowfile appears to be "stuck"

On Sun, Dec 23, 2018 at 8:36 PM Bryan Bende 
<bbe...@gmail.com<mailto:bbe...@gmail.com>> wrote:
You’ll need to get the token that was obtained when you logged in to the SSO 
and submit it on the curl requests the same way the UI is doing on all requests.

You should be able to open the Chrome dev tools while in the UI, look at one of
the requests/responses, and copy the value of the 'Authorization' header, which
should be in the form 'Bearer <token>'.

Then send this on the curl command by specifying a header of -H 'Authorization: 
Bearer <token>'
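
For anyone scripting this instead of using curl, the same header can be attached with Python's standard library. This is a minimal sketch: the host name, processor id, and token below are placeholders, and the helper name `build_diagnostics_request` is made up for the example. The endpoint path is the diagnostics one mentioned elsewhere in this thread.

```python
import urllib.request


def build_diagnostics_request(base_url, processor_id, token):
    """Build (but do not send) a GET request carrying the Bearer token."""
    url = f"{base_url.rstrip('/')}/nifi-api/processors/{processor_id}/diagnostics"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


# Placeholder values; substitute the real host, processor UUID, and the token
# copied from the browser's dev tools.
req = build_diagnostics_request(
    "https://nifi.example.com:9443", "PROCESSOR-UUID", "TOKEN-FROM-BROWSER"
)
print(req.full_url)
print(req.get_header("Authorization"))
```

Sending it would then be a matter of passing `req` to `urllib.request.urlopen`, with whatever TLS trust settings the cluster's certificates require.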

On Sun, Dec 23, 2018 at 6:28 PM dan young 
<danoyo...@gmail.com<mailto:danoyo...@gmail.com>> wrote:
I forgot to mention that we're using the OpenId Connect SSO.  Is there a way 
to run these commands via curl when we have the cluster configured this way?  If 
so, would anyone be able to provide some insight/examples?

Happy Holidays!

Regards,

Dano

On Sun, Dec 23, 2018 at 3:53 PM dan young 
<danoyo...@gmail.com<mailto:danoyo...@gmail.com>> wrote:
This is what I'm seeing in the logs when I try to access the 
nifi-api/flow/about for example...


2018-12-23 22:51:45,579 INFO [NiFi Web Server-24201] 
o.a.n.w.s.NiFiAuthenticationFilter Authentication success for 
d...@looker.com<mailto:d...@looker.com>

2018-12-23 22:52:01,375 INFO [NiFi Web Server-24136] 
o.a.n.w.a.c.AccessDeniedExceptionMapper identity[anonymous], groups[none] does 
not have permission to access the requested resource. Unknown user with 
identity 'anonymous'. Returning Unauthorized response.

On Sun, Dec 23, 2018 at 3:50 PM dan young 
<danoyo...@gmail.com<mailto:danoyo...@gmail.com>> wrote:
Hello Mark,

I have a queue with a "stuck/phantom" flowfile again.  When I try to call 
the nifi-api/processors/<processor-id>/diagnostics against a processor, in the 
UI after I authenticate, I get an "Unknown user with identity 'anonymous'. 
Contact the system administrator." We're running a secure 3x node cluster. I 
tried this via the browser and also via the command line with curl on one of 
the nodes. One clarification point: what processor id should I be trying to 
gather the diagnostics on? The queue is in between two processor groups.

Maybe the issue with the Unknown User has to do with some policy I don't have 
set correctly?

Happy Holidays!

Regards,
Dano




On Wed, Dec 19, 2018 at 6:51 AM Mark Payne 
<marka...@hotmail.com<mailto:marka...@hotmail.com>> wrote:
Hey Josef, Dano,

Firstly, let me assure you that while I may be the only one from the NiFi side
who's been engaging on debugging this, I am far from the only one who cares
about it! :) This is a pretty big new feature that was added to the latest
release, so understandably there are probably not yet a lot of people who
understand the code well enough to debug. I have tried replicating the issue,
but have not been successful. I have a 3-node cluster that ran for well over a
month without a restart, and I've also tried restarting it every few hours for
a couple of days. It has about 8 different load-balanced connections, with
varying data sizes and volumes. I've not been able to get into this situation,
though, unfortunately.

But yes, I think that we've seen this issue arise from each of the two of you
and one other on the mailing list, so it is certainly something that we need to
nail down ASAP. Unfortunately, debugging an issue that involves communication
between multiple nodes is often difficult to fully understand, so it may not be
a trivial task to debug.

Dano, if you are able to get to the diagnostics, as Josef mentioned, that is
likely to be pretty helpful. Off the top of my head, there are a few
possibilities that are coming to mind, as to what kind of bug could cause such
behavior:

1) Perhaps there really is no flowfile in the queue, but we somehow
miscalculated the size of the queue. The diagnostics info would tell us whether
or not this is the case. It will look into the queues themselves to determine
how many FlowFiles are destined for each node in the cluster, rather than just
returning the pre-calculated count. Failing that, you could also stop the
source and destination of the queue, restart the node, and then see if the
FlowFile is entirely gone from the queue on restart, or if it remains in the
queue. If it is gone, then that likely indicates that the pre-computed count is
somehow off.

2) We are having trouble communicating with the node that we are trying to send
the data to. I would expect some sort of ERROR log messages in this case.

3) The node is properly sending the FlowFile to where it needs to go, but for
some reason the receiving node is then re-distributing it to another node in
the cluster, which then re-distributes it again, so that it never ends in the
correct destination. I think this is unlikely and would be easy to verify by
looking at the "Summary" table [1] and doing the "Cluster view" and constantly
refreshing for a few seconds to see if the queue changes on any node in the
cluster.

4) For some entirely unknown reason, there exists a bug that causes the node to
simply see the FlowFile and just skip over it entirely.
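
The check in possibility 3 (does the queue size keep changing on any node between refreshes?) can be expressed as a small Python sketch. It assumes the per-node queue counts have already been collected by some means, for example by noting them from the Cluster view on each refresh. The function name and the snapshot layout are hypothetical.

```python
def nodes_with_changing_queue(snapshots):
    """Return the nodes whose queued FlowFile count differs between any two
    consecutive snapshots.

    `snapshots` is a list of dicts mapping node identifier -> queued count,
    one dict per refresh, in the order they were taken.
    """
    changing = set()
    for prev, curr in zip(snapshots, snapshots[1:]):
        for node in prev.keys() | curr.keys():
            if prev.get(node) != curr.get(node):
                changing.add(node)
    return sorted(changing)


# A FlowFile that sits on one node: counts never move between refreshes.
stuck = [
    {"nifi1-1:9443": 1, "nifi2-1:9443": 0, "nifi3-1:9443": 0},
    {"nifi1-1:9443": 1, "nifi2-1:9443": 0, "nifi3-1:9443": 0},
]

# A FlowFile being passed around: the count moves from node to node.
bouncing = [
    {"nifi1-1:9443": 1, "nifi2-1:9443": 0, "nifi3-1:9443": 0},
    {"nifi1-1:9443": 0, "nifi2-1:9443": 1, "nifi3-1:9443": 0},
]

print(nodes_with_changing_queue(stuck))
print(nodes_with_changing_queue(bouncing))
```

An empty result over several refreshes would argue against the re-distribution theory; changing counts would support it.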

For additional logging, we can enable DEBUG logging on
org.apache.nifi.controller.queue.clustered.client.async.nio.NioAsyncLoadBalanceClientTask:
<logger name="org.apache.nifi.controller.queue.clustered.client.async.nio.NioAsyncLoadBalanceClientTask" level="DEBUG" />

With that DEBUG logging turned on, it may or may not generate a lot of DEBUG
logs. If it does not, then that in and of itself tells us something. If it does
generate a lot of DEBUG logs, then it would be good to see what it's dumping
out in the logs.

And a big Thank You to you guys for staying engaged on this and your 
willingness to dig in!

Thanks!
-Mark

[1] https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#Summary_Page


On Dec 19, 2018, at 2:18 AM, 
<josef.zahn...@swisscom.com<mailto:josef.zahn...@swisscom.com>> 
<josef.zahn...@swisscom.com<mailto:josef.zahn...@swisscom.com>> wrote:

Hi Dano

Seems that the problem has been seen by a few people, but until now nobody from
the NiFi team really cared about it – except Mark Payne. He mentioned the part
below with the diagnostics; however, in my case this doesn’t even work (tried
it on a standalone unsecured cluster as well as on a secured cluster)! Can you
get the diagnostics on your cluster?

I guess at the end we have to open a Jira ticket to narrow it down.

Cheers Josef


One thing that I would recommend, to get more information, is to go to the REST 
endpoint (in your browser is fine)
/nifi-api/processors/<processor id>/diagnostics

Where <processor id> is the UUID of either the source or the destination of the
Connection in question. This gives us a lot of information about the internals
of Connection. The easiest way to get that Processor ID is to just click on the
processor on the canvas and look at the Operate palette on the left-hand side.
You can copy & paste from there. If you then send the diagnostics information
to us, we can analyze that to help understand what's happening.



From: dan young <danoyo...@gmail.com<mailto:danoyo...@gmail.com>>
Reply-To: "users@nifi.apache.org<mailto:users@nifi.apache.org>" 
<users@nifi.apache.org<mailto:users@nifi.apache.org>>
Date: Wednesday, 19 December 2018 at 05:28
To: NiFi Mailing List <users@nifi.apache.org<mailto:users@nifi.apache.org>>
Subject: flowfiles stuck in load balanced queue; nifi 1.8

We're seeing this more frequently where flowfiles seem to be stuck in a load 
balanced queue.  The only resolution is to disconnect the node and then restart 
that node.  After this, the flowfile disappears from the queue.  Any ideas on 
what might be going on here or what additional information I might be able to 
provide to debug this?

I've attached another thread dump and some screen shots....


Regards,

Dano

--
Sent from Gmail Mobile
<Screen Shot 2018-12-24 at 9.12.31 AM.png>
<diag.json>
<conn.json>
