Hi again,

Unfortunately I didn't save the logs from when the node restarted, but I don't remember anything in them that gave me a clue about the reason for the blocked queue.
I just have a few logs from the weekend when the queues were in this strange state:

2017-05-14 09:01:29,635 INFO [pool-12-thread-1] org.wali.MinimalLockingWriteAheadLog org.wali.MinimalLockingWriteAheadLog@7bdc0ad3 checkpointed with 85016 Records and 18 Swap Files in 5447 milliseconds (Stop-the-world time = 37 milliseconds, Clear Edit Logs time = 30 millis), max Transaction ID 307183065
2017-05-14 09:04:48,056 INFO [Write-Ahead Local State Provider Maintenance] org.wali.MinimalLockingWriteAheadLog org.wali.MinimalLockingWriteAheadLog@265c0752 checkpointed with 2 Records and 0 Swap Files in 22 milliseconds (Stop-the-world time = 1 milliseconds, Clear Edit Logs time = 1 millis), max Transaction ID 7
2017-05-14 09:05:37,737 INFO [pool-12-thread-1] org.wali.MinimalLockingWriteAheadLog org.wali.MinimalLockingWriteAheadLog@7bdc0ad3 checkpointed with 85016 Records and 18 Swap Files in 4677 milliseconds (Stop-the-world time = 35 milliseconds, Clear Edit Logs time = 17 millis), max Transaction ID 307183065
2017-05-14 09:11:50,435 INFO [pool-12-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile Repository with 85016 records in 4057 milliseconds

These logs report 85K records that were not visible when I requested the queue status (the queue listing always came back empty), while overall the queues were reporting even more elements (over 100K). As we can see, these records were stuck: two hours later they were still there, while other records were flowing through the cluster without issue during the weekend.

2017-05-14 11:01:12,839 INFO [pool-12-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile Repository with 85016 records in 3408 milliseconds

Regarding disk space, I don't think it ran out at any moment. I even have a local backup of the flowfile directory that I made before emptying it.

I hope it helps.
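For anyone checking their own cluster for the same symptoms, a quick sketch of how I would scan for swap-related WARN/ERROR lines and check repository disk space is below. The log path and the sample lines are placeholders I made up for illustration, not output from my cluster; adjust for your install.

```shell
#!/bin/sh
# Sketch only: scan a NiFi app log for WARN/ERROR entries mentioning
# "swap", regardless of case. LOG path is an assumption (placeholder);
# point it at your real logs/nifi-app.log.
LOG=./nifi-app.log

# Placeholder log content so the sketch is self-contained; these two
# lines are invented for illustration, not real NiFi output.
printf '%s\n' \
  '2017-05-14 09:01:29,635 WARN [pool-12-thread-1] sample line mentioning Swap (placeholder)' \
  '2017-05-14 09:04:48,056 INFO [pool-12-thread-1] unrelated checkpoint line (placeholder)' > "$LOG"

# Case-insensitive match for swap mentions in WARN/ERROR entries
grep -iE '(WARN|ERROR).*swap' "$LOG"

# Check that the partition holding the flowfile repository has free space
df -h .
```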
Arnaud

On Tue, May 16, 2017 at 3:51 PM, Mark Payne <marka...@hotmail.com> wrote:

> Arnaud,
>
> Did you have any WARN or ERROR messages in the logs? I'm particularly
> interested in anything that mentions the word "Swap" or "swap" (i.e.,
> regardless of case). Is it possible that the FlowFile Repository could
> have run out of disk space?
>
> Thanks
> -Mark
>
> On May 16, 2017, at 3:34 AM, Arnaud G <greatpat...@gmail.com> wrote:
>
> Hi Matt,
>
> Thanks for your reply!
>
> I finally solved the problem by deleting all the content in the flowfile
> directory, but here are my observations:
>
> 1) The problem was coming from one of the cluster nodes; when this node
> was out of the cluster, the queues reported 0 flowfiles.
> 2) The first time I restarted this node, about 20'000 flowfiles
> reappeared and were processed; every time I subsequently restarted the
> node, about 20-30k flowfiles were again processed (I was only
> specifically monitoring one queue, but it happened for multiple other
> queues).
> 3) After 3-4 reboots of this node, the queue reported 90K elements and
> remained in this state despite multiple further restarts.
> 4) The flowfile directory on this node contained 200 MB of data.
> 5) I tried to set up flowfile expiration, but it didn't change the queue
> status.
> 6) I tried to change the backpressure threshold, without any effect.
> 7) During the problem the queue was operating normally on the cluster,
> and flowfiles were flowing through it without any issue.
>
> Arnaud
>
> On Mon, May 15, 2017 at 10:39 PM, Matt Gilman <matt.c.gil...@gmail.com>
> wrote:
>
>> Sorry for the delayed response. Similar behavior has been reported by
>> some other users [1]. Does the connection have any back pressure
>> threshold configured? Can new flowfiles be enqueued? Do the expiration
>> settings have any effect?
>>
>> Lastly, if you restart the cluster, does it claim the connection still
>> has flowfiles enqueued?
>>
>> Matt
>>
>> [1] https://issues.apache.org/jira/browse/NIFI-3897
>>
>> On Fri, May 12, 2017 at 5:47 AM, Arnaud G <greatpat...@gmail.com>
>> wrote:
>>
>>> Hi again!
>>>
>>> I currently have another issue with an incoherent queue status.
>>>
>>> Following the upgrade of a cluster to 1.2, I have a couple of queues
>>> that display a high number of flowfiles in the GUI.
>>>
>>> As the queues were not emptying despite tuning, I tried to list the
>>> content of the queue. This action returned that the queue contains no
>>> flowfiles, which is not expected, as the GUI displays another value.
>>>
>>> If I try to empty the queue, I receive the message: 0 FlowFiles
>>> (0 bytes) out of 210'000 (92.71MB) were removed from the queue.
>>>
>>> And of course I cannot delete the queue, as this action reports that
>>> the queue is not empty.
>>>
>>> So somehow it seems that the queues are empty but that the current
>>> display of the queues doesn't reflect it (it is very likely that some
>>> data were lost during the upgrade procedure, as we had to reboot a
>>> few nodes to change the heap property).
>>>
>>> What would be the best method to restore a proper state and be able
>>> to edit the flow again?
>>>
>>> Thank you!
>>>
>>> Arnaud