OK, so these are generally the default values for most Linux systems. They are a little low, though, for what NiFi recommends and often needs. With these settings you can easily run out of open file handles; when that happens, trying to access a file returns a FileNotFoundException even though the file exists and the permissions all look good. As a result, NiFi may be failing to delete the data simply because it can't get an open file handle.
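To make that failure mode concrete, here is a minimal standalone Java demo (not NiFi code; it assumes a Linux machine with a readable /etc/hosts): once the process exhausts its descriptor limit, opening a file that plainly exists fails with a FileNotFoundException whose message ends in "Too many open files":

    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    public class FdExhaustionDemo {
        public static void main(String[] args) throws IOException {
            List<FileInputStream> open = new ArrayList<>();
            try {
                // Keep opening the same existing, readable file until the
                // per-process descriptor limit (ulimit -n) is exhausted.
                while (true) {
                    open.add(new FileInputStream("/etc/hosts"));
                }
            } catch (FileNotFoundException e) {
                // On Linux the message typically ends with "(Too many open
                // files)" even though the file exists and is readable.
                System.out.println("Failed after " + open.size() + " streams: " + e.getMessage());
            } finally {
                for (FileInputStream in : open) {
                    in.close();
                }
            }
        }
    }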
The admin guide [1] explains the best practices for configuring these settings. Generally, after updating these settings, I think you have to log out of the machine and log in again for the changes to take effect. I would recommend you update these settings and also search the logs for "FileNotFound" as well as "NoSuchFile" and see if that hits anywhere.

[1] http://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#configuration-best-practices

On Dec 14, 2016, at 8:25 PM, Alan Jackoway <al...@cloudera.com> wrote:

We haven't let the disk hit 100% in a while, but it's been crossing 90%. We haven't seen the "Unable to checkpoint" message in the last 24 hours.

$ ulimit -Hn
4096
$ ulimit -Sn
1024

I will work on tracking a specific file next.

On Wed, Dec 14, 2016 at 8:17 PM, Alan Jackoway <al...@cloudera.com> wrote:

At first I thought that the drained messages always said 0, but that's not right. What should the total number of claims drained be? The number of flowfiles that made it through the system? If so, I think our number is low:

$ grep "StandardResourceClaimManager Drained" nifi-app_2016-12-14* | grep -v "Drained 0" | awk '{sum += $9} END {print sum}'
25296

I'm not sure how to get the count of flowfiles that moved through, but I suspect that's low by an order of magnitude. That instance of nifi has handled 150k files in the last 6 hours, most of which went through a number of processors and transformations. Should the number of drained claims correspond to the number of flow files that moved through the system?

Alan

On Wed, Dec 14, 2016 at 6:59 PM, Alan Jackoway <al...@cloudera.com> wrote:

Some updates:
* We fixed the issue with missing transfer relationships, and this did not go away.
* We saw this a few minutes ago when the queue was at 0.

What should I be looking for in the logs to figure out the issue?

Thanks,
Alan

On Mon, Dec 12, 2016 at 12:45 PM, Alan Jackoway <al...@cloudera.com> wrote:

In case this is interesting, I think this started getting bad when we started hitting an error where some of our files were not given a transfer relationship. Maybe some combination of not giving flow files a relationship and the subsequent penalization is causing the problem.

On Mon, Dec 12, 2016 at 12:16 PM, Alan Jackoway <al...@cloudera.com> wrote:

Everything is at the default locations for these nifis. On one of the two machines, I did find log messages like you suggested:

2016-12-11 08:00:59,389 ERROR [pool-10-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Unable to checkpoint FlowFile Repository due to java.io.FileNotFoundException: ./flowfile_repository/partition-14/3169.journal (No space left on device)

I added the logger, which apparently takes effect right away. What am I looking for in these logs?
I see a lot of stuff like:

2016-12-12 07:19:03,560 DEBUG [Timer-Driven Process Thread-24] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555893660-3174, container=default, section=102] to 0
2016-12-12 07:19:03,561 DEBUG [Timer-Driven Process Thread-31] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555922818-3275, container=default, section=203] to 191
2016-12-12 07:19:03,605 DEBUG [Timer-Driven Process Thread-8] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555880393-3151, container=default, section=79] to 142
2016-12-12 07:19:03,624 DEBUG [Timer-Driven Process Thread-38] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555872053-3146, container=default, section=74] to 441
2016-12-12 07:19:03,625 DEBUG [Timer-Driven Process Thread-25] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555893954-3178, container=default, section=106] to 2
2016-12-12 07:19:03,647 DEBUG [Timer-Driven Process Thread-24] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555893696-3175, container=default, section=103] to 1
2016-12-12 07:19:03,705 DEBUG [FileSystemRepository Workers Thread-1] o.a.n.c.r.c.StandardResourceClaimManager Drained 0 destructable claims to []

What's puzzling to me is that both of these machines have > 100GB of free space, and I have never seen the queued size go above 20GB. It seems to me like it gets into a state where nothing is deleted long before it runs out of disk space.

Thanks,
Alan
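For readers following along, those "Decrementing claimant count" lines are the content repository's reference counting. Here is a heavily simplified, hypothetical sketch of that bookkeeping (the real StandardResourceClaimManager also defers destruction until the FlowFile Repository checkpoints, as Mark explains below):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.atomic.AtomicInteger;

    // Hypothetical model: each claim (a file on disk shared by many
    // flowfiles) tracks how many flowfiles still reference it.
    class ClaimManagerSketch {
        private final ConcurrentMap<String, AtomicInteger> claimantCounts = new ConcurrentHashMap<>();
        private final Queue<String> destructable = new ConcurrentLinkedQueue<>();

        void incrementClaimantCount(String claimId) {
            claimantCounts.computeIfAbsent(claimId, id -> new AtomicInteger()).incrementAndGet();
        }

        void decrementClaimantCount(String claimId) {
            AtomicInteger count = claimantCounts.get(claimId);
            if (count != null && count.decrementAndGet() == 0) {
                // Only a claim that nothing references may be archived or deleted.
                destructable.add(claimId);
            }
        }

        // Called periodically by the repository's cleanup workers;
        // "Drained 0 destructable claims to []" means this came back empty.
        List<String> drainDestructable(int max) {
            List<String> drained = new ArrayList<>();
            String id;
            while (drained.size() < max && (id = destructable.poll()) != null) {
                drained.add(id);
            }
            return drained;
        }
    }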
On Mon, Dec 12, 2016 at 9:13 AM, Mark Payne <marka...@hotmail.com> wrote:

Alan,

Thanks for the thread dump and the in-depth analysis! In terms of the two tasks there, here's a quick explanation of what each does:

ArchiveOrDestroyDestructableClaims - When a Resource Claim (which maps to a file on disk) is no longer referenced by any FlowFile, it can be either archived or destroyed (depending on whether the property in nifi.properties has archiving enabled).

DestroyExpiredArchiveClaims - When archiving is enabled, the Resource Claims that are archived have to eventually age off. This task is responsible for ensuring that this happens.

As you mentioned, in the Executor, if the Runnable fails it will stop running forever, and if the thread gets stuck, another will not be launched. Neither of these appears to be the case. I say this because both of those Runnables are wrapped entirely within a try { ... } catch (Throwable t) { ... }, so the method will never end exceptionally. Also, the thread dump shows all of the threads created by that Thread Pool (those whose names begin with "FileSystemRepository Workers Thread-") in WAITING or TIMED_WAITING state. This means that they are sitting in the Executor waiting to be scheduled to do something else, so they aren't stuck in any kind of infinite loop or anything like that.

Now, with all of that being said, I have a theory as to what could perhaps be happening :) From the configuration that you listed below, it shows that the content repository is located at ./content_repository, which is the default. Is the FlowFile Repository also located at the default location of ./flowfile_repository?

The reason that I ask is this: when I said above that a Resource Claim is marked destructible when no more FlowFiles reference it, that was a bit of a simplification. A more detailed explanation is this: when the FlowFile Repository is checkpointed (this happens every 2 minutes by default), its Write-Ahead Log is "rolled over" (or "checkpointed" or "compacted" or however you like to refer to it). When this happens, we do an fsync() to ensure that the data is stored safely on disk. Only then do we actually mark a claim as destructible. This is done in order to ensure that if there is a power outage and a FlowFile Repository update wasn't completely flushed to disk, we can recover. For instance, if the content of a FlowFile changes from Resource Claim A to Resource Claim B and as a result we delete Resource Claim A and then lose power, it's possible that the FlowFile Repository didn't flush that update to disk; as a result, on restart, we may still have that FlowFile pointing to Resource Claim A, which is now deleted, so we would end up having data loss. This method of only deleting Resource Claims after the FlowFile Repository has been fsync'ed means that we know on restart that Resource Claim A won't still be referenced. That was probably a very wordy, verbose description of what happens, but I'm trying to make sure that I explain things adequately.

So with that background... if you are storing your FlowFile Repository on the same volume as your Content Repository, the following could happen: at some point in time, enough data is queued up in your flow for you to run out of disk space. As a result, the FlowFile Repository is unable to be compacted. Since that is not happening, it will not mark any of the Resource Claims as destructible, which means that the Content Repository does not get cleaned up. So now you've got a full Content Repository, and it's unable to clean up after itself, because no Resource Claims are getting marked as destructible.

To prove or disprove this theory, there are a few things that you can look at. Do you see the following anywhere in your logs?

Unable to checkpoint FlowFile Repository

If you add the following to your conf/logback.xml:

<logger name="org.apache.nifi.controller.repository.claim.StandardResourceClaimManager" level="DEBUG" />

then that should allow you to see a DEBUG-level log message every time that a Resource Claim is marked destructible and every time that the Content Repository requests the collection of destructible claims ("Drained 100 destructable claims", for instance). Any of the logs related to those statements should be very valuable in determining what's going on.

Thanks again for all of the detailed analysis. Hopefully we can get this all squared away and taken care of quickly!

-Mark
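A rough, hypothetical sketch of the ordering Mark just described (heavily simplified; the real WriteAheadFlowFileRepository is more involved): claims are handed over for destruction only after the checkpoint has been fsync'ed, so a checkpoint that keeps failing, e.g. with "No space left on device", means nothing is ever marked destructible:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.util.Collection;

    // Hypothetical sketch of the checkpoint-then-destroy ordering.
    class CheckpointSketch {
        void checkpoint(Path checkpointFile, byte[] repoSnapshot,
                        Collection<String> claimsAwaitingDestruction) throws IOException {
            try (FileChannel ch = FileChannel.open(checkpointFile,
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
                ch.write(ByteBuffer.wrap(repoSnapshot));
                // fsync: the FlowFile Repository state is durable before any
                // referenced content may be deleted. If this throws (say,
                // "No space left on device"), we never reach the loop below,
                // and the Content Repository can no longer clean up after itself.
                ch.force(true);
            }
            for (String claimId : claimsAwaitingDestruction) {
                markDestructable(claimId);
            }
        }

        private void markDestructable(String claimId) {
            // hand the claim to the Content Repository's cleanup workers
        }
    }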
On Dec 11, 2016, at 1:21 PM, Alan Jackoway <al...@cloudera.com> wrote:

Here is what I have figured out so far. The cleanups are scheduled at https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-core/src/main/java/org/apache/nifi/controller/repository/FileSystemRepository.java#L232

I'm not totally sure which one of those is the one that should be cleaning things up. It's either ArchiveOrDestroyDestructableClaims or DestroyExpiredArchiveClaims, both of which are in that class, and both of which are scheduled with scheduleWithFixedDelay. Based on the docs at https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ScheduledThreadPoolExecutor.html#scheduleWithFixedDelay(java.lang.Runnable,%20long,%20long,%20java.util.concurrent.TimeUnit), if those methods fail once, they will stop running forever. Also, if the thread got stuck, it wouldn't launch a new one.

I then hoped I would go into the logs, see a failure, and use it to figure out the issue. What I'm seeing instead is things like this, which comes from BinDestructableClaims:

2016-12-10 23:08:50,117 INFO [Cleanup Archive for default] o.a.n.c.repository.FileSystemRepository Deleted 159 files from archive for Container default; oldest Archive Date is now Sat Dec 10 22:09:53 PST 2016; container cleanup took 34266 millis

These are somewhat frequent (as often as once per second, which is the scheduling frequency). Then, eventually, they just stop. Unfortunately there isn't an error message I can find that's killing them.

At nifi startup, I see messages like this, which come from something (not sure what yet) calling the cleanup() method on FileSystemRepository:

2016-12-11 09:15:38,973 INFO [main] o.a.n.c.repository.FileSystemRepository Found unknown file /home/cops/edh-bundle-extractor/content_repository/0/1481467667784-2048 (1749645 bytes) in File System Repository; removing file

I never see those after the initial cleanup that happens on restart.

I attached a thread dump. I noticed at the top that there is a cleanup thread parked. I took 10 more thread dumps after this, and in every one of them the cleanup thread was parked. That thread looks like it corresponds to DestroyExpiredArchiveClaims, so I think it's incidental. I believe that if the cleanup task I need were running, it would be in one of the FileSystemRepository Workers. However, in all of my thread dumps, those were always all parked.

Thanks,
Alan
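To see concretely why a single uncaught failure would matter here, and why the try { ... } catch (Throwable t) wrapping Mark describes above prevents it, here is a small standalone demo (not NiFi code) of the documented ScheduledThreadPoolExecutor behavior: an exception that escapes the task silently cancels all future runs.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicInteger;

    public class FixedDelayDemo {
        public static void main(String[] args) throws InterruptedException {
            ScheduledExecutorService executor = Executors.newScheduledThreadPool(2);
            AtomicInteger unguarded = new AtomicInteger();
            AtomicInteger guarded = new AtomicInteger();

            // Throws on the third run; scheduleWithFixedDelay then suppresses
            // all subsequent executions, with no error ever printed.
            executor.scheduleWithFixedDelay(() -> {
                if (unguarded.incrementAndGet() == 3) {
                    throw new RuntimeException("boom");
                }
            }, 0, 100, TimeUnit.MILLISECONDS);

            // Same failure, but swallowed inside the task, so it keeps running.
            executor.scheduleWithFixedDelay(() -> {
                try {
                    if (guarded.incrementAndGet() == 3) {
                        throw new RuntimeException("boom");
                    }
                } catch (Throwable t) {
                    // log and carry on, as NiFi's cleanup tasks do
                }
            }, 0, 100, TimeUnit.MILLISECONDS);

            Thread.sleep(2000);
            System.out.println("unguarded ran " + unguarded.get() + " times; guarded ran " + guarded.get());
            executor.shutdown();
        }
    }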
On Sun, Dec 11, 2016 at 12:17 PM, Mark Payne <marka...@hotmail.com> wrote:

Alan,

It's possible that you've run into some sort of bug that is preventing it from cleaning up the Content Repository properly. While it's stuck in this state, could you capture a thread dump (bin/nifi.sh dump thread-dump.txt)? That would help us determine if there is something going on that is preventing the cleanup from happening.

Thanks
-Mark

________________________________
From: Alan Jackoway <al...@cloudera.com>
Sent: Sunday, December 11, 2016 11:11 AM
To: dev@nifi.apache.org
Subject: Re: Content Repository Cleanup

This just filled up again, even with nifi.content.repository.archive.enabled=false. On the node that is still alive, our queued flowfiles are 91 / 16.47 GB, but the content repository directory is using 646 GB. Is there a property I can set to make it clean things up more frequently? I expected that once I turned archiving off, it would delete things from the content repository as soon as the flow files weren't queued anywhere. So far the only way I have found to reliably get nifi to clear out the content repository is to restart it.

Our version string is the following, if that interests you:

11/26/2016 04:39:37 PST Tagged nifi-1.1.0-RC2 From ${buildRevision} on branch ${buildBranch}

Maybe we will go to the released 1.1 and see if that helps. Until then I'll be restarting a lot and digging into the code to figure out where this cleanup is supposed to happen. Any pointers on code/configs for that would be appreciated.

Thanks,
Alan

On Sun, Dec 11, 2016 at 8:51 AM, Joe Gresock <jgres...@gmail.com> wrote:

No, in my scenario a server restart would not affect the content repository size.

On Sun, Dec 11, 2016 at 8:46 AM, Alan Jackoway <al...@cloudera.com> wrote:

If we were in the situation Joe G described, should we expect that when we kill and restart nifi it would clean everything up? That behavior has been consistent every time - when the disk hits 100%, we kill nifi, delete enough old content files to bring it back up, and before it brings the UI up it deletes things to get within the archive policy again. That sounds less like the files are stuck and more like it failed trying. For now I just turned off archiving, since we don't really need it for this use case. I attached a jstack from last night's failure, which looks pretty boring to me.

On Sun, Dec 11, 2016 at 1:37 AM, Alan Jackoway <al...@cloudera.com> wrote:

The scenario Joe G describes is almost exactly what we are doing. We bring in large files and unpack them into many smaller ones. In the most recent iteration of this problem, I saw that we had many small files queued up at the time trouble was happening. We will try your suggestion to see if the situation improves.

Thanks,
Alan

On Sat, Dec 10, 2016 at 6:57 AM, Joe Gresock <jgres...@gmail.com> wrote:

Not sure if your scenario is related, but one of the NiFi devs recently explained to me that the files in the content repository are actually appended together with other flow file content (please correct me if I'm explaining it wrong). That means if you have many small flow files in your current backlog, and several large flow files have recently left the flow, the large ones could still be hanging around in the content repository as long as the small ones are still there, if they're in the same appended files on disk.

This scenario recently happened to us: we had a flow with ~20 million tiny flow files queued up, and at the same time we were also processing a bunch of 1GB files, which left the flow quickly. The content repository was much larger than what was actually being reported in the flow stats, and our disks were almost full. On a hunch, I tried the following strategy:

- MergeContent the tiny flow files using flow-file-v3 format (to capture all attributes)
- MergeContent 10,000 of the packaged flow files using tar format for easier storage on disk
- PutFile into a directory
- GetFile from the same directory, but using back pressure from here on out (so that the flow simply wouldn't pull the same files from disk until it was really ready for them)
- UnpackContent (untar them)
- UnpackContent (turn them back into flow files with the original attributes)
- Then do the processing they were originally designed for

This had the effect of very quickly reducing the size of my content repository to very nearly the actual size I saw reported in the flow, and my disk usage dropped from ~95% to 50%, which is the configured content repository max usage percentage. I haven't had any problems since.

Hope this helps.
Joe
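To put rough numbers on the pinning effect Joe describes, here is a hypothetical illustration (the per-claim limits are taken from the nifi.properties quoted at the bottom of the thread; whether a given large payload actually shares a claim with small ones depends on write order, so treat this as an assumption):

    // Hypothetical illustration of content "pinning" in a shared claim file.
    // Assumes one 1 GB payload was appended into the same claim as 99 tiny
    // payloads that are still queued (nifi.content.claim.max.flow.files=100).
    public class ClaimPinningDemo {
        public static void main(String[] args) {
            long largePayloadBytes = 1_000_000_000L; // left the flow long ago
            long tinyPayloadBytes = 1_000L;          // still queued
            int tinyCount = 99;

            long claimFileOnDisk = largePayloadBytes + tinyCount * tinyPayloadBytes;
            long reportedInFlow = tinyCount * tinyPayloadBytes;

            System.out.printf("claim file on disk:   %,d bytes%n", claimFileOnDisk);
            System.out.printf("reported in the flow: %,d bytes%n", reportedInFlow);
            // The ~1 GB difference cannot be reclaimed until the claimant count
            // for this claim reaches zero, i.e. until every tiny flowfile that
            // shares the claim has left the flow.
        }
    }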
On Sat, Dec 10, 2016 at 12:04 AM, Joe Witt <joe.w...@gmail.com> wrote:

Alan,

That retention percentage only has to do with the archive of data, which kicks in once a given chunk of content is no longer reachable by active flowfiles in the flow. For it to grow to 100% typically would mean that you have data backlogged in the flow that accounts for that much space. If that is certainly not the case for you, then we need to dig deeper. If you could do screenshots or share log files and stack dumps around this time, those would all be helpful. If the screenshots and such are too sensitive, please just share as much as you can.

Thanks
Joe

On Fri, Dec 9, 2016 at 9:55 PM, Alan Jackoway <al...@cloudera.com> wrote:

One other note on this: when it came back up, there were tons of messages like this:

2016-12-09 18:36:36,244 INFO [main] o.a.n.c.repository.FileSystemRepository Found unknown file /path/to/content_repository/498/1481329796415-87538 (1071114 bytes) in File System Repository; archiving file

I haven't dug into what that means.

Alan

On Fri, Dec 9, 2016 at 9:53 PM, Alan Jackoway <al...@cloudera.com> wrote:

Hello,

We have a node on which the nifi content repository keeps growing to use 100% of the disk. It's a relatively high-volume process. It chewed through more than 100GB in the three hours between when we first saw it hit 100% of the disk and when we just cleaned it up again. We are running nifi 1.1 for this. Our nifi.properties looked like this:

nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
nifi.content.claim.max.appendable.size=10 MB
nifi.content.claim.max.flow.files=100
nifi.content.repository.directory.default=./content_repository
nifi.content.repository.archive.max.retention.period=12 hours
nifi.content.repository.archive.max.usage.percentage=50%
nifi.content.repository.archive.enabled=true
nifi.content.repository.always.sync=false

I just bumped the retention period down to 2 hours, but should the max usage percentage protect us from using 100% of the disk? Unfortunately we didn't get jstacks on either failure. If it hits 100% again, I will make sure to get that.

Thanks,
Alan

--
I know what it is to be in need, and I know what it is to have plenty. I have learned the secret of being content in any and every situation, whether well fed or hungry, whether living in plenty or in want. I can do all this through him who gives me strength. *-Philippians 4:12-13*

<thread-dump.txt>