Update: session.remove(newFiles) does not work. I filed https://issues.apache.org/jira/browse/NIFI-3205
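For reference while reading the thread below, here is a minimal sketch (not code from this thread, and not the CopyAndFail test processor quoted further down) of the two cleanup options being debated: transfer every FlowFile the session created on success, or remove the created children before routing the parent to failure. The class name, relationship names, and the fixed five-copy loop are illustrative assumptions only; per the update above (NIFI-3205), the session.remove() call in the failure path reportedly did not stop the content repository from growing.

package com.example.nifi;  // hypothetical package, for illustration only

import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;
import org.apache.nifi.processor.io.OutputStreamCallback;

public class UnpackWithCleanup extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder().name("success").build();
    static final Relationship REL_FAILURE = new Relationship.Builder().name("failure").build();

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.unmodifiableSet(new HashSet<>(Arrays.asList(REL_SUCCESS, REL_FAILURE)));
    }

    @Override
    public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
        final FlowFile inputFile = session.get();
        if (inputFile == null) {
            return;
        }

        final List<FlowFile> children = new ArrayList<>();
        try {
            // Stand-in for "one child FlowFile per zip entry".
            for (int i = 0; i < 5; i++) {
                FlowFile child = session.create(inputFile);
                child = session.write(child, new OutputStreamCallback() {
                    @Override
                    public void process(final OutputStream out) throws IOException {
                        out.write(new byte[100 * 1024]); // placeholder for the unpacked entry content
                    }
                });
                children.add(child);
            }
        } catch (final Exception e) {
            // The cleanup being tested in the thread: drop everything this session created,
            // then route the parent to failure instead of throwing and rolling back.
            // Per the update above (NIFI-3205), this remove() reportedly did not keep the
            // content repository from growing.
            session.remove(children);
            session.transfer(session.penalize(inputFile), REL_FAILURE);
            return;
        }

        // Happy path: every FlowFile created by the session gets a transfer relationship.
        session.transfer(children, REL_SUCCESS);
        session.transfer(inputFile, REL_SUCCESS);
    }
}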
On Thu, Dec 15, 2016 at 11:05 AM, Alan Jackoway <al...@cloudera.com> wrote:
> I am getting the successfully checkpointed message.
>
> I think I figured this out. Now we have to decide whether it's an issue in
> nifi or an issue in our code.
>
> This flow has a processor that takes large zip files, unzips them, and does
> some processing of the files. I noticed that the disk usage seems to go up
> fastest when there is a large file that is failing in the middle of one of
> these steps. I then suspected that something about the way we were creating
> new flow files out of the zips was the problem.
>
> I simulated what we were doing with the following processor in a new nifi 1.1.
> The processor takes an input file, copies it 5 times (to simulate unzip /
> process), then throws a runtime exception. I then wired a GenerateFlowFile of
> 100KB to it. I noticed the following characteristics:
> * Each time it ran, the size of the content repository went up exactly 500KB.
> * When I restarted the nifi, I got the messages about unknown files in the
> FileSystemRepository.
>
> So basically what this boils down to is: who is responsible for removing files
> from a session when a failure occurs? Should we be doing that (I will test
> next that calling session.remove before the error fixes the problem), or
> should the session keep track of the new flow files that it created? We
> assumed the session would do so, because the session yells at us if we fail to
> give a transfer relationship for one of the files.
>
> Thanks for all the help with this. I think we are closing in on the point
> where I have either a fix or a bug filed or both.
>
> Test processor I used:
> // Copyright 2016 (c) Cloudera
> package com.cloudera.edh.nifi.processors.bundles;
>
> import com.google.common.collect.Lists;
>
> import java.io.IOException;
> import java.io.InputStream;
> import java.io.OutputStream;
> import java.util.List;
>
> import org.apache.nifi.annotation.behavior.InputRequirement;
> import org.apache.nifi.annotation.behavior.InputRequirement.Requirement;
> import org.apache.nifi.flowfile.FlowFile;
> import org.apache.nifi.processor.AbstractProcessor;
> import org.apache.nifi.processor.ProcessContext;
> import org.apache.nifi.processor.ProcessSession;
> import org.apache.nifi.processor.exception.ProcessException;
> import org.apache.nifi.processor.io.InputStreamCallback;
> import org.apache.nifi.processor.io.OutputStreamCallback;
> import org.apache.nifi.stream.io.StreamUtils;
>
> /**
>  * Makes 5 copies of an incoming file, then fails and rolls back.
>  */
> @InputRequirement(value = Requirement.INPUT_REQUIRED)
> public class CopyAndFail extends AbstractProcessor {
>   @Override
>   public void onTrigger(final ProcessContext context, final ProcessSession session)
>       throws ProcessException {
>     final FlowFile inputFile = session.get();
>     if (inputFile == null) {
>       context.yield();
>       return;
>     }
>     final List<FlowFile> newFiles = Lists.newArrayList();
>
>     // Copy the file 5 times (simulates us opening a zip file and unpacking its contents)
>     for (int i = 0; i < 5; i++) {
>       session.read(inputFile, new InputStreamCallback() {
>         @Override
>         public void process(final InputStream inputStream) throws IOException {
>           FlowFile ff = session.create(inputFile);
>           ff = session.write(ff, new OutputStreamCallback() {
>             @Override
>             public void process(final OutputStream out) throws IOException {
>               StreamUtils.copy(inputStream, out);
>             }
>           });
>           newFiles.add(ff);
>         }
>       });
>     }
>
>     // THIS IS WHERE I WILL PUT session.remove TO VERIFY THAT WORKS
>
>     // Simulate an error handling some file in the zip after unpacking the rest
>     throw new RuntimeException();
>   }
> }
>
> On Wed, Dec 14, 2016 at 9:23 PM, Mark Payne <marka...@hotmail.com> wrote:
>
>> I'd be very curious to see if changing the limits addresses the issue.
>> The OOME can certainly be an issue, as well. Once that gets thrown anywhere
>> in the JVM, it's hard to vouch for the stability of the JVM at all.
>>
>> Seeing the claimant count drop to 0, then back up to 1, 2, and down to 1, 0
>> again is pretty common. The fact that you didn't see it marked as
>> destructible is interesting. Around that same time, are you seeing log
>> messages indicating that the FlowFile repo is checkpointing? They would have
>> the words "Successfully checkpointed FlowFile Repository". That should
>> happen approximately every 2 minutes.
>>
>> On Dec 14, 2016, at 8:56 PM, Alan Jackoway <al...@cloudera.com> wrote:
>>
>> I agree the limits sound low and will address that tomorrow.
>>
>> I'm not seeing FileNotFound or NoSuchFile.
>>
>> Here's an example file:
>> grep 1481763927251 logs/nifi-app.log
>> 2016-12-14 17:05:27,277 DEBUG [Timer-Driven Process Thread-36] o.a.n.c.r.c.StandardResourceClaimManager Incrementing claimant count for StandardResourceClaim[id=1481763927251-1, container=default, section=1] to 1
>> 2016-12-14 17:05:27,357 DEBUG [Timer-Driven Process Thread-2] o.a.n.c.r.c.StandardResourceClaimManager Incrementing claimant count for StandardResourceClaim[id=1481763927251-1, container=default, section=1] to 2
>> 2016-12-14 17:05:27,684 DEBUG [Timer-Driven Process Thread-36] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481763927251-1, container=default, section=1] to 1
>> 2016-12-14 17:05:27,732 DEBUG [Timer-Driven Process Thread-2] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481763927251-1, container=default, section=1] to 0
>> 2016-12-14 17:05:27,909 DEBUG [Timer-Driven Process Thread-14] o.a.n.c.r.c.StandardResourceClaimManager Incrementing claimant count for StandardResourceClaim[id=1481763927251-1, container=default, section=1] to 1
>> 2016-12-14 17:05:27,945 DEBUG [Timer-Driven Process Thread-14] o.a.n.c.r.c.StandardResourceClaimManager Incrementing claimant count for StandardResourceClaim[id=1481763927251-1, container=default, section=1] to 2
>> 2016-12-14 17:14:26,556 DEBUG [Timer-Driven Process Thread-14] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481763927251-1, container=default, section=1] to 1
>> 2016-12-14 17:14:26,556 DEBUG [Timer-Driven Process Thread-14] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481763927251-1, container=default, section=1] to 0
>>
>> This nifi-app.log covers a period when the nifi only handled two sets of files, for a total of maybe 10GB uncompressed. The content repository went over 100GB in that time. I checked a few content repository files, and they all had similar patterns - claims hit 0 twice, once around 17:05 and once around 17:14, then nothing. I brought down the nifi around 17:30.
>>
>> During that time, we did have a processor hitting OutOfMemory while unpacking a 1GB file. I'm adjusting the heap to try to make that succeed in case that was related.
>>
>> On Wed, Dec 14, 2016 at 8:32 PM, Mark Payne <marka...@hotmail.com> wrote:
>>
>> OK, so these are generally the default values for most linux systems. These are a little low, though, for what NiFi recommends and often needs. With these settings, you can easily run out of open file handles. When this happens, trying to access a file will return a FileNotFoundException even though the file exists and permissions all look good. As a result, NiFi may be failing to delete the data simply because it can't get an open file handle.
>>
>> The admin guide [1] explains the best practices for configuring these settings. Generally, after updating these settings, I think you have to logout of the machine and login again for the changes to take effect. Would recommend you update these settings and also search logs for "FileNotFound" as well as "NoSuchFile" and see if that hits anywhere.
>> [1] http://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#configuration-best-practices
>>
>> On Dec 14, 2016, at 8:25 PM, Alan Jackoway <al...@cloudera.com> wrote:
>>
>> We haven't let the disk hit 100% in a while, but it's been crossing 90%. We haven't seen the "Unable to checkpoint" message in the last 24 hours.
>>
>> $ ulimit -Hn
>> 4096
>> $ ulimit -Sn
>> 1024
>>
>> I will work on tracking a specific file next.
>>
>> On Wed, Dec 14, 2016 at 8:17 PM, Alan Jackoway <al...@cloudera.com> wrote:
>>
>> At first I thought that the drained messages always said 0, but that's not right. What should the total number of claims drained be? The number of flowfiles that made it through the system? If so, I think our number is low:
>>
>> $ grep "StandardResourceClaimManager Drained" nifi-app_2016-12-14* | grep -v "Drained 0" | awk '{sum += $9} END {print sum}'
>> 25296
>>
>> I'm not sure how to get the count of flowfiles that moved through, but I suspect that's low by an order of magnitude. That instance of nifi has handled 150k files in the last 6 hours, most of which went through a number of processors and transformations.
>>
>> Should the number of drained claims correspond to the number of flow files that moved through the system?
>> Alan
>>
>> On Wed, Dec 14, 2016 at 6:59 PM, Alan Jackoway <al...@cloudera.com> wrote:
>>
>> Some updates:
>> * We fixed the issue with missing transfer relationships, and this did not go away.
>> * We saw this a few minutes ago when the queue was at 0.
>>
>> What should I be looking for in the logs to figure out the issue?
>>
>> Thanks,
>> Alan
>>
>> On Mon, Dec 12, 2016 at 12:45 PM, Alan Jackoway <al...@cloudera.com> wrote:
>>
>> In case this is interesting, I think this started getting bad when we started hitting an error where some of our files were not given a transfer relationship. Maybe some combination of not giving flow files a relationship and the subsequent penalization is causing the problem.
>>
>> On Mon, Dec 12, 2016 at 12:16 PM, Alan Jackoway <al...@cloudera.com> wrote:
>>
>> Everything is at the default locations for these nifis.
>>
>> On one of the two machines, I did find log messages like you suggested:
>> 2016-12-11 08:00:59,389 ERROR [pool-10-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Unable to checkpoint FlowFile Repository due to java.io.FileNotFoundException: ./flowfile_repository/partition-14/3169.journal (No space left on device)
>>
>> I added the logger, which apparently takes effect right away. What am I looking for in these logs?
>> I see a lot of stuff like:
>> 2016-12-12 07:19:03,560 DEBUG [Timer-Driven Process Thread-24] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555893660-3174, container=default, section=102] to 0
>> 2016-12-12 07:19:03,561 DEBUG [Timer-Driven Process Thread-31] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555922818-3275, container=default, section=203] to 191
>> 2016-12-12 07:19:03,605 DEBUG [Timer-Driven Process Thread-8] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555880393-3151, container=default, section=79] to 142
>> 2016-12-12 07:19:03,624 DEBUG [Timer-Driven Process Thread-38] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555872053-3146, container=default, section=74] to 441
>> 2016-12-12 07:19:03,625 DEBUG [Timer-Driven Process Thread-25] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555893954-3178, container=default, section=106] to 2
>> 2016-12-12 07:19:03,647 DEBUG [Timer-Driven Process Thread-24] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555893696-3175, container=default, section=103] to 1
>> 2016-12-12 07:19:03,705 DEBUG [FileSystemRepository Workers Thread-1] o.a.n.c.r.c.StandardResourceClaimManager Drained 0 destructable claims to []
>>
>> What's puzzling to me is that both of these machines have > 100GB of free space, and I have never seen the queued size go above 20GB. It seems to me like it gets into a state where nothing is deleted long before it runs out of disk space.
>>
>> Thanks,
>> Alan
>>
>> On Mon, Dec 12, 2016 at 9:13 AM, Mark Payne <marka...@hotmail.com> wrote:
>>
>> Alan,
>>
>> Thanks for the thread-dump and the in-depth analysis!
>>
>> So in terms of the two tasks there, here's a quick explanation of what each does:
>> ArchiveOrDestroyDestructableClaims - When a Resource Claim (which maps to a file on disk) is no longer referenced by any FlowFile, it can be either archived or destroyed (depending on whether the property in nifi.properties has archiving enabled).
>> DestroyExpiredArchiveClaims - When archiving is enabled, the Resource Claims that are archived have to eventually age off. This task is responsible for ensuring that this happens.
>>
>> As you mentioned, in the Executor, if the Runnable fails it will stop running forever, and if the thread gets stuck, another will not be launched. Neither of these appears to be the case. I say this because both of those Runnables are wrapped entirely within a try { ... } catch (Throwable t) {...}. So the method will never end exceptionally. Also, the thread dump shows all of the threads created by that Thread Pool (those whose names begin with "FileSystemRepository Workers Thread-") in WAITING or TIMED_WAITING state. This means that they are sitting in the Executor waiting to be scheduled to do something else, so they aren't stuck in any kind of infinite loop or anything like that.
>> Now, with all of that being said, I have a theory as to what could perhaps be happening :)
>>
>> From the configuration that you listed below, it shows that the content repository is located at ./content_repository, which is the default. Is the FlowFile Repository also located at the default location of ./flowfile_repository? The reason that I ask is this:
>>
>> When I said above that a Resource Claim is marked destructible when no more FlowFiles reference it, that was a bit of a simplification. A more detailed explanation is this: when the FlowFile Repository is checkpointed (this happens every 2 minutes by default), its Write-Ahead Log is "rolled over" (or "checkpointed" or "compacted" or however you like to refer to it). When this happens, we do an fsync() to ensure that the data is stored safely on disk. Only then do we actually mark a claim as destructible. This is done in order to ensure that if there is a power outage and a FlowFile Repository update wasn't completely flushed to disk, we can recover. For instance, if the content of a FlowFile changes from Resource Claim A to Resource Claim B and as a result we delete Resource Claim A and then lose power, it's possible that the FlowFile Repository didn't flush that update to disk; as a result, on restart, we may still have that FlowFile pointing to Resource Claim A, which is now deleted, so we would end up having data loss. This method of only deleting Resource Claims after the FlowFile Repository has been fsync'ed means that we know on restart that Resource Claim A won't still be referenced.
>>
>> So that was probably a very wordy, verbose description of what happens, but I'm trying to make sure that I explain things adequately. So with that background... if you are storing your FlowFile Repository on the same volume as your Content Repository, the following could happen:
>>
>> At some point in time, enough data is queued up in your flow for you to run out of disk space. As a result, the FlowFile Repository is unable to be compacted. Since this is not happening, it will not mark any of the Resource Claims as destructible. This would mean that the Content Repository does not get cleaned up. So now you've got a full Content Repository and it's unable to clean up after itself, because no Resource Claims are getting marked as destructible.
>>
>> So to prove or disprove this theory, there are a few things that you can look at:
>>
>> Do you see the following anywhere in your logs: "Unable to checkpoint FlowFile Repository"?
>>
>> If you add the following to your conf/logback.xml:
>> <logger name="org.apache.nifi.controller.repository.claim.StandardResourceClaimManager" level="DEBUG" />
>> then that should allow you to see a DEBUG-level log message every time that a Resource Claim is marked destructible and every time that the Content Repository requests the collection of destructible claims ("Drained 100 destructable claims", for instance).
>>
>> Any of the logs related to those statements should be very valuable in determining what's going on.
>>
>> Thanks again for all of the detailed analysis. Hopefully we can get this all squared away and taken care of quickly!
>>
>> -Mark
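To make the checkpoint-then-destroy explanation above a bit more concrete, here is a deliberately simplified sketch of the ordering it describes. None of these class or method names are NiFi's actual code; they are invented purely to illustrate why a FlowFile Repository that can no longer checkpoint (for example, "No space left on device") would leave the Content Repository with nothing it is allowed to delete.

import java.util.ArrayList;
import java.util.List;

class CheckpointThenDestroySketch {

    private final List<String> pendingDestructible = new ArrayList<>(); // claims no live FlowFile references
    private final List<String> destructible = new ArrayList<>();        // claims the content repo may now delete

    // Runs roughly every 2 minutes: claims become deletable only after the
    // write-ahead log has been compacted and fsync'ed. If that step keeps
    // failing, nothing is ever handed over, so the content repository can
    // never clean up - which is the theory described above.
    void checkpointFlowFileRepository() {
        try {
            compactAndFsyncWriteAheadLog();
            destructible.addAll(pendingDestructible);
            pendingDestructible.clear();
        } catch (Exception e) {
            System.err.println("Unable to checkpoint FlowFile Repository: " + e);
        }
    }

    // Content repository background task: delete (or archive) whatever has been marked deletable.
    void archiveOrDestroyDestructableClaims() {
        for (String claim : destructible) {
            System.out.println("deleting " + claim);
        }
        destructible.clear();
    }

    private void compactAndFsyncWriteAheadLog() throws Exception {
        // placeholder for the real WAL rollover + fsync
    }
}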
>> On Dec 11, 2016, at 1:21 PM, Alan Jackoway <al...@cloudera.com> wrote:
>>
>> Here is what I have figured out so far.
>>
>> The cleanups are scheduled at https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-core/src/main/java/org/apache/nifi/controller/repository/FileSystemRepository.java#L232
>>
>> I'm not totally sure which one of those is the one that should be cleaning things up. It's either ArchiveOrDestroyDestructableClaims or DestroyExpiredArchiveClaims, both of which are in that class, and both of which are scheduled with scheduleWithFixedDelay. Based on the docs at https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ScheduledThreadPoolExecutor.html#scheduleWithFixedDelay(java.lang.Runnable,%20long,%20long,%20java.util.concurrent.TimeUnit) if those methods fail once, they will stop running forever. Also, if the thread got stuck it wouldn't launch a new one.
>>
>> I then hoped I would go into the logs, see a failure, and use it to figure out the issue.
>>
>> What I'm seeing instead is things like this, which comes from BinDestructableClaims:
>> 2016-12-10 23:08:50,117 INFO [Cleanup Archive for default] o.a.n.c.repository.FileSystemRepository Deleted 159 files from archive for Container default; oldest Archive Date is now Sat Dec 10 22:09:53 PST 2016; container cleanup took 34266 millis
>> These are somewhat frequent (as often as once per second, which is the scheduling frequency). Then, eventually, they just stop. Unfortunately there isn't an error message I can find that's killing these.
>>
>> At nifi startup, I see messages like this, which come from something (not sure what yet) calling the cleanup() method on FileSystemRepository:
>> 2016-12-11 09:15:38,973 INFO [main] o.a.n.c.repository.FileSystemRepository Found unknown file /home/cops/edh-bundle-extractor/content_repository/0/1481467667784-2048 (1749645 bytes) in File System Repository; removing file
>> I never see those after the initial cleanup that happens on restart.
>>
>> I attached a thread dump. I noticed at the top that there is a cleanup thread parked. I took 10 more thread dumps after this, and in every one of them the cleanup thread was parked. That thread looks like it corresponds to DestroyExpiredArchiveClaims, so I think it's incidental. I believe that if the cleanup task I need were running, it would be in one of the FileSystemRepository Workers. However, in all of my thread dumps, these were always all parked.
>>
>> Attached one of the thread dumps.
>>
>> Thanks,
>> Alan
>>
>> On Sun, Dec 11, 2016 at 12:17 PM, Mark Payne <marka...@hotmail.com> wrote:
>>
>> Alan,
>>
>> It's possible that you've run into some sort of bug that is preventing it from cleaning up the Content Repository properly. While it's stuck in this state, could you capture a thread dump (bin/nifi.sh dump thread-dump.txt)? That would help us determine if there is something going on that is preventing the cleanup from happening.
>>
>> Thanks
>> -Mark
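The scheduleWithFixedDelay behavior cited in the analysis above (and addressed in Mark's earlier reply) is easy to see in isolation. The following is a small self-contained demonstration, not NiFi code: a scheduled task that lets an exception escape is silently never run again, while the same task wrapped in try/catch (Throwable) keeps getting rescheduled.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class FixedDelayDemo {
    public static void main(String[] args) throws InterruptedException {
        final ScheduledExecutorService exec = Executors.newScheduledThreadPool(2);

        // Unguarded task: throws on its third run. Per the ScheduledExecutorService javadoc,
        // an uncaught exception suppresses all subsequent executions - the task just stops.
        final int[] unguardedRuns = {0};
        exec.scheduleWithFixedDelay(() -> {
            unguardedRuns[0]++;
            if (unguardedRuns[0] == 3) {
                throw new RuntimeException("boom");
            }
        }, 0, 100, TimeUnit.MILLISECONDS);

        // Guarded task: same failure, but wrapped entirely in try/catch (Throwable),
        // the way the FileSystemRepository tasks are described above - it keeps running.
        final int[] guardedRuns = {0};
        exec.scheduleWithFixedDelay(() -> {
            try {
                guardedRuns[0]++;
                if (guardedRuns[0] == 3) {
                    throw new RuntimeException("boom");
                }
            } catch (Throwable t) {
                // swallow and keep going
            }
        }, 0, 100, TimeUnit.MILLISECONDS);

        Thread.sleep(1000);
        exec.shutdownNow();
        // Typically prints something like: unguarded=3, guarded=10
        System.out.println("unguarded=" + unguardedRuns[0] + ", guarded=" + guardedRuns[0]);
    }
}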
>> ________________________________
>> From: Alan Jackoway <al...@cloudera.com>
>> Sent: Sunday, December 11, 2016 11:11 AM
>> To: dev@nifi.apache.org
>> Subject: Re: Content Repository Cleanup
>>
>> This just filled up again, even with nifi.content.repository.archive.enabled=false.
>>
>> On the node that is still alive, our queued flowfiles are 91 / 16.47 GB, but the content repository directory is using 646 GB.
>>
>> Is there a property I can set to make it clean things up more frequently? I expected that once I turned archive enabled off, it would delete things from the content repository as soon as the flow files weren't queued anywhere. So far the only way I have found to reliably get nifi to clear out the content repository is to restart it.
>>
>> Our version string is the following, if that interests you:
>> 11/26/2016 04:39:37 PST
>> Tagged nifi-1.1.0-RC2
>> From ${buildRevision} on branch ${buildBranch}
>>
>> Maybe we will go to the released 1.1 and see if that helps. Until then I'll be restarting a lot and digging into the code to figure out where this cleanup is supposed to happen. Any pointers on code/configs for that would be appreciated.
>>
>> Thanks,
>> Alan
>>
>> On Sun, Dec 11, 2016 at 8:51 AM, Joe Gresock <jgres...@gmail.com> wrote:
>>
>> No, in my scenario a server restart would not affect the content repository size.
>>
>> On Sun, Dec 11, 2016 at 8:46 AM, Alan Jackoway <al...@cloudera.com> wrote:
>>
>> If we were in the situation Joe G described, should we expect that when we kill and restart nifi it would clean everything up? That behavior has been consistent every time - when the disk hits 100%, we kill nifi, delete enough old content files to bring it back up, and before it brings the UI up it deletes things to get within the archive policy again. That sounds less like the files are stuck and more like it failed trying.
>>
>> For now I just turned off archiving, since we don't really need it for this use case.
>>
>> I attached a jstack from last night's failure, which looks pretty boring to me.
>>
>> On Sun, Dec 11, 2016 at 1:37 AM, Alan Jackoway <al...@cloudera.com> wrote:
>>
>> The scenario Joe G describes is almost exactly what we are doing. We bring in large files and unpack them into many smaller ones. In the most recent iteration of this problem, I saw that we had many small files queued up at the time the trouble was happening. We will try your suggestion to see if the situation improves.
>>
>> Thanks,
>> Alan
>>
>> On Sat, Dec 10, 2016 at 6:57 AM, Joe Gresock <jgres...@gmail.com> wrote:
>>
>> Not sure if your scenario is related, but one of the NiFi devs recently explained to me that the files in the content repository are actually appended together with other flow file content (please correct me if I'm explaining it wrong). That means if you have many small flow files in your current backlog, and several large flow files have recently left the flow, the large ones could still be hanging around in the content repository as long as the small ones are still there, if they're in the same appended files on disk.
>> This scenario recently happened to us: we had a flow with ~20 million tiny flow files queued up, and at the same time we were also processing a bunch of 1GB files, which left the flow quickly. The content repository was much larger than what was actually being reported in the flow stats, and our disks were almost full. On a hunch, I tried the following strategy:
>> - MergeContent the tiny flow files using flow-file-v3 format (to capture all attributes)
>> - MergeContent 10,000 of the packaged flow files using tar format for easier storage on disk
>> - PutFile into a directory
>> - GetFile from the same directory, but using back pressure from here on out (so that the flow simply wouldn't pull the same files from disk until it was really ready for them)
>> - UnpackContent (untar them)
>> - UnpackContent (turn them back into flow files with the original attributes)
>> - Then do the processing they were originally designed for
>>
>> This had the effect of very quickly reducing the size of my content repository to very nearly the actual size I saw reported in the flow, and my disk usage dropped from ~95% to 50%, which is the configured content repository max usage percentage. I haven't had any problems since.
>>
>> Hope this helps.
>> Joe
>>
>> On Sat, Dec 10, 2016 at 12:04 AM, Joe Witt <joe.w...@gmail.com> wrote:
>>
>> Alan,
>>
>> That retention percentage only has to do with the archive of data, which kicks in once a given chunk of content is no longer reachable by active flowfiles in the flow. For it to grow to 100% typically would mean that you have data backlogged in the flow that accounts for that much space. If that is certainly not the case for you, then we need to dig deeper. If you could do screenshots or share log files and stack dumps around this time, those would all be helpful. If the screenshots and such are too sensitive, please just share as much as you can.
>>
>> Thanks
>> Joe
>>
>> On Fri, Dec 9, 2016 at 9:55 PM, Alan Jackoway <al...@cloudera.com> wrote:
>>
>> One other note on this: when it came back up there were tons of messages like this:
>>
>> 2016-12-09 18:36:36,244 INFO [main] o.a.n.c.repository.FileSystemRepository Found unknown file /path/to/content_repository/498/1481329796415-87538 (1071114 bytes) in File System Repository; archiving file
>>
>> I haven't dug into what that means.
>> Alan
>>
>> On Fri, Dec 9, 2016 at 9:53 PM, Alan Jackoway <al...@cloudera.com> wrote:
>>
>> Hello,
>>
>> We have a node on which the nifi content repository keeps growing to use 100% of the disk. It's a relatively high-volume process. It chewed through more than 100GB in the three hours between when we first saw it hit 100% of the disk and when we just cleaned it up again.
>>
>> We are running nifi 1.1 for this. Our nifi.properties looked like this:
>>
>> nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
>> nifi.content.claim.max.appendable.size=10 MB
>> nifi.content.claim.max.flow.files=100
>> nifi.content.repository.directory.default=./content_repository
>> nifi.content.repository.archive.max.retention.period=12 hours
>> nifi.content.repository.archive.max.usage.percentage=50%
>> nifi.content.repository.archive.enabled=true
>> nifi.content.repository.always.sync=false
>>
>> I just bumped the retention period down to 2 hours, but should max usage percentage protect us from using 100% of the disk?
>>
>> Unfortunately we didn't get jstacks on either failure. If it hits 100% again I will make sure to get that.
>>
>> Thanks,
>> Alan
>>
>> --
>> I know what it is to be in need, and I know what it is to have plenty. I have learned the secret of being content in any and every situation, whether well fed or hungry, whether living in plenty or in want. I can do all this through him who gives me strength. *-Philippians 4:12-13*
>>
>> <thread-dump.txt>