Ryan,

Thanks. So 1.12.0 has no known issues with the content repo not being cleaned up 
properly.

As you pointed out, nifi.content.claim.max.appendable.size is intended to cap 
the amount of data that will be written to a single file (while 
nifi.content.claim.max.flow.files caps the number of FlowFiles). However, it 
does come with a couple of caveats.

(1) Once this cap is reached, NiFi won’t add any more FlowFiles to the stream, 
but once a write starts, it doesn’t spill over into another stream. So, with 
the cap set to 1 MB, you may write 100 FlowFiles, each 4 KB, and then write a 
4 MB FlowFile to the same file. The file will then be about 4.4 MB, and it 
won’t be cleaned up until all 101 FlowFiles have left your system.
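
To make that arithmetic concrete, here is a minimal sketch in plain Python. It 
only models the size-cap behavior described above; it is not NiFi’s actual 
code, and for simplicity it ignores nifi.content.claim.max.flow.files:

# The cap is checked before each write begins, never during one,
# so a single large FlowFile can push a claim file well past the cap.
cap = 1 * 1024 * 1024                          # 1 MB, as in your settings
writes = [4 * 1024] * 100 + [4 * 1024 * 1024]  # 100 x 4 KB, then one 4 MB

claim = 0
count = 0
for size in writes:
    if claim >= cap:
        claim = 0   # a new claim file would be started here
        count = 0
    claim += size   # the whole FlowFile lands in the current file
    count += 1

print(f"claim file size: {claim / (1024 * 1024):.2f} MB, FlowFiles: {count}")
# -> claim file size: 4.39 MB, FlowFiles: 101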

(2) The cap only takes effect between Process Sessions, meaning that if you 
have a Processor that processes many FlowFiles in a single session, they can 
all be written to a single file. Generally, this happens if you set the Run 
Duration to a high value. For example, if Run Duration is set to 1 second and 
there are enough FlowFiles for the Processor to work on for a full second, all 
of those FlowFiles could be written to the same file on disk.

Also of note, the files are only cleaned up when the FlowFile Repository 
checkpoints. This is determined by the 
“nifi.flowfile.repository.checkpoint.interval” property, which defaults to 20 
seconds in 1.12.0; if you have a larger value there, you may want to decrease 
it.
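
For example, in nifi.properties (the value shown here is just the 1.12.0 
default, for illustration):

nifi.flowfile.repository.checkpoint.interval=20 secs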

One thing that might help in understanding why the content claims still exist 
in the repo is to run “bin/nifi.sh diagnostics --verbose diagnostics1.txt”. 
That will write out a file, diagnostics1.txt, containing lots of diagnostic 
information, including which FlowFiles are referencing each file in the 
content repository, i.e., which FlowFiles must finish processing before the 
file can be cleaned up.
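
For example (the claim file name in the grep below is made up; substitute one 
of the large files you actually see under /var/nifi/repositories/content):

bin/nifi.sh diagnostics --verbose diagnostics1.txt
grep -n "1600000000000-1612" diagnostics1.txt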

Hope this helps!
-Mark


On Sep 17, 2020, at 11:07 AM, Ryan Hendrickson 
<ryan.andrew.hendrick...@gmail.com> wrote:

1.12.0

Thanks,
Ryan

On Thu, Sep 17, 2020 at 11:04 AM Joe Witt <joe.w...@gmail.com> wrote:
Ryan

What version are you using? I do think we had an issue that kept items around 
longer than intended, which has since been addressed.

Thanks

On Thu, Sep 17, 2020 at 7:58 AM Ryan Hendrickson 
<ryan.andrew.hendrick...@gmail.com> wrote:
Hello,
I've got ~15 million FlowFiles, each roughly 4 KB, totaling about 55 GB of 
data on my canvas.

However, the content repository (on its own partition) is completely full with 
350 GB of data.  I'm pretty certain the way Content Claims store the data is 
responsible for this.  In previous experience, we've had larger files and 
haven't seen this as much.

My guess is that as data streams through and is appended to a claim, the claim 
isn't always released as the small files leave the canvas.

We've run into this issue enough times that I figure there's probably a "best 
practice for small files" for the content claims settings.

These are our current settings:
nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
nifi.content.claim.max.appendable.size=1 MB
nifi.content.claim.max.flow.files=100
nifi.content.repository.directory.default=/var/nifi/repositories/content
nifi.content.repository.archive.max.retention.period=12 hours
nifi.content.repository.archive.max.usage.percentage=50%
nifi.content.repository.archive.enabled=true
nifi.content.repository.always.sync=false

https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#content-repository

There's 1024 folders on the disk (0-1023) for the Content Claims.
Each file inside the folders is roughly 2 MB to 8 MB (which is odd, because I 
thought the max appendable size would keep them no larger than 1 MB).

Is there a way to expand the number of folders and/or reduce the number of 
individual FlowFiles that are stored in each claim?

I'm hoping there might be a best practice out there though.

Thanks,
Ryan

