Mark,

To close the loop on the max-FlowFiles-per-claim property (nifi.content.claim.max.flow.files, the "100 based on the max size" one): I confirmed it was removed in 1.12.0; we just hadn't removed it from our own nifi.properties file. I've done that now.
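(For anyone following along, a quick way to confirm the stale entry is really gone; the $NIFI_HOME path below is just our layout:)

    # check whether the unused property still appears in nifi.properties
    grep -n "nifi.content.claim.max.flow.files" "$NIFI_HOME/conf/nifi.properties" \
        && echo "stale property still present" \
        || echo "property removed"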
For the server with the full disk (/var/nifi/repositories/content and /var/nifi/repositories/flowfile are both full), there are tons of content claims that persist across the 5-minute interval between diagnostics dumps. The diagnostics file is 14 MB, and the entries look like this:

default/483/1600318340509-49635, Claimant Count =0, In Use = false, Awaiting Destruction = false, References (0) = []

Running

cat diag | grep " Claimant Count =0, In Use = false, Awaiting Destruction = false, References (0)" | wc -l

yields 39,219 matching lines. Could too high a CPU load cause the claims not to be cleaned up? Is there a way to kick off a manual clean-up?

I've also checked another server with the same setup; that one doesn't have the issue. Looking at a few stats a few minutes apart:

Diag1: Queued FlowFiles: 57, Queued Bytes: 220412760, Running components: 10
Diag2: Queued FlowFiles: 4070, Queued Bytes: 285217783, Running components: 11

The Claimant Counts are cleaning up well there.

Thanks,
Ryan

On Thu, Sep 17, 2020 at 3:45 PM Mark Payne <marka...@hotmail.com> wrote:

> Ryan,
>
> OK, thanks. So the “100 based on the max size” is… “fun.” Not entirely sure when that property made it into nifi.properties - I’m guessing that when max.appendable.claim.size was added, we intended to also implement a max number of FlowFiles. But it was never implemented. So I think it has probably never been used, and it was just a bug that it ever made it into nifi.properties. I think that was actually cleaned up in 1.12.0.
>
> What will be interesting is if you wait, say, 5 minutes and then run the diagnostics dump again. Are there any files that previously had a Claimant Count of 0 and In Use = false that still exist with a Claimant Count of 0? If not, then at least we know that cleanup is working properly. If there are, that would indicate that the content repo is not cleaning up properly.
>
> On Sep 17, 2020, at 3:38 PM, Ryan Hendrickson <ryan.andrew.hendrick...@gmail.com> wrote:
>
> A couple of things from it:
>
> 1. The sum of the "Claimant Counts" equals the number of FlowFiles reported on the Canvas.
> 2. None are Awaiting Destruction.
> 3. The lowest Claimant Count is 1 (when it's not zero).
> 4. The highest Claimant Count is 4,773. (Should this one be 100 based on the max size, or maybe not if more than 100 are read in a single session?)
> 5. The sum of the "References" is 64,814.
> 6. The lowest Reference is 1 (when it's not zero).
> 7. The highest Reference is 4,773.
> 8. Some References have Swap Files (10,006) and others have FlowFiles (470).
> 9. There are 10,468 "In Use".
>
> Does anything there stick out to anyone?
>
> Thanks,
> Ryan
>
> On Thu, Sep 17, 2020 at 2:29 PM Ryan Hendrickson <ryan.andrew.hendrick...@gmail.com> wrote:
>
>> Correction - it did work. I was expecting it to be in the same folder I ran nifi.sh from, rather than NIFI_HOME.
>>
>> Reviewing it now...
>>
>> Ryan
>>
>> On Thu, Sep 17, 2020 at 1:51 PM Ryan Hendrickson <ryan.andrew.hendrick...@gmail.com> wrote:
>>
>>> Hey Mark,
>>>
>>> I should have mentioned that the PutElasticsearchHttp processors are going to 2 different clusters. We did play with different thread counts for each of them; at one point we were wondering if too large a Batch Size would make the threads block each other.
>>>
>>> It looks like PutElasticsearchHttp serializes every FlowFile to verify it's a well-formed JSON document [1]. That alone feels pretty CPU-expensive; in our case, we already know we have valid JSON. Just as an anecdotal benchmark:
>>> A combination of [MergeContent + 2x InvokeHTTP] uses a total of 9 threads to accomplish the same thing that [2x DistributeLoad + 2x PutElasticsearchHttp] does with 50 threads. The DistributeLoads need 5 threads each to keep up; PutElasticsearchHttp needs about 10 each.
>>>
>>> PutElasticsearchHttp is configured like this:
>>> Index: ${esIndex}
>>> Batch Size: 3000
>>> Index Operation: Index
>>>
>>> For the ./nifi.sh diagnostics --verbose diagnostics1.txt, I had to export TOOLS_JAR on the command line to the path where tools.jar was located.
>>>
>>> I'm not getting a file written out, though. I still have the "full" NiFi up and running; I assume that's expected? Do I need to change my logback.xml levels at all?
>>>
>>> [1] https://github.com/apache/nifi/blob/aa741cc5967f62c3c38c2a47e712b7faa6fe19ff/nifi-nar-bundles/nifi-elasticsearch-bundle/nifi-elasticsearch-processors/src/main/java/org/apache/nifi/processors/elasticsearch/PutElasticsearchHttp.java#L299
>>>
>>> Thanks,
>>> Ryan
>>>
>>> On Thu, Sep 17, 2020 at 11:43 AM Mark Payne <marka...@hotmail.com> wrote:
>>>
>>>> Ryan,
>>>>
>>>> Why are you using DistributeLoad to go to two different PutElasticsearchHttp processors? Does that perform better for you than a single PutElasticsearchHttp processor with multiple concurrent tasks? It shouldn’t really. I’ve never used that processor, but if two instances of the processor perform significantly better than 1 instance with 2 concurrent tasks, that’s probably worth looking into.
>>>>
>>>> -Mark
>>>>
>>>> On Sep 17, 2020, at 11:38 AM, Ryan Hendrickson <ryan.andrew.hendrick...@gmail.com> wrote:
>>>>
>>>> @Joe I can't export the flow.xml.gz easily, although it's pretty simple. We put just the following on its own server because DistributeLoad (bug [1]) and PutElasticsearchHttp have a hard time keeping up:
>>>>
>>>> 1. Input Port
>>>> 2. ControlRate (data rate | 1.7GB | 5 min)
>>>> 3. UpdateAttribute (Delete Attribute Regex)
>>>> 4. JoltTransformJSON
>>>> 5. FlattenJSONArray (custom; takes a one-level JSON Array and turns it into Objects)
>>>> 6. DistributeLoad
>>>>    1. PutElasticsearchHttp
>>>>    2. PutElasticsearchHttp
>>>>
>>>> Unrelated: we're experimenting with a MergeContent + InvokeHTTP combo to see if that's more performant than PutElasticsearchHttp. The Elastic one uses an ObjectMapper, string replacements, etc., and seems to cap out around 2-3 GB per 5 minutes.
>>>>
>>>> @Mark I'll check the diagnostics.
>>>>
>>>> @Jim Definitely disk space 100% used.
>>>>
>>>> [1] https://issues.apache.org/jira/browse/NIFI-1121
>>>>
>>>> Ryan
>>>>
>>>> On Thu, Sep 17, 2020 at 11:33 AM Williams, Jim <jwilli...@alertlogic.com> wrote:
>>>>
>>>>> Ryan,
>>>>>
>>>>> Is this maybe a case of exhausting inodes on the filesystem rather than exhausting the space available? If you do a ‘df -i’ on the system, what do you see for inode usage?
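(For reference, a minimal version of the check Jim is suggesting here, showing inode usage next to block usage; the mount point is just our content repo path:)

    # inodes vs. blocks on the content repository partition
    df -i /var/nifi/repositories/content
    df -h /var/nifi/repositories/content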
>>>>>
>>>>> Warm regards,
>>>>>
>>>>> *Jim Williams* | Manager, Site Reliability Engineering
>>>>> O: +1 713.341.7812 | C: +1 919.523.8767 | jwilli...@alertlogic.com | alertlogic.com <http://www.alertlogic.com/>
>>>>> <https://www.alertlogic.com/> <https://twitter.com/alertlogic> <https://www.linkedin.com/company/alert-logic>
>>>>>
>>>>> *From:* Joe Witt <joe.w...@gmail.com>
>>>>> *Sent:* Thursday, September 17, 2020 10:19 AM
>>>>> *To:* users@nifi.apache.org
>>>>> *Subject:* Re: Content Claims Filling Disk - Best practice for small files?
>>>>>
>>>>> can you share your flow.xml.gz?
>>>>>
>>>>> On Thu, Sep 17, 2020 at 8:08 AM Ryan Hendrickson <ryan.andrew.hendrick...@gmail.com> wrote:
>>>>>
>>>>> 1.12.0
>>>>>
>>>>> Thanks,
>>>>> Ryan
>>>>>
>>>>> On Thu, Sep 17, 2020 at 11:04 AM Joe Witt <joe.w...@gmail.com> wrote:
>>>>>
>>>>> Ryan
>>>>>
>>>>> What version are you using? I do think we had an issue that kept items around longer than intended that has been addressed.
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Thu, Sep 17, 2020 at 7:58 AM Ryan Hendrickson <ryan.andrew.hendrick...@gmail.com> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> I've got ~15 million FlowFiles, each roughly 4KB, totaling about 55GB of data on my canvas.
>>>>>
>>>>> However, the content repository (on its own partition) is completely full with 350GB of data. I'm pretty certain the way Content Claims store the data is responsible for this. In previous experience we've had files that are larger, and we haven't seen this as much.
>>>>>
>>>>> My guess is that as data was streaming through and being added to a claim, the claim isn't always released as the small files leave the canvas.
>>>>>
>>>>> We've run into this issue enough times that I figure there's probably a "best practice for small files" for the content claims settings.
>>>>>
>>>>> These are our current settings:
>>>>>
>>>>> nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
>>>>> nifi.content.claim.max.appendable.size=1 MB
>>>>> nifi.content.claim.max.flow.files=100
>>>>> nifi.content.repository.directory.default=/var/nifi/repositories/content
>>>>> nifi.content.repository.archive.max.retention.period=12 hours
>>>>> nifi.content.repository.archive.max.usage.percentage=50%
>>>>> nifi.content.repository.archive.enabled=true
>>>>> nifi.content.repository.always.sync=false
>>>>>
>>>>> https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#content-repository
>>>>>
>>>>> There are 1024 folders on the disk (0-1023) for the Content Claims. Each file inside the folders is roughly 2MB to 8MB (which is odd, because I thought the max appendable size would keep them no larger than 1MB).
>>>>>
>>>>> Is there a way to expand the number of folders and/or reduce the number of individual FlowFiles that are stored in the claims?
>>>>>
>>>>> I'm hoping there might be a best practice out there though.
>>>>>
>>>>> Thanks,
>>>>> Ryan
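(Back-of-the-envelope numbers for the gap discussed above, roughly 55 GB queued on the canvas vs. a 350 GB content repository. A content claim file is only eligible for cleanup or archive once no FlowFile references any part of it, so a single lingering 4 KB FlowFile can pin a whole multi-MB claim on disk. The per-claim averages below are illustrative, not measured:)

    # ~15,000,000 FlowFiles x 4 KB each is roughly the ~55 GB seen on the canvas
    echo $(( 15000000 * 4 / 1024 / 1024 )) GB    # prints ~57 GB
    # if claims average ~5 MB and ~70,000 of them are each pinned by a few
    # small FlowFiles, that alone accounts for most of the 350 GB partition
    echo $(( 70000 * 5 / 1024 )) GB              # prints ~341 GB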