Mark,

To close the loop on the max-FlowFiles-per-claim property (nifi.content.claim.max.flow.files, the "100 based on the max size" one): I confirmed it was removed in 1.12.0; we just hadn't removed it from our own nifi.properties file. I've done that now.
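(For anyone following along, a quick way to confirm the stale entry is really gone; the $NIFI_HOME path below is just our layout:)

    # check whether the unused property still appears in nifi.properties
    grep -n "nifi.content.claim.max.flow.files" "$NIFI_HOME/conf/nifi.properties" \
        && echo "stale property still present" \
        || echo "property removed"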
For the server with the full disk (/var/nifi/repositories/content and /var/nifi/repositories/flowfile are both full), there are tons of content claims that persist across the 5-minute interval between diagnostics dumps. The diagnostics file is 14 MB, and the entries look like this:

default/483/1600318340509-49635, Claimant Count =0, In Use = false, Awaiting Destruction = false, References (0) = []

Running

cat diag | grep " Claimant Count =0, In Use = false, Awaiting Destruction = false, References (0)" | wc -l

yields 39,219 matching lines. Could too high a CPU load cause the claims not to be cleaned up? Is there a way to kick off a manual clean-up?

I've also checked another server with the same setup; that one doesn't have the issue. Looking at a few stats a few minutes apart:

Diag1: Queued FlowFiles: 57, Queued Bytes: 220412760, Running components: 10
Diag2: Queued FlowFiles: 4070, Queued Bytes: 285217783, Running components: 11

The Claimant Counts are cleaning up well there.

Thanks,
Ryan

On Thu, Sep 17, 2020 at 3:45 PM Mark Payne <marka...@hotmail.com> wrote:

> Ryan,
>
> OK, thanks. So the “100 based on the max size” is… “fun.” Not entirely sure when that property made it into nifi.properties - I’m guessing that when max.appendable.claim.size was added, we intended to also implement a max number of FlowFiles. But it was never implemented. So I think it has probably never been used, and it was just a bug that it ever made it into nifi.properties. I think that was actually cleaned up in 1.12.0.
>
> What will be interesting is if you wait, say, 5 minutes and then run the diagnostics dump again. Are there any files that previously had a Claimant Count of 0 and In Use = false that still exist with a Claimant Count of 0? If not, then at least we know that cleanup is working properly. If there are, that would indicate that the content repo is not cleaning up properly.
>
> On Sep 17, 2020, at 3:38 PM, Ryan Hendrickson <ryan.andrew.hendrick...@gmail.com> wrote:
>
> A couple of things from it:
>
> 1. The sum of the "Claimant Counts" equals the number of FlowFiles reported on the Canvas.
> 2. None are Awaiting Destruction.
> 3. The lowest Claimant Count is 1 (when it's not zero).
> 4. The highest Claimant Count is 4,773. (Should this one be 100 based on the max size, or maybe not if more than 100 are read in a single session?)
> 5. The sum of the "References" is 64,814.
> 6. The lowest Reference is 1 (when it's not zero).
> 7. The highest Reference is 4,773.
> 8. Some References have Swap Files (10,006) and others have FlowFiles (470).
> 9. There are 10,468 "In Use".
>
> Does anything there stick out to anyone?
>
> Thanks,
> Ryan
>
> On Thu, Sep 17, 2020 at 2:29 PM Ryan Hendrickson <ryan.andrew.hendrick...@gmail.com> wrote:
>
>> Correction - it did work. I was expecting it to be in the same folder I ran nifi.sh from, rather than NIFI_HOME.
>>
>> Reviewing it now...
>>
>> Ryan
>>
>> On Thu, Sep 17, 2020 at 1:51 PM Ryan Hendrickson <ryan.andrew.hendrick...@gmail.com> wrote:
>>
>>> Hey Mark,
>>>
>>> I should have mentioned that the PutElasticsearchHttp processors are going to 2 different clusters. We did play with different thread counts for each of them; at one point we were wondering if too large a Batch Size would make the threads block each other.
>>>
>>> It looks like PutElasticsearchHttp serializes every FlowFile to verify it's a well-formed JSON document [1]. That alone feels pretty CPU-expensive; in our case, we already know we have valid JSON. Just as an anecdotal benchmark:
>>> A combination of [MergeContent + 2x InvokeHTTP] uses a total of 9 threads to accomplish the same thing that [2x DistributeLoad + 2x PutElasticsearchHttp] does with 50 threads. The DistributeLoads need 5 threads each to keep up; PutElasticsearchHttp needs about 10 each.
>>>
>>> PutElasticsearchHttp is configured like this:
>>> Index: ${esIndex}
>>> Batch Size: 3000
>>> Index Operation: Index
>>>
>>> For the ./nifi.sh diagnostics --verbose diagnostics1.txt, I had to export TOOLS_JAR on the command line to the path where tools.jar was located.
>>>
>>> I'm not getting a file written out, though. I still have the "full" NiFi up and running; I assume that's expected? Do I need to change my logback.xml levels at all?
>>>
>>> [1] https://github.com/apache/nifi/blob/aa741cc5967f62c3c38c2a47e712b7faa6fe19ff/nifi-nar-bundles/nifi-elasticsearch-bundle/nifi-elasticsearch-processors/src/main/java/org/apache/nifi/processors/elasticsearch/PutElasticsearchHttp.java#L299
>>>
>>> Thanks,
>>> Ryan
>>>
>>> On Thu, Sep 17, 2020 at 11:43 AM Mark Payne <marka...@hotmail.com> wrote:
>>>
>>>> Ryan,
>>>>
>>>> Why are you using DistributeLoad to go to two different PutElasticsearchHttp processors? Does that perform better for you than a single PutElasticsearchHttp processor with multiple concurrent tasks? It shouldn’t really. I’ve never used that processor, but if two instances of the processor perform significantly better than 1 instance with 2 concurrent tasks, that’s probably worth looking into.
>>>>
>>>> -Mark
>>>>
>>>> On Sep 17, 2020, at 11:38 AM, Ryan Hendrickson <ryan.andrew.hendrick...@gmail.com> wrote:
>>>>
>>>> @Joe I can't export the flow.xml.gz easily, although it's pretty simple. We put just the following on its own server because DistributeLoad (bug [1]) and PutElasticsearchHttp have a hard time keeping up:
>>>>
>>>> 1. Input Port
>>>> 2. ControlRate (data rate | 1.7GB | 5 min)
>>>> 3. UpdateAttribute (Delete Attribute Regex)
>>>> 4. JoltTransformJSON
>>>> 5. FlattenJSONArray (custom; takes a one-level JSON Array and turns it into Objects)
>>>> 6. DistributeLoad
>>>>    1. PutElasticsearchHttp
>>>>    2. PutElasticsearchHttp
>>>>
>>>> Unrelated: we're experimenting with a MergeContent + InvokeHTTP combo to see if that's more performant than PutElasticsearchHttp. The Elastic one uses an ObjectMapper, string replacements, etc., and seems to cap out around 2-3 GB per 5 minutes.
>>>>
>>>> @Mark I'll check the diagnostics.
>>>>
>>>> @Jim Definitely disk space 100% used.
>>>>
>>>> [1] https://issues.apache.org/jira/browse/NIFI-1121
>>>>
>>>> Ryan
>>>>
>>>> On Thu, Sep 17, 2020 at 11:33 AM Williams, Jim <jwilli...@alertlogic.com> wrote:
>>>>
>>>>> Ryan,
>>>>>
>>>>> Is this maybe a case of exhausting inodes on the filesystem rather than exhausting the space available? If you do a ‘df -i’ on the system, what do you see for inode usage?
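(For reference, a minimal version of the check Jim is suggesting here, showing inode usage next to block usage; the mount point is just our content repo path:)

    # inodes vs. blocks on the content repository partition
    df -i /var/nifi/repositories/content
    df -h /var/nifi/repositories/content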
>>>>>
>>>>> Warm regards,
>>>>>
>>>>> *Jim Williams* | Manager, Site Reliability Engineering
>>>>> O: +1 713.341.7812 | C: +1 919.523.8767 | jwilli...@alertlogic.com | alertlogic.com <http://www.alertlogic.com/>
>>>>> <https://www.alertlogic.com/> <https://twitter.com/alertlogic> <https://www.linkedin.com/company/alert-logic>
>>>>>
>>>>> *From:* Joe Witt <joe.w...@gmail.com>
>>>>> *Sent:* Thursday, September 17, 2020 10:19 AM
>>>>> *To:* users@nifi.apache.org
>>>>> *Subject:* Re: Content Claims Filling Disk - Best practice for small files?
>>>>>
>>>>> can you share your flow.xml.gz?
>>>>>
>>>>> On Thu, Sep 17, 2020 at 8:08 AM Ryan Hendrickson <ryan.andrew.hendrick...@gmail.com> wrote:
>>>>>
>>>>> 1.12.0
>>>>>
>>>>> Thanks,
>>>>> Ryan
>>>>>
>>>>> On Thu, Sep 17, 2020 at 11:04 AM Joe Witt <joe.w...@gmail.com> wrote:
>>>>>
>>>>> Ryan
>>>>>
>>>>> What version are you using? I do think we had an issue that kept items around longer than intended that has been addressed.
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Thu, Sep 17, 2020 at 7:58 AM Ryan Hendrickson <ryan.andrew.hendrick...@gmail.com> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> I've got ~15 million FlowFiles, each roughly 4KB, totaling about 55GB of data on my canvas.
>>>>>
>>>>> However, the content repository (on its own partition) is completely full with 350GB of data. I'm pretty certain the way Content Claims store the data is responsible for this. In previous experience we've had files that are larger, and we haven't seen this as much.
>>>>>
>>>>> My guess is that as data was streaming through and being added to a claim, the claim isn't always released as the small files leave the canvas.
>>>>>
>>>>> We've run into this issue enough times that I figure there's probably a "best practice for small files" for the content claims settings.
>>>>>
>>>>> These are our current settings:
>>>>>
>>>>> nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
>>>>> nifi.content.claim.max.appendable.size=1 MB
>>>>> nifi.content.claim.max.flow.files=100
>>>>> nifi.content.repository.directory.default=/var/nifi/repositories/content
>>>>> nifi.content.repository.archive.max.retention.period=12 hours
>>>>> nifi.content.repository.archive.max.usage.percentage=50%
>>>>> nifi.content.repository.archive.enabled=true
>>>>> nifi.content.repository.always.sync=false
>>>>>
>>>>> https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#content-repository
>>>>>
>>>>> There are 1024 folders on the disk (0-1023) for the Content Claims. Each file inside the folders is roughly 2MB to 8MB (which is odd, because I thought the max appendable size would keep them no larger than 1MB).
>>>>>
>>>>> Is there a way to expand the number of folders and/or reduce the number of individual FlowFiles that are stored in the claims?
>>>>>
>>>>> I'm hoping there might be a best practice out there though.
>>>>>
>>>>> Thanks,
>>>>> Ryan
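(Back-of-the-envelope numbers for the gap discussed above, roughly 55 GB queued on the canvas vs. a 350 GB content repository. A content claim file is only eligible for cleanup or archive once no FlowFile references any part of it, so a single lingering 4 KB FlowFile can pin a whole multi-MB claim on disk. The per-claim averages below are illustrative, not measured:)

    # ~15,000,000 FlowFiles x 4 KB each is roughly the ~55 GB seen on the canvas
    echo $(( 15000000 * 4 / 1024 / 1024 )) GB    # prints ~57 GB
    # if claims average ~5 MB and ~70,000 of them are each pinned by a few
    # small FlowFiles, that alone accounts for most of the 350 GB partition
    echo $(( 70000 * 5 / 1024 )) GB              # prints ~341 GB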