Hey Mark, I should have mentioned that PutElasticsearchHttp is going to 2 different clusters. We did play with different thread counts for each of them. At one point we were wondering if too large a Batch Size would make the threads block each other.

It looks like PutElasticsearchHttp serializes every FlowFile to verify it's a well-formed JSON document [1]. That alone feels pretty CPU-expensive. In our case, we already know we have valid JSON.

Just as an anecdotal benchmark: a combination of [MergeContent + 2x InvokeHTTP] uses a total of 9 threads to accomplish the same thing that [2x DistributeLoad + 2x PutElasticsearchHttp] does with 50 threads. The DistributeLoad processors need 5 threads each to keep up. The PutElasticsearchHttp processors need about 10 each.

PutElasticsearchHttp is configured like this:

Index: ${esIndex}
Batch Size: 3000
Index Operation: Index

For the ./nifi.sh diagnostics --verbose diagnostics1.txt, I had to export TOOLS_JAR on the command line to the path where tools.jar was located. I'm not getting a file written out, though. I still have the "full" NiFi up and running. I assume it should be? Do I need to change my logback.xml levels at all?

[1] https://github.com/apache/nifi/blob/aa741cc5967f62c3c38c2a47e712b7faa6fe19ff/nifi-nar-bundles/nifi-elasticsearch-bundle/nifi-elasticsearch-processors/src/main/java/org/apache/nifi/processors/elasticsearch/PutElasticsearchHttp.java#L299

Thanks,
Ryan

On Thu, Sep 17, 2020 at 11:43 AM Mark Payne <marka...@hotmail.com> wrote:

> Ryan,
>
> Why are you using DistributeLoad to go to two different
> PutElasticsearchHttp processors? Does that perform better for you than a
> single PutElasticsearchHttp processor with multiple concurrent tasks? It
> shouldn’t really. I’ve never used that processor, but if two instances of
> the processor perform significantly better than 1 instance with 2
> concurrent tasks, that’s probably worth looking into.
>
> -Mark
>
>
> On Sep 17, 2020, at 11:38 AM, Ryan Hendrickson <
> ryan.andrew.hendrick...@gmail.com> wrote:
>
> @Joe I can't export the flow.xml.gz easily, although it's pretty simple.
> We put just the following on its own server because DistributeLoad (bug
> [1]) and PutElasticsearchHttp have a hard time keeping up.
>
> 1. Input Port
> 2. ControlRate (data rate | 1.7GB | 5 min)
> 3. UpdateAttribute (Delete Attribute Regex)
> 4. JoltTransformJSON
> 5. FlattenJSONArray (custom; takes a 1-level JSON array and turns it
>    into objects)
> 6. DistributeLoad
>    1. PutElasticsearchHttp
>    2. PutElasticsearchHttp
>
> Unrelated: we're experimenting with a MergeContent + InvokeHTTP combo to
> see if that's more performant than PutElasticsearchHttp. The Elastic one
> uses an ObjectMapper, string replacements, etc. It seems to cap out
> around 2-3GB per 5 minutes.
>
> @Mark I'll check the diagnostics.
>
> @Jim definitely disk space 100% used.
>
> [1] https://issues.apache.org/jira/browse/NIFI-1121
>
> Ryan
>
> On Thu, Sep 17, 2020 at 11:33 AM Williams, Jim <jwilli...@alertlogic.com>
> wrote:
>
>> Ryan,
>>
>> Is this maybe a case of exhausting inodes on the filesystem rather
>> than exhausting the space available? If you do a ‘df -i’ on the system,
>> what do you see for inode usage?
>>
>> Warm regards,
>>
>> *Jim Williams* | Manager, Site Reliability Engineering
>>
>> *From:* Joe Witt <joe.w...@gmail.com>
>> *Sent:* Thursday, September 17, 2020 10:19 AM
>> *To:* users@nifi.apache.org
>> *Subject:* Re: Content Claims Filling Disk - Best practice for small
>> files?
>>
>> Can you share your flow.xml.gz?
>>
>> On Thu, Sep 17, 2020 at 8:08 AM Ryan Hendrickson <
>> ryan.andrew.hendrick...@gmail.com> wrote:
>>
>> 1.12.0
>>
>> Thanks,
>> Ryan
>>
>> On Thu, Sep 17, 2020 at 11:04 AM Joe Witt <joe.w...@gmail.com> wrote:
>>
>> Ryan
>>
>> What version are you using? I do think we had an issue that kept items
>> around longer than intended that has been addressed.
>>
>> Thanks
>>
>> On Thu, Sep 17, 2020 at 7:58 AM Ryan Hendrickson <
>> ryan.andrew.hendrick...@gmail.com> wrote:
>>
>> Hello,
>>
>> I've got ~15 million FlowFiles, each roughly 4KB, totaling about 55GB
>> of data on my canvas.
>>
>> However, the content repository (on its own partition) is
>> completely full with 350GB of data. I'm pretty certain the way Content
>> Claims store the data is responsible for this. In previous experience,
>> we've had files that are larger, and haven't seen this as much.
>>
>> My guess is that as data was streaming through and being added to a
>> claim, the claim isn't always released as the small files leave the canvas.
>>
>> We've run into this issue enough times that I figure there's probably a
>> "best practice for small files" for the content claims settings.
>>
>> These are our current settings:
>>
>> nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
>> nifi.content.claim.max.appendable.size=1 MB
>> nifi.content.claim.max.flow.files=100
>> nifi.content.repository.directory.default=/var/nifi/repositories/content
>> nifi.content.repository.archive.max.retention.period=12 hours
>> nifi.content.repository.archive.max.usage.percentage=50%
>> nifi.content.repository.archive.enabled=true
>> nifi.content.repository.always.sync=false
>>
>> https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#content-repository
>>
>> There are 1024 folders on the disk (0-1023) for the Content Claims.
>>
>> The files inside the folders are each roughly 2MB to 8MB (which is odd,
>> because I thought the max appendable size would make them no larger than
>> 1MB).
>>
>> Is there a way to expand the number of folders and/or reduce the number
>> of individual FlowFiles that are stored in the claims?
>>
>> I'm hoping there might be a best practice out there though.
>>
>> Thanks,
>> Ryan
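
The guess in the original message (a claim file is not released while any small FlowFile written into it is still queued) can be explored with a rough model. The sketch below is a simplified illustration in Python, not NiFi's actual implementation: the function name `simulate_claim_usage`, the `surviving_fraction` parameter, and the rule that a claim closes once it reaches the FlowFile-count or byte limit are all assumptions made for the example. The two limits mirror `nifi.content.claim.max.flow.files=100` and `nifi.content.claim.max.appendable.size=1 MB` from the settings quoted above.

```python
import random


def simulate_claim_usage(num_flowfiles, flowfile_size, max_flowfiles_per_claim,
                         max_claim_bytes, surviving_fraction, seed=0):
    """Return (live_bytes, pinned_bytes) for a simplified claim model.

    Assumptions (hypothetical, for illustration only):
    - FlowFiles are appended to the current claim until either limit is hit.
    - Each FlowFile independently stays queued with probability
      `surviving_fraction`; the rest have left the flow.
    - A claim's bytes stay on disk if at least one of its FlowFiles is queued.
    """
    rng = random.Random(seed)
    live_bytes = 0            # bytes belonging to FlowFiles still queued
    pinned_bytes = 0          # bytes held on disk by claims with a survivor
    claim_bytes = 0
    claim_count = 0
    claim_has_survivor = False

    for _ in range(num_flowfiles):
        claim_full = (claim_count >= max_flowfiles_per_claim
                      or claim_bytes + flowfile_size > max_claim_bytes)
        if claim_full:
            # Close the current claim and start a new one.
            if claim_has_survivor:
                pinned_bytes += claim_bytes
            claim_bytes, claim_count, claim_has_survivor = 0, 0, False

        claim_bytes += flowfile_size
        claim_count += 1
        if rng.random() < surviving_fraction:
            live_bytes += flowfile_size
            claim_has_survivor = True

    # Account for the last, possibly partial, claim.
    if claim_has_survivor:
        pinned_bytes += claim_bytes
    return live_bytes, pinned_bytes


# Example: 100,000 FlowFiles of 4 KB, 5% of them still queued.
live, pinned = simulate_claim_usage(
    num_flowfiles=100_000,
    flowfile_size=4 * 1024,
    max_flowfiles_per_claim=100,
    max_claim_bytes=1024 * 1024,
    surviving_fraction=0.05,
)
print(f"live: {live / 1e6:.1f} MB, pinned on disk: {pinned / 1e6:.1f} MB")
```

Under these assumptions, even a small fraction of queued FlowFiles spread evenly across claims keeps most claim files on disk, so disk usage can be several times the size of the queued data. Whether this matches the real behaviour depends on how NiFi actually closes and reclaims claims, which this sketch does not attempt to reproduce.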