Correction: it did work.  I was expecting the file to be written to the
folder I ran the command from, not to NIFI_HOME.

Reviewing it now...


On Thu, Sep 17, 2020 at 1:51 PM Ryan Hendrickson wrote:

> Hey Mark,
> I should have mentioned that the PutElasticsearchHttp processors are going
> to 2 different clusters.  We did play with different thread counts for each
> of them.  At one point we were wondering if too large a Batch Size would
> make the threads block each other.
> It looks like PutElasticsearchHttp parses every FlowFile to verify
> it's a well-formed JSON document [1].  That alone feels pretty CPU
> expensive.  In our case, we already know we have valid JSON.  Just as an
> anecdotal benchmark: a combination of [MergeContent + 2x InvokeHTTP] uses
> a total of 9 threads to accomplish the same thing that [2x DistributeLoad +
> 2x PutElasticsearchHttp] does with 50 threads.  The DistributeLoad
> processors need 5 threads each to keep up.  PutElasticsearchHttp needs
> about 10 each.
> PutElasticsearchHttp is configured like this:
> Index: ${esIndex}
> Batch Size: 3000
> Index Operation: Index
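
As a rough illustration of why skipping the per-document parse can be cheaper, here is a minimal Python sketch of building an Elasticsearch bulk request body by plain string concatenation. It is not how PutElasticsearchHttp or the MergeContent + InvokeHTTP flow is implemented. The function name `build_bulk_body` and the index name `my-index` are hypothetical, and the sketch assumes every document is already a well-formed, single-line JSON object.

```python
import json


def build_bulk_body(index_name, json_lines):
    """Build a newline-delimited bulk body from pre-validated JSON strings.

    Assumption: each entry in json_lines is already well-formed JSON on a
    single line, so the documents are appended as-is and never parsed.
    """
    # The action line is identical for every document, so serialize it once.
    action = json.dumps({"index": {"_index": index_name}})
    parts = []
    for doc in json_lines:
        parts.append(action)  # action/metadata line
        parts.append(doc)     # document source line, passed through untouched
    # The bulk format is newline-delimited and must end with a newline.
    return "\n".join(parts) + "\n"


docs = ['{"id": 1, "msg": "a"}', '{"id": 2, "msg": "b"}']
body = build_bulk_body("my-index", docs)
print(body)
```

The trade-off is that a malformed document is only detected by Elasticsearch when the request is processed, not before it is sent.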
> For the ./ diagnostics --verbose diagnostics1.txt, I had to export
> TOOLS_JAR on the command line to the path where tools.jar was located.
> I'm not getting a file written out though.  I still have the "full" NiFi
> up and running, which I assume is how it should be?  Do I need to change
> my logback.xml levels at all?
> [1]
> Thanks,
> Ryan
> On Thu, Sep 17, 2020 at 11:43 AM Mark Payne wrote:
>> Ryan,
>> Why are you using DistributeLoad to go to two different
>> PutElasticsearchHttp processors? Does that perform better for you than a
>> single PutElasticsearchHttp processor with multiple concurrent tasks? It
>> shouldn’t really. I’ve never used that processor, but if two instances of
>> the processor perform significantly better than 1 instance with 2
>> concurrent tasks, that’s probably worth looking into.
>> -Mark
>> On Sep 17, 2020, at 11:38 AM, Ryan Hendrickson wrote:
>> @Joe I can't export the flow.xml.gz easily, although it's pretty simple.
>> We put just the following on its own server because DistributeLoad (bug
>> [1]) and PutElasticsearchHttp have a hard time keeping up.
>>    1. Input Port
>>    2. ControlRate (data rate | 1.7GB | 5 min)
>>    3. Update Attributes (Delete Attribute Regex)
>>    4. JoltTransformJSON
>>    5. FlattenJSONArray (Custom.. takes a 1 level JSON Array and turns it
>>    into Objects)
>>    6. DistributeLoad
>>       1. PutElasticsearchHttp
>>       2. PutElasticsearchHttp
>> Unrelated: we're experimenting with a MergeContent + InvokeHTTP combo
>> to see if that's more performant than PutElasticsearchHttp.  The
>> Elasticsearch processor uses an ObjectMapper, string replacements, etc.
>> It seems to cap out around 2-3 GB per 5 minutes.
>> @Mark I'll check the diagnostics.
>> @Jim definitely disk space 100% used.
>> [1]
>> Ryan
>> On Thu, Sep 17, 2020 at 11:33 AM Williams, Jim wrote:
>>> Ryan,
>>> Is this maybe a case of exhausting inodes on the filesystem rather
>>> than exhausting the space available?  If you do a ‘df -i’ on the system
>>> what do you see for inode usage?
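
For anyone who wants to run the inode check suggested above from a script, here is a minimal Python sketch using `os.statvfs` (Unix only). The helper name `inode_usage` is hypothetical, and the path `/` is a placeholder; point it at the partition that holds the content repository.

```python
import os


def inode_usage(path):
    """Return (total, used, percent_used) inodes for the filesystem holding path."""
    st = os.statvfs(path)
    total = st.f_files            # total inodes on the filesystem
    used = total - st.f_ffree     # inodes that are not free
    # Some filesystems report 0 total inodes because they allocate them
    # dynamically; avoid dividing by zero in that case.
    percent = (100.0 * used / total) if total else 0.0
    return total, used, percent


# Placeholder path: replace with the content repository mount point.
total, used, percent = inode_usage("/")
print(f"inodes: {used}/{total} used ({percent:.1f}%)")
```

If the percentage is close to 100 while block usage is lower, inode exhaustion would be the more likely cause.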
>>> Warm regards,
>>> *Jim Williams* | Manager, Site Reliability Engineering
>>> *From:* Joe Witt
>>> *Sent:* Thursday, September 17, 2020 10:19 AM
>>> *Subject:* Re: Content Claims Filling Disk - Best practice for small
>>> files?
>>> can you share your flow.xml.gz?
>>> On Thu, Sep 17, 2020 at 8:08 AM Ryan Hendrickson wrote:
>>> 1.12.0
>>> Thanks,
>>> Ryan
>>> On Thu, Sep 17, 2020 at 11:04 AM Joe Witt wrote:
>>> Ryan
>>> What version are you using? I do think we had an issue that kept items
>>> around longer than intended that has been addressed.
>>> Thanks
>>> On Thu, Sep 17, 2020 at 7:58 AM Ryan Hendrickson wrote:
>>> Hello,
>>> I've got ~15 million FlowFiles, each roughly 4KB, totaling about 55GB
>>> of data on my canvas.
>>> However, the content repository (on its own partition) is
>>> completely full with 350GB of data.  I'm pretty certain the way Content
>>> Claims store the data is responsible for this.  In previous experience,
>>> we've had files that are larger, and haven't seen this as much.
>>> My guess is that as data was streaming through and being added to a
>>> claim, it isn't always released as the small files leave the canvas.
>>> We've run into this issue enough times that I figure there's probably a
>>> "best practice for small files" for the content claims settings.
>>> These are our current settings:
>>> nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
>>> nifi.content.claim.max.appendable.size=1 MB
>>> nifi.content.claim.max.flow.files=100
>>> nifi.content.repository.archive.max.retention.period=12 hours
>>> nifi.content.repository.archive.max.usage.percentage=50%
>>> nifi.content.repository.archive.enabled=true
>>> nifi.content.repository.always.sync=false
>>> There are 1024 folders on the disk (0-1023) for the Content Claims.
>>> Each file inside the folders is roughly 2 MB to 8 MB (which is odd
>>> because I thought the max appendable size would make this no larger than
>>> 1 MB).
>>> Is there a way to expand the number of folders and/or reduce the number
>>> of individual FlowFiles that are stored in the claims?
>>> I'm hoping there might be a best practice out there though.
>>> Thanks,
>>> Ryan
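
To put the numbers from the original message in perspective, here is a back-of-the-envelope Python sketch. The FlowFile count, FlowFile size, and disk usage come from the message. The 4 MB file size is an assumption picked from the reported 2 MB to 8 MB range. The sketch also assumes, as the message guesses, that a file on disk cannot be reclaimed while any queued FlowFile still references part of it.

```python
KB = 1024
MB = 1024 ** 2
GB = 1024 ** 3

flowfile_count = 15_000_000   # from the message (approximate)
flowfile_size = 4 * KB        # from the message (approximate)
disk_used = 350 * GB          # content repository usage from the message
claim_file_size = 4 * MB      # assumption: within the observed 2-8 MB range

# Bytes actually referenced by FlowFiles on the canvas.
live_bytes = flowfile_count * flowfile_size

# How much larger the on-disk footprint is than the live data.
amplification = disk_used / live_bytes

# Number of on-disk files needed to fill the partition.
files_on_disk = disk_used // claim_file_size

# How many 4 KB FlowFiles fit in one such file.
flowfiles_per_file = claim_file_size // flowfile_size

# Worst case: one lingering FlowFile per file is enough to keep it on disk.
pinning_fraction = files_on_disk / flowfile_count

print(f"live data:          {live_bytes / GB:.1f} GB")
print(f"amplification:      {amplification:.1f}x")
print(f"files on disk:      {files_on_disk}")
print(f"FlowFiles per file: {flowfiles_per_file}")
print(f"pinning fraction:   {pinning_fraction:.2%}")
```

Under these assumptions the disk holds roughly six times the live data, and well under 1% of the queued FlowFiles would be enough to keep every file on disk, provided they are spread one per file.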
