Re: Content Claims Filling Disk - Best practice for small files?

Ryan Hendrickson Thu, 17 Sep 2020 11:30:28 -0700

Correction - it did work.  I was expecting the file to be written to the
folder I ran nifi.sh from, rather than to NIFI_HOME.


Reviewing it now...

Ryan

On Thu, Sep 17, 2020 at 1:51 PM Ryan Hendrickson <
ryan.andrew.hendrick...@gmail.com> wrote:

> Hey Mark,
> I should have mentioned the PutElasticsearchHttp is going to 2 different
> clusters.  We did play with different thread counts for each of them.  At
> one point we were wondering if too large a Batch Size would make the threads
> block each other.
>
> It looks like PutElasticsearchHttp parses every FlowFile to verify
> it's a well-formed JSON document [1].  That alone feels pretty CPU
> expensive.  In our case, we already know we have valid JSON.  Just as an
> anecdotal benchmark: a combination of [MergeContent + 2x InvokeHTTP] uses
> a total of 9 threads to accomplish the same thing that [2x DistributeLoad +
> 2x PutElasticsearchHttp] does with 50 threads.  Each DistributeLoad needs 5
> threads to keep up.  Each PutElasticsearchHttp needs about 10.
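
For reference, a rough sketch of what the MergeContent + InvokeHTTP route amounts to: joining already-valid JSON documents into an Elasticsearch `_bulk` request body (one action line followed by one source line per document) without parsing each document. This is Python for illustration only, not the processor's code; `build_bulk_body` and the index name are made-up names, and it assumes every document is already a single-line JSON object.

```python
import json


def build_bulk_body(docs, index):
    """Join pre-validated, single-line JSON documents into a _bulk body.

    The documents are never parsed here (assumption: upstream already
    guarantees they are valid JSON).
    """
    # Only the small action line is serialized; it is the same for every doc.
    action = json.dumps({"index": {"_index": index}}, separators=(",", ":"))
    lines = []
    for doc in docs:
        doc = doc.strip()
        if "\n" in doc:
            raise ValueError("each document must fit on one line")
        lines.append(action)
        lines.append(doc)
    if not lines:
        return ""
    # The bulk format is newline-delimited and must end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_bulk_body(['{"a":1}', '{"a":2}'], "example-index"), end="")
```

The trade-off is that a malformed document is only discovered when Elasticsearch rejects it, rather than before the request is sent.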
>
> PutElasticsearchHttp is configured like this:
> Index: ${esIndex}
> Batch Size: 3000
> Index Operation: Index
>
> For the ./nifi.sh diagnostics --verbose diagnostics1.txt, I had to export
> TOOLS_JAR on the command line to the path where tools.jar was located.
>
> I'm not getting a file written out though.  I still have the "full" NiFi
> up and running.  I assume that's how it should be?  Do I need to change my
> logback.xml levels at all?
>
>
> [1]
> https://github.com/apache/nifi/blob/aa741cc5967f62c3c38c2a47e712b7faa6fe19ff/nifi-nar-bundles/nifi-elasticsearch-bundle/nifi-elasticsearch-processors/src/main/java/org/apache/nifi/processors/elasticsearch/PutElasticsearchHttp.java#L299
>
> Thanks,
> Ryan
>
> On Thu, Sep 17, 2020 at 11:43 AM Mark Payne <marka...@hotmail.com> wrote:
>
>> Ryan,
>>
>> Why are you using DistributeLoad to go to two different
>> PutElasticsearchHttp processors? Does that perform better for you than a
>> single PutElasticsearchHttp processor with multiple concurrent tasks? It
>> shouldn’t really. I’ve never used that processor, but if two instances of
>> the processor perform significantly better than 1 instance with 2
>> concurrent tasks, that’s probably worth looking into.
>>
>> -Mark
>>
>>
>> On Sep 17, 2020, at 11:38 AM, Ryan Hendrickson <
>> ryan.andrew.hendrick...@gmail.com> wrote:
>>
>> @Joe I can't export the flow.xml.gz easily, although it's pretty simple.
>> We put just the following on its own server because DistributeLoad (bug
>> [1]) and PutElasticsearchHttp have a hard time keeping up.
>>
>>    1. Input Port
>>    2. ControlRate (data rate | 1.7GB | 5 min)
>>    3. Update Attributes (Delete Attribute Regex)
>>    4. JoltTransformJSON
>>    5. FlattenJSONArray (Custom.. takes a 1 level JSON Array and turns it
>>    into Objects)
>>    6. DistributeLoad
>>       1. PutElasticsearchHttp
>>       2. PutElasticsearchHttp
>>
>>
>> Unrelated: we're experimenting with a MergeContent + InvokeHTTP combo
>> to see if that's more performant than PutElasticsearchHttp.  The Elastic
>> one uses an ObjectMapper, string replacements, etc.  It seems to cap
>> out around 2-3GB per 5 minutes.
>>
>> @Mark I'll check the diagnostics.
>>
>> @Jim definitely disk space 100% used.
>>
>> [1] https://issues.apache.org/jira/browse/NIFI-1121
>>
>> Ryan
>>
>> On Thu, Sep 17, 2020 at 11:33 AM Williams, Jim <jwilli...@alertlogic.com>
>> wrote:
>>
>>> Ryan,
>>>
>>>
>>>
>>> Is this maybe a case of exhausting inodes on the filesystem rather
>>> than exhausting the space available?  If you do a ‘df -i’ on the system
>>> what do you see for inode usage?
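
A minimal sketch of the same check from Python, for anyone who would rather script it: `os.statvfs` (Unix only) exposes both block and inode counters for the filesystem that holds a path. The helper names `percent_used` and `usage` are made up for this example, and the figures may differ slightly from `df`, which accounts for reserved blocks.

```python
import os
import sys


def percent_used(total, free):
    """Return the percentage used, or None when the total is reported as 0."""
    if total <= 0:
        return None
    return 100.0 * (total - free) / total


def usage(path):
    """Report block and inode usage for the filesystem holding `path`."""
    st = os.statvfs(path)
    return {
        "space_used_pct": percent_used(st.f_blocks, st.f_bfree),
        "inodes_used_pct": percent_used(st.f_files, st.f_ffree),
    }


if __name__ == "__main__":
    # Pass the content repository directory as the first argument;
    # defaults to the current directory.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    print(usage(target))
```

If space is near 100% while inodes are not, the disk is full of data rather than out of inodes; some filesystems report no inode total at all, in which case the helper returns None.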
>>>
>>>
>>>
>>> Warm regards,
>>>
>>> *Jim Williams* | Manager, Site Reliability Engineering
>>>
>>> *From:* Joe Witt <joe.w...@gmail.com>
>>> *Sent:* Thursday, September 17, 2020 10:19 AM
>>> *To:* users@nifi.apache.org
>>> *Subject:* Re: Content Claims Filling Disk - Best practice for small
>>> files?
>>>
>>>
>>>
>>> can you share your flow.xml.gz?
>>>
>>>
>>>
>>> On Thu, Sep 17, 2020 at 8:08 AM Ryan Hendrickson <
>>> ryan.andrew.hendrick...@gmail.com> wrote:
>>>
>>> 1.12.0
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Ryan
>>>
>>>
>>>
>>> On Thu, Sep 17, 2020 at 11:04 AM Joe Witt <joe.w...@gmail.com> wrote:
>>>
>>> Ryan
>>>
>>>
>>>
>>> What version are you using? I do think we had an issue that kept items
>>> around longer than intended that has been addressed.
>>>
>>>
>>>
>>> Thanks
>>>
>>>
>>>
>>> On Thu, Sep 17, 2020 at 7:58 AM Ryan Hendrickson <
>>> ryan.andrew.hendrick...@gmail.com> wrote:
>>>
>>> Hello,
>>>
>>> I've got ~15 million FlowFiles, each roughly 4KB, totaling about 55GB
>>> of data on my canvas.
>>>
>>>
>>>
>>> However, the content repository (on its own partition) is
>>> completely full with 350GB of data.  I'm pretty certain the way Content
>>> Claims store the data is responsible for this.  In previous experience,
>>> we've had files that are larger, and haven't seen this as much.
>>>
>>>
>>>
>>> My guess is that as data was streaming through and being added to a
>>> claim, it isn't always released as the small files leave the canvas.
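
To make that guess concrete, here is a toy model in Python. It is a deliberate simplification and not NiFi's implementation: it assumes FlowFiles are packed into a claim until a size limit is reached, and that a claim file can only be removed once none of its FlowFiles are still live. The function names and the 10,000-file count are invented for illustration.

```python
def pack_into_claims(num_files, file_size, max_claim_size):
    """Assign FlowFile ids to claims, starting a new claim once one is full."""
    claims, current, used = [], [], 0
    for file_id in range(num_files):
        if current and used + file_size > max_claim_size:
            claims.append(current)
            current, used = [], 0
        current.append(file_id)
        used += file_size
    if current:
        claims.append(current)
    return claims


def disk_usage(claims, live_ids, file_size):
    """Return (live_bytes, bytes_on_disk); a claim stays while any member is live."""
    live = set(live_ids)
    live_bytes = len(live) * file_size
    on_disk = sum(len(c) * file_size for c in claims if live.intersection(c))
    return live_bytes, on_disk


if __name__ == "__main__":
    file_size = 4 * 1024       # ~4KB FlowFiles, as in this post
    max_claim = 1024 * 1024    # 1 MB, the max.appendable.size in this post
    claims = pack_into_claims(10_000, file_size, max_claim)
    # Hypothetical worst case: exactly one FlowFile per claim is still queued.
    survivors = [claim[0] for claim in claims]
    live_bytes, on_disk = disk_usage(claims, survivors, file_size)
    print(f"live={live_bytes} on_disk={on_disk} ratio={on_disk / live_bytes:.0f}x")
```

Under these assumptions, the bytes held on disk depend on how many claims still have at least one live FlowFile, not on how much live data there is, which is one way the repository could grow well past the size of the data on the canvas.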
>>>
>>>
>>>
>>> We've run into this issue enough times that I figure there's probably a
>>> "best practice for small files" for the content claims settings.
>>>
>>>
>>>
>>> These are our current settings:
>>>
>>>
>>> nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
>>>
>>> nifi.content.claim.max.appendable.size=1 MB
>>>
>>> nifi.content.claim.max.flow.files=100
>>>
>>> nifi.content.repository.directory.default=/var/nifi/repositories/content
>>>
>>> nifi.content.repository.archive.max.retention.period=12 hours
>>>
>>> nifi.content.repository.archive.max.usage.percentage=50%
>>>
>>> nifi.content.repository.archive.enabled=true
>>>
>>> nifi.content.repository.always.sync=false
>>>
>>>
>>>
>>>
>>> https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#content-repository
>>>
>>>
>>>
>>>
>>> There are 1024 folders on the disk (0-1023) for the Content Claims.
>>>
>>> Each file inside the folders is roughly 2 MB to 8 MB (which is odd
>>> because I thought the max appendable size would make this no larger than
>>> 1 MB).
>>>
>>>
>>>
>>> Is there a way to expand the number of folders and/or reduce the number
>>> of individual FlowFiles that are stored in the claims?
>>>
>>>
>>>
>>> I'm hoping there might be a best practice out there though.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Ryan
>>>
>>>
>>>
>>>
>>
>>
