We are experiencing this issue as well. We just upgraded from NiFi 1.11.4 to 1.13.2, and are running into this issue where many of our high-usage NiFi instances are just hanging. For example, we have a 7-node cluster that has flowfiles stuck in queues and not moving. We noticed that on 3 of those nodes, the flowfile content storage was over 50%, and those are the nodes that have flowfiles stuck in the queue. The other nodes have nothing on them. No new data is flowing into the cluster at all, and nothing is moving on any of the nodes. We see this problem also on non-cluster machines; the cluster just makes it more obvious that this archive max usage percentage might be the cause.
We have a lot of MergeContent processors. We realize that there were a lot of I/O improvements in the newer versions of NiFi - Joe, we suspect these efficiencies might be exacerbating the problem:
NiFi 1.13.1 - [full list]
   
   - [NIFI-7646] - Improve performance of MergeContent / others that read 
content of many small FlowFiles   

   - [NIFI-8222] - When processing a lot of small FlowFiles, Provenance Repo 
spends most of its time in lock contention. That can be improved.

NiFi 1.14.0 - [full list]
   
   - [NIFI-8633] - Content Repository can be improved to make fewer disk accesses on read.
   
   - Mark Payne's notes: "For those interested in the actual performance 
numbers here, I ran a pretty simple flow that generated a lot of tiny JSON 
messages, and then used ConvertRecord to convert from JSON to Avro. Ran a 
profiler against it and found that about 50% of the time for ConvertRecord was 
spent in FileSystemRepository.read(). This is called twice - once when we read 
the data for inferring schema, a second time when we parse the data.   
   
Of the time spent in FileSystemRepository.read(), about 50% of that time was 
spent in Files.exists(). So this should improve performance of that flow by 
something like 25%"
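
For anyone curious what that kind of fix looks like in the abstract, here is a minimal Java sketch of the general pattern - an illustration on our part, not NiFi's actual FileSystemRepository code. The point is that a Files.exists() pre-check costs a separate filesystem metadata call on every read, which adds up with millions of tiny FlowFiles:

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.NoSuchFileException;
    import java.nio.file.Path;

    public class ExistsCheckSketch {

        // Before: an explicit existence check costs an extra stat() on
        // every single read, before the file is even opened.
        static InputStream openWithPreCheck(Path path) throws IOException {
            if (!Files.exists(path)) {
                throw new NoSuchFileException(path.toString());
            }
            return Files.newInputStream(path);
        }

        // After: open the file directly and let the OS report a missing
        // file via the exception; one filesystem operation instead of two.
        static InputStream openDirectly(Path path) throws IOException {
            try {
                return Files.newInputStream(path);
            } catch (NoSuchFileException nsfe) {
                // Recover here (e.g. fall back to another location)
                // instead of pre-checking existence on the hot path.
                throw nsfe;
            }
        }
    }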

We didn't know about the ...archive.backpressure.percentage property - we don't see it in the Admin Guide (https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html). We will set it considerably higher than its default of 2% above the max usage percentage and see how it goes. Now that we think about it, we believe we've experienced this problem occasionally before the upgrade, but it has become very frequent since the upgrade.
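
Concretely, we're planning to try something like the following in nifi.properties (illustrative values only; the 80% figure is just Mark's suggestion from below, and the right number will depend on the disk):

    # nifi.properties - illustrative values, not a recommendation
    # Archive cleanup starts once the content repo partition passes 50%:
    nifi.content.repository.archive.max.usage.percentage=50%
    # Back-pressure (blocking writes) now waits until 80% instead of the
    # default of max usage + 2% (i.e. 52%), giving cleanup room to catch up:
    nifi.content.repository.archive.backpressure.percentage=80%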

-Elli
    On Monday, May 3, 2021, 01:09:47 PM EDT, Shawn Weeks 
<[email protected]> wrote:  
 
 Sorry, I wasn't saying that 'nifi.content.repository.archive.max.usage.percentage' was new; I just hadn't managed to get a NiFi instance stuck this way before, and even the documentation says that if the archive is empty and the content repo needs more room, archiving will be disabled. I'm having trouble finding where 'nifi.content.repository.archive.backpressure.percentage' is documented.

Thanks

-----Original Message-----
From: Mark Payne <[email protected]> 
Sent: Monday, May 3, 2021 12:00 PM
To: [email protected]
Subject: Re: NiFi Get's Stuck Waiting On Non Existent Archive Cleanup

Shawn,

There are a couple of properties at play. The “nifi.content.repository.archive.max.usage.percentage” property behaves as you have described. But there’s also a second property: nifi.content.repository.archive.backpressure.percentage

This controls at what point the Content Repository will actually apply back-pressure in order to avoid filling the disk. It defaults to 2% more than the max.usage.percentage, so by default the two are 50% and 52%.

You can adjust the backpressure percentage to something much higher, like 80%. Then, if you reach 50%, NiFi starts clearing things out, and if you reach 80%, it starts applying the brakes. This is a safeguard: we’ve had data flows that produce data much faster than it can be archived/deleted, which is common for flows that create huge numbers of files in the content repository. The back-pressure is there to ensure that archive cleanup has a chance to run.
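
As a rough, self-contained sketch of how those two thresholds interact (illustrative only - this is not the actual FileSystemRepository code):

    import java.io.IOException;
    import java.nio.file.FileStore;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class ThresholdSketch {
        static final double MAX_USAGE_PCT = 50.0;    // archive.max.usage.percentage
        static final double BACKPRESSURE_PCT = 52.0; // defaults to max usage + 2

        public static void main(String[] args) throws IOException {
            // Point this at the content_repository partition to check.
            FileStore store = Files.getFileStore(Path.of(args[0]));
            double usedPct = 100.0
                    * (store.getTotalSpace() - store.getUsableSpace())
                    / store.getTotalSpace();

            if (usedPct >= MAX_USAGE_PCT) {
                System.out.println("Past max usage: archive cleanup kicks in.");
            }
            if (usedPct >= BACKPRESSURE_PCT) {
                System.out.println("Past backpressure: writes block until cleanup frees space.");
            }
        }
    }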

This has always been here, though, ever since the initial open sourcing. It’s not something new. It may be the case that in later versions we have become more efficient at creating the data, such that it now exceeds the rate at which cleanup can happen - not sure. But adjusting the “nifi.content.repository.archive.max.usage.percentage” property should get you into a better state.

Thanks
-Mark


> From: Shawn Weeks <[email protected]>
> Date: Mon, May 3, 2021 at 9:33 AM
> Subject: RE: NiFi Get's Stuck Waiting On Non Existent Archive Cleanup
> To: [email protected] <[email protected]>
> 
> 
> Note I have a 2-node cluster, which is why it’s sitting at around 900 GB
> total. The per-node content repo is currently at 535 GB, and I’m not sure
> where the rest of the space is. I have 472 GB free on each node in the
> content_repository partition, as shown in the Cluster panel.
> 
>  
> 
> Thanks
> 
> Shawn Weeks
> 
>  
> 
> From: Shawn Weeks 
> Sent: Monday, May 3, 2021 11:30 AM
> To: [email protected]
> Subject: NiFi Get's Stuck Waiting On Non Existent Archive Cleanup
> 
>  
> 
> I’m not sure if this is specific to clustering or not, but with the default
> configuration of 50% content archiving it is possible to make NiFi quit
> processing any data simply by filling up a queue with 50% of your
> content_repository storage. In my example my content_repository is 1 TB, and
> once a queue gets to 500 GB or so the next processor won’t process any more
> data. Once this occurs, even stopping GenerateFlowFile won’t fix the problem,
> and my CompressContent never does anything. It’s my understanding that
> “nifi.content.repository.archive.max.usage.percentage” only sets the maximum
> amount of space that archives will use and should never prevent new content
> from being written; in 1.13.2 it appears to be functioning as a reserve
> instead. I haven’t seen this in older versions of NiFi like 1.9.2, and I’m
> not sure when the behavior changed, but even the documentation seems to
> indicate that this should not be happening. For example: ‘If the archive is
> empty and content repository disk usage is above this percentage, then
> archiving is temporarily disabled.’
> 
>  
> 
> <image001.png>
> 
>  
> 
> <image002.png>
> 
>  
> 
> Thanks
> 
> Shawn Weeks
> 

  
