Out of curiosity - I'm seeing the metric "nifi_amount_threads_active" at
the max of 40 for processors that are disabled. Does that make sense? That
seems very odd to me since those processors shouldn't be doing anything at all.
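
In case it helps anyone reproduce what I'm seeing, this is roughly how I'm
pulling that metric out - just a sketch, and it assumes the Prometheus-style
metrics endpoint is reachable at http://localhost:9092/metrics, which is a
placeholder for however your reporting task / scrape config is actually set up:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Dump every sample of nifi_amount_threads_active so the component
    // labels are visible. The URL below is a placeholder for wherever
    // your NiFi metrics endpoint actually lives.
    public class ThreadMetricDump {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:9092/metrics"))
                    .GET()
                    .build();
            String body = client.send(request,
                    HttpResponse.BodyHandlers.ofString()).body();
            body.lines()
                .filter(line -> line.startsWith("nifi_amount_threads_active"))
                .forEach(System.out::println);
        }
    }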

On Mon, Jan 15, 2024 at 12:01 PM Aaron Rich <aaron.r...@gmail.com> wrote:

> Yeah - that gets the performance to where we need it.
>
> But the question I have is why did the performance drop in the first
> place. Everything was working fine, and then it suddenly dropped. I'm
> having to adjust nifi parameters to try to get back to where performance
> was but I can't find what is pulling the performance down in the first
> place.
>
> If there are any other suggestions, please let me know.
>
> Thanks.
>
> -Aaron
>
> On Mon, Jan 15, 2024 at 10:42 AM Mark Payne <marka...@hotmail.com> wrote:
>
>> Aaron,
>>
>> It doesn’t sound like you’re back to the drawing board at all - sounds
>> like you have the solution in hand. Just increase the size of your Timer
>> Driven Thread Pool and leave it there.
>>
>> Thanks
>> -Mark
>>
>>
>> On Jan 15, 2024, at 11:16 AM, Aaron Rich <aaron.r...@gmail.com> wrote:
>>
>> @Mark - thanks for that note. I hadn't tried restarting. When I did that,
>> the performance dropped back down. So I'm back to the drawing board.
>>
>> @Phillip - I didn't have any other services/components/dataflows going.
>> It was just those 2 processors going (I tried to remove every variable I
>> could to make it as controlled as possible). And during the week I ran that
>> test, there wasn't any slowdown at all. Even when I turned on the rest of
>> the dataflows (~2500 components total) everything was performing
>> as expected. There is very, very little variability in data volumes so I
>> don't have any reason to believe that is the cause of the slowdown.
>>
>> I'm going to try to see what kind of NiFi diagnostics and such I can
>> get.
>>
>> Is there anywhere that explains the output of nifi.sh dump and
>> nifi.sh diagnostics?
>>
>> Thanks all for the help.
>>
>> -Aaron
>>
>> On Fri, Jan 12, 2024 at 11:45 AM Phillip Lord <phillord0...@gmail.com>
>> wrote:
>>
>>> Ditto...
>>>
>>> @Aaron... so outside of the GenerateFlowFile -> PutFile, were there
>>> additional components/dataflows handling data at the same time as the
>>> "stress-test"?  These all share the same thread pool.  So depending
>>> upon your dataflow footprint and any variability in data volumes,
>>> 20 timer-driven threads could be exhausted pretty quickly.  This might
>>> cause not only your "stress-test" to slow down but your other flows as
>>> well, since components might be left waiting for available threads to do
>>> their jobs.
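>>>
>>> Just to illustrate the idea with a toy example (plain Java, nothing
>>> NiFi-specific, and the numbers are made up):
>>>
>>>     import java.util.concurrent.ExecutorService;
>>>     import java.util.concurrent.Executors;
>>>     import java.util.concurrent.TimeUnit;
>>>
>>>     public class SharedPoolDemo {
>>>         public static void main(String[] args) throws Exception {
>>>             // One pool shared by every "component", sized at 20.
>>>             ExecutorService pool = Executors.newFixedThreadPool(20);
>>>
>>>             // 200 components all wanting a thread: only 20 run at once,
>>>             // everything else queues up and looks "slow".
>>>             for (int i = 0; i < 200; i++) {
>>>                 final int id = i;
>>>                 pool.submit(() -> {
>>>                     try {
>>>                         TimeUnit.MILLISECONDS.sleep(500); // pretend to work
>>>                     } catch (InterruptedException e) {
>>>                         Thread.currentThread().interrupt();
>>>                     }
>>>                     System.out.println("task " + id + " finished on "
>>>                             + Thread.currentThread().getName());
>>>                 });
>>>             }
>>>             pool.shutdown();
>>>             pool.awaitTermination(1, TimeUnit.MINUTES);
>>>         }
>>>     }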
>>>
>>> Thanks,
>>> Phil
>>>
>>> On Thu, Jan 11, 2024 at 3:44 PM Mark Payne <marka...@hotmail.com> wrote:
>>>
>>>> Aaron,
>>>>
>>>> Interestingly, up to version 1.21 of NiFi, if you increased the size of
>>>> the thread pool, the increase took effect immediately. But if you decreased
>>>> the size of the thread pool, the decrease didn’t take effect until you
>>>> restarted NiFi. So that’s probably why you’re seeing the behavior you are.
>>>> Even though you reset it to 10 or 20, it’s still running at 40.
>>>>
>>>> This was done due to issues with Java many years ago, where decreasing the
>>>> thread pool size caused problems.  So just recently we updated NiFi to
>>>> immediately scale down the thread pools as well.
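>>>>
>>>> If it helps to picture what's going on, a bare java.util.concurrent pool
>>>> resizes the same way - the snippet below is only an illustration, not
>>>> NiFi's actual internals:
>>>>
>>>>     import java.util.concurrent.ScheduledThreadPoolExecutor;
>>>>
>>>>     public class PoolResizeDemo {
>>>>         public static void main(String[] args) {
>>>>             // A scheduled pool is sized by its core pool size.
>>>>             ScheduledThreadPoolExecutor pool =
>>>>                     new ScheduledThreadPoolExecutor(20);
>>>>
>>>>             // Growing takes effect right away: new workers are created
>>>>             // on demand as tasks need them.
>>>>             pool.setCorePoolSize(40);
>>>>
>>>>             // Shrinking only releases the surplus workers as they go
>>>>             // idle. Older NiFi releases deferred the shrink to the next
>>>>             // restart, which is why the pool kept running at the larger
>>>>             // size.
>>>>             pool.setCorePoolSize(20);
>>>>
>>>>             pool.shutdown();
>>>>         }
>>>>     }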
>>>>
>>>> Thanks
>>>> -Mark
>>>>
>>>>
>>>> On Jan 11, 2024, at 1:35 PM, Aaron Rich <aaron.r...@gmail.com> wrote:
>>>>
>>>> So the good news is it's working now. I know what I did but I don't
>>>> know why it worked so I'm hoping others can enlighten me based on what I
>>>> did.
>>>>
>>>> TL;DR - "turn it off/turn in on" for Max Timer Driven Thread Count
>>>> fixed performance. Max Timer Driven Thread Count was set to 20. I changed
>>>> it to 30 - performance increased. I changed to more to 40 - it increased. I
>>>> moved it back to 20 - performance was still up and what it originally was
>>>> before ever slowing down.
>>>>
>>>> (this is long to give background and details)
>>>> NiFi version: 1.19.1
>>>>
>>>> NiFi was deployed into a Kubernetes cluster as a single instance - no
>>>> NiFi clustering. We set a CPU request of 4 and a limit of 8, and a memory
>>>> request of 8 with a limit of 12. The repos are all volume-mounted out to SSD.
>>>>
>>>> The original deployment was as described above and Max Timer Driven
>>>> Thread Count was set to 20. We ran a very simple data flow
>>>> (GenerateFlowFile -> PutFile) as fast as possible to try to stress things
>>>> as much as possible before starting our other data flows. That ran for a
>>>> week with no issue doing 20K/5m.
>>>> We turned on the other data flows and everything was processing as
>>>> expected, good throughput rates and things were happy.
>>>> Then, after 3 days, the throughput dropped DRAMATICALLY (an
>>>> UpdateAttribute that had been doing 11K/5m went to 350/5m). The data being
>>>> processed did not change in volume/cadence/velocity/etc.
>>>> Rancher Cluster explorer dashboards didn't show resources standing out
>>>> as limiting or constraining.
>>>> Tried restarting the workload in Kubernetes, and data flows were slow right
>>>> from the start - so there wasn't a ramp-up or any degradation over time - it
>>>> was just slow to begin with.
>>>> Tried removing all the repos/state so NiFi came up clean in case it was
>>>> the historical data that was the issue - still slow from the start.
>>>> Tried changing the node in the Kube cluster in case the node was bad - still
>>>> slow from the start.
>>>> Removed the CPU limit (allowing NiFi to potentially use all 16 cores on the
>>>> node) from the deployment to see if there was CPU throttling happening that
>>>> I wasn't able to see on the Grafana dashboards - still slow from the start.
>>>> While NiFi was running, I changed the Max Timer Driven Thread Count
>>>> from 20->30, and performance picked up. Changed it again from 30->40, and
>>>> performance picked up. I changed it from 40->10, and performance stayed up.
>>>> I changed it from 10->20, and performance stayed up at the original amount
>>>> from before the slowdown ever happened.
>>>>
>>>> So at the end of the day, the Max Timer Driven Thread Count is exactly what
>>>> it was before, but the performance changed. It's like something was "stuck".
>>>> It's very, very odd to me to see things be fine, degrade for days through
>>>> multiple environment changes and rounds of debugging, and then return to
>>>> fine when I change a parameter to a different value and back to the original
>>>> value. Effectively, I "turned it off/turned it on" with the Max Timer Driven
>>>> Thread Count value.
>>>>
>>>> My question is - what is happening under the hood when the Max Timer
>>>> Driven Thread Count is changed? What does that affect? Is there something I
>>>> could look at from Kubernetes' side potentially that would relate to that
>>>> value?
>>>>
>>>> Could an internal NiFi thread have gotten stuck, and did changing that
>>>> value rebuild the thread pool? If that is even possible, is there any way
>>>> to know what caused the thread to "get stuck" in the first place?
>>>>
>>>> Any insight would be greatly appreciated!
>>>>
>>>> Thanks so much for all the suggestions and help on this.
>>>>
>>>> -Aaron
>>>>
>>>>
>>>>
>>>> On Wed, Jan 10, 2024 at 1:54 PM Aaron Rich <aaron.r...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Joe,
>>>>>
>>>>> Nothing is load balanced- it's all basic queues.
>>>>>
>>>>> Mark,
>>>>> I'm using NiFi 1.19.1.
>>>>>
>>>>> nifi.performance.tracking.percentage sounds like exactly what I might
>>>>> need. I'll give that a shot.
>>>>>
>>>>> Richard,
>>>>> I hadn't looked at the rotating logs and/or cleared them out. I'll
>>>>> give that a shot too.
>>>>>
>>>>> Thank you all. Please keep the suggestions coming.
>>>>>
>>>>> -Aaron
>>>>>
>>>>> On Wed, Jan 10, 2024 at 1:34 PM Richard Beare <richard.be...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I had a similar-sounding issue, although not in a Kube cluster. NiFi
>>>>>> was running in a Docker container and the issue was the log rotation
>>>>>> interacting with the log file being mounted from the host. The mounted
>>>>>> log file was not deleted on rotation, meaning that once rotation was
>>>>>> triggered by log file size it would be continually triggered, because
>>>>>> the new log file was never emptied. The clue was that the content of the
>>>>>> rotated logfiles was mostly the same, with only a small number of
>>>>>> messages appended to each new one. Rotating multi-GB logs was enough to
>>>>>> destroy performance, especially if rotation was being triggered
>>>>>> frequently by debug messages.
>>>>>>
>>>>>> On Thu, Jan 11, 2024 at 7:14 AM Aaron Rich <aaron.r...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Joe,
>>>>>>>
>>>>>>> It's pretty fixed-size objects at a fixed interval - one 5 MB-ish
>>>>>>> file, which we break down into individual rows.
>>>>>>>
>>>>>>> I went so far as to create a "stress test" where I have a
>>>>>>> GenerateFlowFile (creating a fixed 100k file, in batches of 1000, every
>>>>>>> .1s) feeding right into a PutFile. I wanted to see the sustained max. It
>>>>>>> was very stable and fast for over a week of running - but now it's
>>>>>>> extremely slow. That was about as simple a data flow as I could think of
>>>>>>> to hit all the different resources (CPU, memory, ...).
>>>>>>>
>>>>>>> I was thinking too that maybe it was memory, but it's slow right at the
>>>>>>> start when starting NiFi. I would expect memory pressure to cause it to
>>>>>>> get slower over time, and the stress test showed it wasn't something
>>>>>>> that was fluctuating over time.
>>>>>>>
>>>>>>> I'm happy to build any other flows that anyone can suggest to help
>>>>>>> troubleshoot and diagnose the issue.
>>>>>>>
>>>>>>> Lars,
>>>>>>>
>>>>>>> We haven't changed it between when performance was good and now when
>>>>>>> it's slow. That is what is throwing me - nothing changed from a NiFi
>>>>>>> configuration standpoint.
>>>>>>> My guess is we are having some throttling/resource contention from
>>>>>>> our provider, but I can't determine what/where/how. The Grafana cluster
>>>>>>> dashboards I have don't indicate issues. If there are suggestions for
>>>>>>> specific cluster metrics to plot or dashboards to use, I'm happy to
>>>>>>> build them and contribute them back (I do have a dashboard for creating
>>>>>>> the "status history" plots in Grafana that I need to figure out how to
>>>>>>> share).
>>>>>>> The repos aren't full, and I even tried blowing them away just to see
>>>>>>> if that made a difference.
>>>>>>> I'm not seeing anything new in the logs that indicates an issue... but
>>>>>>> maybe I'm missing it, so I will try to look again.
>>>>>>>
>>>>>>> By chance, is there any low-level debugging/metrics/observability
>>>>>>> that would show how long things like writes to the repository disks are
>>>>>>> taking? There is a part of me that feels this could be a disk I/O
>>>>>>> resource issue, but I don't know how I can verify that it is or isn't
>>>>>>> the issue.
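>>>>>>>
>>>>>>> In the meantime I may just time raw writes against the repo mounts
>>>>>>> myself to rule the disks in or out. Something like this rough sketch -
>>>>>>> the path below is only a placeholder for wherever your repos are
>>>>>>> actually mounted:
>>>>>>>
>>>>>>>     import java.nio.ByteBuffer;
>>>>>>>     import java.nio.channels.FileChannel;
>>>>>>>     import java.nio.file.Files;
>>>>>>>     import java.nio.file.Path;
>>>>>>>     import java.nio.file.StandardOpenOption;
>>>>>>>
>>>>>>>     public class RepoWriteTimer {
>>>>>>>         public static void main(String[] args) throws Exception {
>>>>>>>             // Placeholder path - point this at the volume backing a repo.
>>>>>>>             Path dir = Path.of(args.length > 0
>>>>>>>                     ? args[0] : "/opt/nifi/content_repository");
>>>>>>>             Path probe = dir.resolve("io-probe.tmp");
>>>>>>>             // ~100k payload, like the stress-test flow files
>>>>>>>             ByteBuffer buf = ByteBuffer.wrap(new byte[100 * 1024]);
>>>>>>>
>>>>>>>             try (FileChannel ch = FileChannel.open(probe,
>>>>>>>                     StandardOpenOption.CREATE, StandardOpenOption.WRITE,
>>>>>>>                     StandardOpenOption.TRUNCATE_EXISTING)) {
>>>>>>>                 long start = System.nanoTime();
>>>>>>>                 ch.write(buf);
>>>>>>>                 ch.force(true); // fsync: measure the disk, not the page cache
>>>>>>>                 long micros = (System.nanoTime() - start) / 1_000;
>>>>>>>                 System.out.println("write+fsync took " + micros + " microseconds");
>>>>>>>             } finally {
>>>>>>>                 Files.deleteIfExists(probe);
>>>>>>>             }
>>>>>>>         }
>>>>>>>     }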
>>>>>>>
>>>>>>> Thank you all for the help and suggestions - please keep them coming
>>>>>>> as I'm grasping at straws right now.
>>>>>>>
>>>>>>> -Aaron
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jan 10, 2024 at 10:10 AM Joe Witt <joe.w...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Aaron,
>>>>>>>>
>>>>>>>> The usual suspects are memory consumption leading to high GC,
>>>>>>>> leading to lower performance over time, or back pressure in the flow,
>>>>>>>> etc.  But your description does not really fit either exactly.  Does
>>>>>>>> your flow see a mix of large objects and smaller objects?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> On Wed, Jan 10, 2024 at 10:07 AM Aaron Rich <aaron.r...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I’m running into an odd issue and hoping someone can point me in
>>>>>>>>> the right direction.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I have NiFi 1.19 deployed in a Kube cluster with all the
>>>>>>>>> repositories volume-mounted out. It was processing great, with
>>>>>>>>> processors like UpdateAttribute sending through 15K/5m and PutFile
>>>>>>>>> sending through 3K/5m.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> With nothing changing in the deployment, the performance has
>>>>>>>>> dropped to UpdateAttribute doing 350/5m and PutFile doing 200/5m.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I’m trying to determine what resource is suddenly dropping our
>>>>>>>>> performance like this. I don’t see anything on the Kube monitoring
>>>>>>>>> that stands out, and I have restarted, cleaned repos, and changed
>>>>>>>>> nodes, but nothing is helping.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I was hoping there is something from the NiFi POV that can help
>>>>>>>>> identify the limiting resource. I'm not sure if there is additional
>>>>>>>>> diagnostic/debug/etc information available beyond the node status 
>>>>>>>>> graphs.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Any help would be greatly appreciated.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -Aaron
>>>>>>>>>
>>>>>>>>
>>>>
>>
