[jira] [Commented] (NIFI-5225) Leak in RingBufferEventRepository for frequently updated flows

2018-05-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16487733#comment-16487733
 ] 

ASF GitHub Bot commented on NIFI-5225:
--

Github user asfgit closed the pull request at:

https://github.com/apache/nifi/pull/2732


> Leak in RingBufferEventRepository for frequently updated flows
> --
>
> Key: NIFI-5225
> URL: https://issues.apache.org/jira/browse/NIFI-5225
> Project: Apache NiFi
>  Issue Type: Bug
>  Components: Core Framework
> Environment: HDF-3.1.0.0
>Reporter: Frederik Petersen
>Priority: Major
>  Labels: performance
> Fix For: 1.7.0
>
>
> We use NiFi's API to change a part of our flow quite frequently. Over the
> past weeks we have noticed that the performance of web requests degrades over
> time, and we had a very hard time finding out why.
> Today I took a closer look. When using VisualVM to sample CPU, it already
> stood out that the longer the cluster was running, the more time was spent in
> 'SecondPrecisionEventContainer.generateReport()' during web requests. This
> method is already relied on a lot right after starting the cluster (for big
> flows and process groups), but the time spent in it increases (in our setup)
> the longer the cluster runs. This increases the latency of almost every web
> request. Our flow reconfiguration script (calling many NiFi API endpoints)
> went from 2 minutes to 20 minutes of run time in a few days.
>  Looking at the source code I couldn't quite figure out why the run time
> should increase over time, because the ring buffers always stay the same size
> (301 entries, covering 5 minutes).
> When sampling memory I noticed quite a lot of EventSum instances, more than
> there should have been. So I took a heap dump and analyzed it with the Memory
> Analyzer tool. The "Leak Suspects" overview gave me the final hint to what was
> wrong. It reported:
> One instance of "java.util.concurrent.ConcurrentHashMap$Node[]" loaded by
> "<system class loader>" occupies 5,649,926,328 (55.74%) bytes. The instance
> is referenced by
> org.apache.nifi.controller.repository.metrics.RingBufferEventRepository @
> 0x7f86c50cda40, loaded by "org.apache.nifi.nar.NarClassLoader @
> 0x7f86a000". The memory is accumulated in one instance of
> "java.util.concurrent.ConcurrentHashMap$Node[]" loaded by "<system class loader>".
> The issue is:
> When we remove processors, connections, or process groups from the flow, their
> data is not removed from the ConcurrentHashMap in RingBufferEventRepository.
> There is a 'purgeTransferEvents' method, but it only calls an empty 'purgeEvents'
> method on all 'SecondPrecisionEventContainer's in the map.
> This means that the map grows without bounds, and every time
> 'reportTransferEvents' is called it iterates over all (meaning more and more
> over time) entries of the map. This increases the latency of every web request
> and also occupies a huge amount of memory.
> A rough idea to fix this:
> Remove the entry for each removed component (processor, process group,
> connection, ...) using their onRemoved methods in the FlowController (a sketch
> of this idea follows below the quoted description).
> This should stop the map from growing indefinitely for any flow where
> components are removed frequently, especially when automated.
> Since this is quite urgent for us, I'll try to work on a fix for this and
> provide a pull request if successful.
> Since no one noticed this before, I guess we are not typical users of NiFi:
> we thought it was possible to heavily reconfigure flows using the API, but
> with this performance issue, it isn't.
> Please let me know if I can provide any more helpful detail for this problem.
>  
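For reference, here is a minimal, hypothetical sketch of the idea behind the fix that was eventually merged ("Purge event data from event repository when Connectable is removed"). The class and method names below are simplified stand-ins, not the actual NiFi implementation: per-component event containers live in a ConcurrentHashMap, and removing a component also removes its map entry so the map can shrink again.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified sketch of the fix idea; not the actual NiFi classes.
public class ComponentEventRepositorySketch {

    /** Stand-in for SecondPrecisionEventContainer: a fixed-size ring of counters. */
    static class EventContainer {
        private final long[] ring = new long[301]; // 301 slots, roughly 5 minutes at 1s precision

        void addEvent(int second, long count) {
            ring[second % ring.length] += count;
        }
    }

    // One container per component id; entries must be removed when components are removed.
    private final Map<String, EventContainer> containers = new ConcurrentHashMap<>();

    void recordEvent(String componentId, int second, long count) {
        containers.computeIfAbsent(componentId, id -> new EventContainer())
                  .addEvent(second, count);
    }

    /** The essence of the fix: purge a removed component's data so the map shrinks again. */
    void purgeTransferEvents(String componentId) {
        containers.remove(componentId);
    }

    int trackedComponents() {
        return containers.size();
    }

    public static void main(String[] args) {
        ComponentEventRepositorySketch repo = new ComponentEventRepositorySketch();
        repo.recordEvent("processor-1", 0, 10);
        repo.recordEvent("processor-2", 1, 20);
        System.out.println("tracked before removal: " + repo.trackedComponents()); // 2

        // Called from the flow controller's component-removal hook when processor-2 is deleted.
        repo.purgeTransferEvents("processor-2");
        System.out.println("tracked after removal: " + repo.trackedComponents());  // 1
    }
}
```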





[jira] [Commented] (NIFI-5225) Leak in RingBufferEventRepository for frequently updated flows

2018-05-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16487732#comment-16487732
 ] 

ASF GitHub Bot commented on NIFI-5225:
--

Github user markap14 commented on the issue:

https://github.com/apache/nifi/pull/2732
  
@FrederikP  all looks good here. I have merged the changes to master. 
Thanks for the fix!




[jira] [Commented] (NIFI-5225) Leak in RingBufferEventRepository for frequently updated flows

2018-05-23 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16487730#comment-16487730
 ] 

ASF subversion and git services commented on NIFI-5225:
---

Commit d75ba167cd93042c3f747f4aacb507617694bc0c in nifi's branch 
refs/heads/master from [~FrederikP]
[ https://git-wip-us.apache.org/repos/asf?p=nifi.git;h=d75ba16 ]

NIFI-5225: Purge event data from event repository when Connectable is removed

This closes #2732.

Signed-off-by: Mark Payne 




[jira] [Commented] (NIFI-5225) Leak in RingBufferEventRepository for frequently updated flows

2018-05-23 Thread Mark Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16487688#comment-16487688
 ] 

Mark Payne commented on NIFI-5225:
--

[~FrederikP] I think you're right - there is certainly room to improve how we
do the whole generateReport() stuff. I was noticing some of this the other day
as well when working on NIFI-5112/NIFI-950. There are quite a few
inefficiencies. EventSumValue, for instance, holds 15
AtomicLong/AtomicIntegers and a ConcurrentHashMap that get updated with each
session commit and then have to be read each time we call generateReport() -
for each component. It would probably be significantly more efficient to just
use non-thread-safe member variables and synchronize the methods. Atomics are
fast when you only need to update 1-2 of them, but when you get to that many,
they'll slow you down a good bit. I think this is just due to the way that the
system has evolved. We could also avoid calculating these for components we
don't care about, as you mentioned.

With the results that I found during my profiling, this wasn't much of an
issue. However, I was also mostly focusing on small flows, not flows with
1,000+ processors. That said, I think we should probably finish up the
outstanding issues around NIFI-950, etc. and see where that leaves us. We can
then certainly iterate to improve further if necessary.
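To illustrate the trade-off Mark describes, here is a hedged, self-contained sketch (hypothetical field names, not NiFi's actual EventSumValue) contrasting a counter built from many independent atomics with one that uses plain fields behind synchronized methods. With a dozen-plus counters touched on every update, a single monitor acquisition can be cheaper than a long series of atomic read-modify-write operations.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration of the atomics-vs-synchronized trade-off; field names are made up.
public class EventSumVariants {

    /** Variant 1: one atomic per counter; every update performs several independent CAS operations. */
    static class AtomicEventSum {
        final AtomicLong flowFilesIn = new AtomicLong();
        final AtomicLong flowFilesOut = new AtomicLong();
        final AtomicLong bytesIn = new AtomicLong();
        final AtomicLong bytesOut = new AtomicLong();
        final AtomicLong processingNanos = new AtomicLong();

        void add(long ffIn, long ffOut, long bIn, long bOut, long nanos) {
            flowFilesIn.addAndGet(ffIn);
            flowFilesOut.addAndGet(ffOut);
            bytesIn.addAndGet(bIn);
            bytesOut.addAndGet(bOut);
            processingNanos.addAndGet(nanos);
        }
    }

    /** Variant 2: plain fields, one lock; a single monitor acquisition covers all counters. */
    static class SynchronizedEventSum {
        private long flowFilesIn, flowFilesOut, bytesIn, bytesOut, processingNanos;

        synchronized void add(long ffIn, long ffOut, long bIn, long bOut, long nanos) {
            flowFilesIn += ffIn;
            flowFilesOut += ffOut;
            bytesIn += bIn;
            bytesOut += bOut;
            processingNanos += nanos;
        }

        synchronized long[] snapshot() {
            return new long[] { flowFilesIn, flowFilesOut, bytesIn, bytesOut, processingNanos };
        }
    }

    public static void main(String[] args) {
        SynchronizedEventSum sum = new SynchronizedEventSum();
        sum.add(1, 1, 100, 100, 5_000);
        sum.add(2, 1, 250, 200, 7_500);
        System.out.println(Arrays.toString(sum.snapshot()));
    }
}
```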


[jira] [Commented] (NIFI-5225) Leak in RingBufferEventRepository for frequently updated flows

2018-05-22 Thread Frederik Petersen (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16484072#comment-16484072
 ] 

Frederik Petersen commented on NIFI-5225:
-

[~joewitt] + [~markap14] thanks!


_Did you verify this addressed your case successfully?_ Yes, we are already
running a patched 1.5.0 version on our production systems, which no longer
shows the original issue.

_Are you in a position to try your usage and provide analysis on the latest
apache master?_ Currently we are running HDF-3.1.0.0 and I am not sure we want
to fiddle with it to use the latest master right now. We'd need to change our
development environment to more closely replicate what we have in production,
and I don't think we currently have the time for that. But I am intrigued by
the fixed issues (NIFI-5112 + NIFI-5136), as we are currently seeing high
latency for web requests.

Something I also noticed while looking into this leak is that
SecondPrecisionEventContainer.generateReport() takes up a relatively large
amount of time even when the cluster has just been started. Many important
resources (like createConnection/Ports/Processor) call
FlowController.getGroupStatus, which in turn leads to calling generateReport
for all processors/connections. When we instantiate templates or create
processors/connections using the API, this is done many times per component. I
think this is quite a waste of resources (and VisualVM sampling confirms that,
because close to 100% of the sampled Web Threads spend time in the
generateReport method). I don't even understand why these stats are extracted
for the creation of a component; it's probably some sort of oversight. And even
for the resources that need to supply these stats for the UI, I think it would
be good if we could set a flag when using the API to indicate that we are not
interested in these stats at all. Just some thoughts I had when reading through
the code today.
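As a purely hypothetical sketch of the flag suggested above (none of these names exist in NiFi's FlowController), a status call could accept an includeStats parameter and only walk the event repository when the caller actually needs the aggregated stats:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical names only; illustrates skipping stats aggregation for structure-only API calls.
class GroupStatusSketch {

    static class ComponentStats {
        long flowFilesIn;
        long flowFilesOut;
    }

    static class GroupStatus {
        List<String> componentIds;      // structural info, always populated
        ComponentStats aggregateStats;  // left null when stats are not requested
    }

    interface EventRepository {
        ComponentStats generateReport(String componentId);
    }

    static GroupStatus getGroupStatus(List<String> componentIds,
                                      EventRepository repo,
                                      boolean includeStats) {
        GroupStatus status = new GroupStatus();
        status.componentIds = componentIds;
        if (includeStats) {
            // Only walk the (potentially large) event repository when asked to.
            ComponentStats total = new ComponentStats();
            for (String id : componentIds) {
                ComponentStats s = repo.generateReport(id);
                total.flowFilesIn += s.flowFilesIn;
                total.flowFilesOut += s.flowFilesOut;
            }
            status.aggregateStats = total;
        }
        return status;
    }

    public static void main(String[] args) {
        EventRepository repo = id -> {
            ComponentStats s = new ComponentStats();
            s.flowFilesIn = 1;
            s.flowFilesOut = 1;
            return s;
        };
        GroupStatus structureOnly = getGroupStatus(Arrays.asList("a", "b"), repo, false);
        GroupStatus withStats = getGroupStatus(Arrays.asList("a", "b"), repo, true);
        System.out.println(structureOnly.aggregateStats == null); // true: repository not touched
        System.out.println(withStats.aggregateStats.flowFilesIn); // 2
    }
}
```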

I think these issues 'hit' us quite hard because we are running NiFi on 8
machines and have over a thousand processors in the flow. We've already thought
about splitting the flow up due to these issues. But with the patch for this
issue, I think we can start moving forward and hope that future releases make
everything smoother.


[jira] [Commented] (NIFI-5225) Leak in RingBufferEventRepository for frequently updated flows

2018-05-22 Thread Mark Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16483977#comment-16483977
 ] 

Mark Payne commented on NIFI-5225:
--

[~FrederikP] great find! And yes, I do agree there's an issue there. I will be
happy to review your patch today and hopefully we can have all of this sealed
up very shortly! And thanks for not only reporting the issue but also giving
great detail about it, and then even supplying a fix! Very much appreciated.



[jira] [Commented] (NIFI-5225) Leak in RingBufferEventRepository for frequently updated flows

2018-05-22 Thread Joseph Witt (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16483967#comment-16483967
 ] 

Joseph Witt commented on NIFI-5225:
---

Also, [~FrederikP], you might consider, if you can, testing with your patch on
the latest apache master.

You could see slower cluster performance due to
https://issues.apache.org/jira/browse/NIFI-5112

And there were other memory management issues related to classloading that were
recently sorted out with https://issues.apache.org/jira/browse/NIFI-5136

You should be free/fine to have consistent programmatic access to the REST API.

Are you in a position to try your usage and provide analysis on the latest 
apache master?



[jira] [Commented] (NIFI-5225) Leak in RingBufferEventRepository for frequently updated flows

2018-05-22 Thread Joseph Witt (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16483964#comment-16483964
 ] 

Joseph Witt commented on NIFI-5225:
---

[~FrederikP] this is extremely impressive! Did you verify this addressed your
case successfully? Talking with Mark Payne offline, he agreed there was a
problem here, and your update makes a ton of sense!



[jira] [Commented] (NIFI-5225) Leak in RingBufferEventRepository for frequently updated flows

2018-05-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16483946#comment-16483946
 ] 

ASF GitHub Bot commented on NIFI-5225:
--

GitHub user FrederikP opened a pull request:

https://github.com/apache/nifi/pull/2732

NIFI-5225: Purge event data from event repository when Connectable is 
removed

### For all changes:
- [x] Is there a JIRA ticket associated with this PR? Is it referenced 
 in the commit message?

- [x] Does your PR title start with NIFI-XXXX where XXXX is the JIRA number 
you are trying to resolve? Pay particular attention to the hyphen "-" character.

- [x] Has your PR been rebased against the latest commit within the target 
branch (typically master)?

- [x] Is your initial contribution a single, squashed commit?

### For code changes:
- [ ] Have you ensured that the full suite of tests is executed via mvn 
-Pcontrib-check clean install at the root nifi folder?
_Clean install ran through just fine, but contrib-check complained about an 
unrelated package_
- [x] Have you written or updated unit tests to verify your changes?
- [ ] ~~If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?~~ 
- [ ] ~~If applicable, have you updated the LICENSE file, including the 
main LICENSE file under nifi-assembly?~~
- [ ] ~~If applicable, have you updated the NOTICE file, including the main 
NOTICE file found under nifi-assembly?~~
- [ ] ~~If adding new Properties, have you added .displayName in addition 
to .name (programmatic access) for each of the new properties?~~

### For documentation related changes:
- ~~[ ] Have you ensured that format looks appropriate for the output in 
which it is rendered?~~

I introduced the option to purge data from the FlowFileEventRepository (the 
5 min ring buffer) to fix this:
https://issues.apache.org/jira/browse/NIFI-5225

And it works for our setup.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/FrederikP/nifi master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nifi/pull/2732.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2732


commit 4e5a118305c9513cca239c136c48239c501e9907
Author: Frederik Petersen 
Date:   2018-05-22T10:55:59Z

NIFI-5225: Purge event data from event repository when Connectable is 
removed



