[ https://issues.apache.org/jira/browse/STORM-2786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Joseph Evans resolved STORM-2786.
----------------------------------------
       Resolution: Fixed
    Fix Version/s: 1.0.6
                   1.1.2
                   1.2.0
                   2.0.0

Merged the fix into master, 1.x, 1.1.x, and 1.0.x.

> Ackers leak tracking info on failure and lots of other cases.
> -------------------------------------------------------------
>
>                 Key: STORM-2786
>                 URL: https://issues.apache.org/jira/browse/STORM-2786
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-client, storm-core
>    Affects Versions: 0.9.1-incubating, 0.10.0, 1.0.0, 2.0.0
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 2.0.0, 1.2.0, 1.1.2, 1.0.6
>
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Over the weekend we had an incident where ackers were running out of memory
> at a really scary rate. It turns out that they were having a lot of
> failures, for an unrelated reason, but each of the failures was resulting in
> tuple tracking info being leaked because...
> We don't send ticks to any system components ever...
> https://github.com/apache/storm/blob/124acb92dff04a57b530ab4d95a698abc8ff46d9/storm-client/src/jvm/org/apache/storm/executor/Executor.java#L384
> and ackers are system components.
> Since the acker only rotates its tracking map when a tick arrives, the map was
> never rotated, and all failed tuples
> https://github.com/apache/storm/blob/124acb92dff04a57b530ab4d95a698abc8ff46d9/storm-client/src/jvm/org/apache/storm/daemon/Acker.java#L97-L103
> were never deleted from the map.
> This leak eventually made the ackers crash, and when they came back up the
> other components kept blasting them with messages that would never be fully
> acked, which also leaked because of the tick problem.
> Looking back, this has been in every release since 0.9.1-incubating.
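To make the mechanism concrete, here is a minimal Java sketch of why a tick-driven tracking map leaks when ticks never arrive. It is a simplified model under stated assumptions, not Storm's code: `TickLeakDemo`, `BucketedExpiryMap`, `simulate`, the two-bucket layout, and the 1-in-10 failure pattern are all hypothetical. Fully acked tuple trees are removed explicitly, while "failed" entries are left behind and can only be reclaimed when `rotate()` (standing in for the tick) evicts the oldest bucket.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class TickLeakDemo {

    /**
     * Hypothetical, simplified bucketed expiry map (not Storm's RotatingMap API).
     * Entries are only dropped by an explicit remove() or when rotate() evicts
     * the oldest bucket.
     */
    static final class BucketedExpiryMap<K, V> {
        private final Deque<Map<K, V>> buckets = new ArrayDeque<>();

        BucketedExpiryMap(int numBuckets) {
            for (int i = 0; i < numBuckets; i++) {
                buckets.addLast(new HashMap<>());
            }
        }

        /** Inserts into the newest bucket, dropping any older copy of the key. */
        void put(K key, V value) {
            for (Map<K, V> bucket : buckets) {
                bucket.remove(key);
            }
            buckets.peekFirst().put(key, value);
        }

        /** Removes the key from whichever bucket holds it. */
        void remove(K key) {
            for (Map<K, V> bucket : buckets) {
                bucket.remove(key);
            }
        }

        /** Evicts the oldest bucket and starts a fresh newest one. */
        Map<K, V> rotate() {
            Map<K, V> expired = buckets.removeLast();
            buckets.addFirst(new HashMap<>());
            return expired;
        }

        int size() {
            int total = 0;
            for (Map<K, V> bucket : buckets) {
                total += bucket.size();
            }
            return total;
        }
    }

    /**
     * Tracks one entry per tuple tree. Fully acked trees are removed right away;
     * every failEvery-th tree "fails" and is left in the map (a simplified
     * stand-in for the leaked entries). When deliverTicks is true, a tick
     * (rotate) happens after every ticksEvery trees.
     * Returns the number of entries still tracked at the end.
     */
    public static int simulate(int trees, boolean deliverTicks, int ticksEvery, int failEvery) {
        BucketedExpiryMap<Long, String> pending = new BucketedExpiryMap<>(2);
        for (long id = 0; id < trees; id++) {
            pending.put(id, "in-flight");
            boolean failed = id % failEvery == 0;
            if (!failed) {
                pending.remove(id); // normal completion cleans up the entry
            }
            if (deliverTicks && (id + 1) % ticksEvery == 0) {
                pending.rotate(); // only a tick ever expires the failed leftovers
            }
        }
        return pending.size();
    }

    public static void main(String[] args) {
        System.out.println("with ticks:    " + simulate(100_000, true, 1_000, 10));
        System.out.println("without ticks: " + simulate(100_000, false, 1_000, 10));
    }
}
```

With these numbers, `simulate` returns 100 when ticks are delivered (only the most recent window of failed entries is still held) and 10000 when they are not (every failed entry ever seen). In the no-tick case, memory therefore grows with the total number of failures rather than with the recent failure rate.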
> It appears to have been introduced by this commit:
> https://github.com/apache/storm/commit/483ce454a3b2cd31b5d1c34e9365346459b358a8
> So every Apache release has this problem (which is the only reason I have not
> marked this as a blocker: apparently it is not bad enough that anyone has
> noticed in the past 4 years).

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)