Oof. It's never fun to hear about folks having problems using your software, 
and it's even worse when the headaches are enough to make you want to roll back 
your deployment. Even so, I appreciate you taking the time to write up this 
post-mortem; we're very motivated to resolve the issues that have been causing 
you pain. More on the details in-line below:

On 03/17/2015 11:36 AM, Matt Bostock wrote:
## Memory leak from dashboard plugin

We initially saw Heka use a significant amount of memory over time, more
than 3GB in some cases. You can see this in the attached graph, which
shows Heka's resident memory usage over the past 6 weeks across
approximately 100 servers. The drops show when we either restarted Heka
or when the OOM killer killed the process.
Okay, we'll take a look and will resolve this leak. I've opened an issue for 
tracking: https://github.com/mozilla-services/heka/issues/1422.
## Idle packs; logs and metrics not received for some machines

While metrics and logs are successfully received by our Graphite and
Elasticsearch machines for our application servers, *logs and metrics are
not being received for certain high-traffic servers in our stack* (i.e.
cache/router machines). This is the issue that has caused us the most pain.

We see this warning for idle packs recurring in our logs:

https://gist.github.com/mattbostock/241ae690a4c09acd8145

In a bid to try to alleviate some of the pressure on the plugins, we
tried increasing plugin max chan size from 30 to 300 (as you can see in
the configuration linked to above). Just for reference, this did not
have any significant impact on Heka's memory usage.
No, it wouldn't. You're changing the size of the input channels for all of the 
plugins, which doesn't really change memory consumption. Changing the pool size 
would make a difference, since that changes the number of message structs that 
are allocated.
This is the output from Heka's reporting mechanism (`sudo kill
-SIGUSR1 $(pgrep hekad)`) on one of our high-traffic cache nodes and the
central logging box which receives the logs and pushes them into
Elasticsearch:

https://gist.github.com/mattbostock/ed5213b0607effc6813d

The Heka logs on the central logging node are otherwise mostly empty
apart from the occasional Elasticsearch connection error (i.e. once a
day or less).

As you can see from the reports linked, inputRecycleChan
and injectRecycleChan are both at full capacity (100). Could this be due
to a failure in one of the plugins?
The "capacity" of a channel is the number of slots the channel has; that remains 
the same for the entire life of the Heka process. The "length" of a channel is how 
many items are in the channel at a given time, and that's constantly changing, 
obviously. So when I look at your report from the cache node I can see that the 
inputRecycleChan is completely empty. Also, looking a bit further down, pretty 
much all of the packs are sitting on the nginx_access_stat filter's input channel, 
which has a length of 99. So clearly the StatFilter is what's gumming up the 
works.

You've got a deadlock happening. Because your pool_size is the default of 100, 
that means you have 100 packs (i.e. messages) for all of the inputs to use. But 
you've cranked your plugin channel size all the way up to 300. Currently 99 of 
your 100 packs are sitting on the StatFilter's InChan. The other pack is inside 
the stat filter itself, being currently processed. The StatFilter is extracting 
data from it, and it's trying to hand that data to the StatAccumInput to be 
tallied up.

The StatAccumInput, however, is trying to flush the last interval of aggregated 
data. To do that, it needs a pack from the inputRecycleChan. Which is empty. 
Because all of the packs are tied up in the StatFilter's input channel. Which 
can't move until the StatFilter can give a stat to the StatAccumInput. Rinse, 
lather, repeat.

So you're probably seeing by now that cranking the chan size up to 300 wasn't 
the right choice. You don't really want a situation where it's easy for all of 
your packs to be tied up in a single plugin's channel like that. There are two 
possible ways I can see to resolve this issue. The first is trivially easy, but 
it might not be sufficient. The second takes a bit more work, but I can help if 
it comes to that, and you're willing to give it another go.

The easy solution is to remove the plugin_chansize change, and to bump your 
global pool_size setting up to about 130. This will mean that even if the match 
chan and the in chan for your StatFilter and your TcpOutput are completely 
full, there will still be some packs left for the StatAccumInput to use. If you 
try this and data continues to flow w/o getting gummed up, great.
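Concretely, that change would look something like the fragment below. This is a sketch against the `[hekad]` global section, not a drop-in file; double-check the exact option spellings (`poolsize`, `plugin_chansize`) against the hekad docs for your version.

```toml
[hekad]
# Bump the global pack pool above the default of 100, so that even
# with the StatFilter's match chan and in chan (and the TcpOutput's)
# completely full, the StatAccumInput can still get a pack to flush.
poolsize = 130

# ...and drop the plugin_chansize = 300 override entirely, so no
# single plugin's input channel can swallow the whole pool.
```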

Unfortunately, it might be the case that you're producing data more quickly 
than the StatFilter can process it. The StatFilter works, but it's trying to 
handle the general case and it's not very efficient, alas. If it can't handle 
the volume of messages it's receiving, then there's gonna be backpressure, and 
tweaking the pool and channel sizes isn't going to help.

If that's the case, the other option is to stop using the StatFilter and 
StatAccumInput altogether. Instead we can write some Lua in a SandboxFilter 
that will catch the same messages the StatFilter is currently catching, much 
more efficiently extract only the data you care about from those messages, and 
periodically emit the accumulated tallies in the same format that is currently 
being generated by the StatAccumInput. This would probably be about 30 minutes 
of work; if you're interested, I'd be happy to whip something up for you to try.

Anyway, sorry for the headaches, and thanks again for letting us know about 
them.

Cheers!

-r

_______________________________________________
Heka mailing list
[email protected]
https://mail.mozilla.org/listinfo/heka