Oof. It's never fun to hear about folks having problems using your software, 
and it's even worse when the headaches are enough to make you want to roll back 
your deployment. Even so, I appreciate you taking the time to write up this 
post-mortem; we're very motivated to resolve the issues that have been causing 
you pain. More on the details in-line below:

On 03/17/2015 11:36 AM, Matt Bostock wrote:
## Memory leak from dashboard plugin

We initially saw Heka use a significant amount of memory over time, more
than 3GB in some cases. You can see this in the attached graph, which
shows Heka's resident memory usage over the past 6 weeks across
approximately 100 servers. The drops show when we either restarted Heka
or when the OOM killer killed the process.
Okay, we'll take a look and will resolve this leak. I've opened an issue for 
tracking: https://github.com/mozilla-services/heka/issues/1422.
## Idle packs; logs and metrics not received for some machines

While metrics and logs are successfully received by our Graphite and
Elasticsearch machines for our application servers, *logs and metrics are
not being received for certain high-traffic servers in our stack* (i.e.
cache/router machines). This is the issue that has caused us the most pain.

We see this warning for idle packs recurring in our logs:

https://gist.github.com/mattbostock/241ae690a4c09acd8145

In a bid to try to alleviate some of the pressure on the plugins, we
tried increasing plugin max chan size from 30 to 300 (as you can see in
the configuration linked to above). Just for reference, this did not
have any significant impact on Heka's memory usage.
No, it wouldn't. You're changing the size of the input channels for all of the 
plugins, which doesn't really change memory consumption. Changing the pool size 
would make a difference, since that changes the number of message structs that 
are allocated.
This is the output from Heka's reporting mechanism (`sudo kill
-SIGUSR1 $(pgrep hekad)`) on one of our high-traffic cache nodes and the
central logging box which receives the logs and pushes them into
Elasticsearch:

https://gist.github.com/mattbostock/ed5213b0607effc6813d

The Heka logs on the central logging node are otherwise mostly empty
apart from the occasional Elasticsearch connection error (i.e. once a
day or less).

As you can see from the reports linked, inputRecycleChan
and injectRecycleChan are both at full capacity (100). Could this be due
to a failure in one of the plugins?
The "capacity" of a channel is the number of slots the channel has; that remains 
the same for the entire life of the Heka process. The "length" of a channel is how 
many items are in the channel at a given time, and that's constantly changing, 
obviously. So when I look at your report from the cache node I can see that the 
inputRecycleChan is completely empty. Also, looking a bit further down, pretty 
much all of the packs are sitting on the nginx_access_stat filter's input channel, 
which has a length of 99. So clearly the StatFilter is what's gumming up the 
works.

You've got a deadlock happening. Because your pool_size is the default of 100, 
that means you have 100 packs (i.e. messages) for all of the inputs to use. But 
you've cranked your plugin channel size all the way up to 300. Currently 99 of 
your 100 packs are sitting on the StatFilter's InChan. The other pack is inside 
the stat filter itself, being currently processed. The StatFilter is extracting 
data from it, and it's trying to hand that data to the StatAccumInput to be 
tallied up.

The StatAccumInput, however, is trying to flush the last interval of aggregated 
data. To do that, it needs a pack from the inputRecycleChan. Which is empty. 
Because all of the packs are tied up in the StatFilter's input channel. Which 
can't move until the StatFilter can give a stat to the StatAccumInput. Rinse, 
lather, repeat.

So you're probably seeing by now that cranking the chan size up to 300 wasn't 
the right choice. You don't really want a situation where it's easy for all of 
your packs to be tied up in a single plugin's channel like that. There are two 
possible ways I can see to resolve this issue. The first is trivially easy, but 
it might not be sufficient. The second takes a bit more work, but I can help if 
it comes to that, and you're willing to give it another go.

The easy solution is to remove the plugin_chansize change, and to bump your 
global pool_size setting up to about 130. This will mean that even if the match 
chan and the in chan for your StatFilter and your TcpOutput are completely 
full, there will still be some packs left for the StatAccumInput to use. If you 
try this and data continues to flow w/o getting gummed up, great.
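Concretely, that change would look something like the fragment below. This is a sketch against the `[hekad]` global section, not a drop-in file; double-check the exact option spellings (`poolsize`, `plugin_chansize`) against the hekad docs for your version.

```toml
[hekad]
# Bump the global pack pool above the default of 100, so that even
# with the StatFilter's match chan and in chan (and the TcpOutput's)
# completely full, the StatAccumInput can still get a pack to flush.
poolsize = 130

# ...and drop the plugin_chansize = 300 override entirely, so no
# single plugin's input channel can swallow the whole pool.
```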

Unfortunately, it might be the case that you're producing data more quickly 
than the StatFilter can process it. The StatFilter works, but it's trying to 
handle the general case and it's not very efficient, alas. If it can't handle 
the volume of messages it's receiving, then there's gonna be backpressure, and 
tweaking the pool and channel sizes isn't going to help.

If that's the case, the other option is to stop using the StatFilter and 
StatAccumInput altogether. Instead we can write some Lua in a SandboxFilter 
that will catch the same messages the StatFilter is currently catching, much 
more efficiently extract only the data you care about from those messages, and 
periodically emit the accumulated tallies in the same format that is currently 
being generated by the StatAccumInput. This would probably be about 30 minutes 
of work; if you're interested, I'd be happy to whip something up for you to try.

Anyway, sorry for the headaches, and thanks again for letting us know about 
them.

Cheers!

-r

_______________________________________________
Heka mailing list
[email protected]
https://mail.mozilla.org/listinfo/heka