Hi Rob,

Thanks for taking the time to respond; that's very helpful. I had wondered why nginx_access_stat's input channel was at 99 and not 100; that's much clearer to me now :-)
That's an interesting point about the StatFilter design. I'm curious: would a customised StatFilter plugin be more performant if written in Go? I guess Lua might be preferable, to make use of the sandbox and for the reasons described at the top of this page: https://hekad.readthedocs.org/en/latest/sandbox/

I'm out of the office for a week from today, but I have passed this on to my colleagues and they'll be in touch if they have any follow-up questions.

Thanks again,
Matt

--
Matt Bostock
Web Operations
Government Digital Service
e: [email protected]
t: +44 (0) 7917 173 573
a: 6th Floor, Aviation House, 125 Kingsway, London, WC2B 6NH

On 17 March 2015 at 22:31, Rob Miller <[email protected]> wrote:

> Oof. It's never fun to hear about folks having problems using your
> software, and it's even worse when the headaches are enough to make you
> want to roll back your deployment. Even so, I appreciate you taking the
> time to write up this post-mortem; we're very motivated to resolve the
> issues that have been causing you pain. More on the details in-line below:
>
> On 03/17/2015 11:36 AM, Matt Bostock wrote:
>
>> *## Memory leak from dashboard plugin*
>>
>> We initially saw Heka use a significant amount of memory over time, more
>> than 3GB in some cases. You can see this in the attached graph, which
>> shows Heka's resident memory usage over the past 6 weeks across
>> approximately 100 servers. The drops show when we either restarted Heka
>> or the OOM killer killed the process.
>
> Okay, we'll take a look and will resolve this leak. I've opened an issue
> for tracking: https://github.com/mozilla-services/heka/issues/1422.
>
>> *## Idle packs; logs and metrics not received for some machines*
>>
>> While metrics and logs are successfully received by our Graphite and
>> Elasticsearch machines for our application servers, *logs and metrics
>> are not being received for certain high-traffic servers in our stack*
>> (i.e. cache/router machines).
>> This is the issue that has caused us the most pain.
>>
>> We see this warning for idle packs recurring in our logs:
>>
>> https://gist.github.com/mattbostock/241ae690a4c09acd8145
>>
>> In a bid to try to alleviate some of the pressure on the plugins, we
>> tried increasing the plugin max chan size from 30 to 300 (as you can
>> see in the configuration linked to above). Just for reference, this did
>> not have any significant impact on Heka's memory usage.
>
> No, it wouldn't. You're changing the size of the input channels for all of
> the plugins, which doesn't really change memory consumption. Changing the
> pool size would make a difference, since that changes the number of message
> structs that are allocated.
>
>> This is the output from Heka's reporting mechanism (`sudo kill
>> -SIGUSR1 $(pgrep hekad)`) on one of our high-traffic cache nodes and
>> the central logging box which receives the logs and pushes them into
>> Elasticsearch:
>>
>> https://gist.github.com/mattbostock/ed5213b0607effc6813d
>>
>> The Heka logs on the central logging node are otherwise mostly empty,
>> apart from the occasional Elasticsearch connection error (i.e. once a
>> day or less).
>>
>> As you can see from the reports linked, inputRecycleChan
>> and injectRecycleChan are both at full capacity (100). Could this be due
>> to a failure in one of the plugins?
>
> The "capacity" of a channel is the number of slots that a channel has,
> which remains the same for the entire life of the Heka process. The
> "length" of a channel is how many items are in the channel at a given
> time, and that's constantly changing, obviously. So when I look at your
> report from the cache node I can see that the inputRecycleChan is
> completely empty. Also, looking a bit further down, pretty much all of
> the packs are sitting on the nginx_access_stat filter's input channel,
> which has a length of 99. So clearly the StatFilter is what's gumming up
> the works.
>
> You've got a deadlock happening.
> Because your pool_size is the default of 100, that means you have 100
> packs (i.e. messages) for all of the inputs to use. But you've cranked
> your plugin channel size all the way up to 300. Currently 99 of your 100
> packs are sitting on the StatFilter's InChan. The other pack is inside
> the stat filter itself, being currently processed. The StatFilter is
> extracting data from it, and it's trying to hand that data to the
> StatAccumInput to be tallied up.
>
> The StatAccumInput, however, is trying to flush the last interval of
> aggregated data. To do that, it needs a pack from the inputRecycleChan.
> Which is empty. Because all of the packs are tied up in the StatFilter's
> input channel. Which can't move until the StatFilter can give a stat to
> the StatAccumInput. Rinse, lather, repeat.
>
> So you're probably seeing by now that cranking the chan size up to 300
> wasn't the right choice. You don't really want a situation where it's
> easy for all of your packs to be tied up in a single plugin's channel
> like that. There are two possible ways I can see to resolve this issue.
> The first is trivially easy, but it might not be sufficient. The second
> takes a bit more work, but I can help if it comes to that, and you're
> willing to give it another go.
>
> The easy solution is to remove the plugin_chansize change, and to bump
> your global pool_size setting up to about 130. This will mean that even
> if the match chan and the in chan for your StatFilter and your TcpOutput
> are completely full, there will still be some packs left for the
> StatAccumInput to use. If you try this and data continues to flow w/o
> getting gummed up, great.
>
> Unfortunately, it might be the case that you're producing data more
> quickly than the StatFilter can process it. The StatFilter works, but
> it's trying to handle the general case and it's not very efficient, alas.
> If it can't handle the volume of messages it's receiving, then there's
> gonna be backpressure, and tweaking the pool and channel sizes isn't
> going to help.
>
> If that's the case, the other option is to stop using the StatFilter and
> StatAccumInput altogether. Instead we can write some Lua in a
> SandboxFilter that will catch the same messages the StatFilter is
> currently catching, much more efficiently extract only the data you care
> about from those messages, and periodically emit the accumulated tallies
> in the same format that is currently being generated by the
> StatAccumInput. This would probably be about 30 minutes of work; if
> you're interested, I'd be happy to whip something up for you to try out.
>
> Anyway, sorry for the headaches, and thanks again for letting us know
> about them.
>
> Cheers!
>
> -r
_______________________________________________
Heka mailing list
[email protected]
https://mail.mozilla.org/listinfo/heka

