Hi Rob,

Thanks for taking the time to respond, that's very helpful. I had wondered
why nginx_access_stat's input channel was 99 and not 100; that's much
clearer to me now :-)

That's an interesting point about the StatFilter design. I'm curious: would
a customised StatFilter plugin be more performant if written in Go? I
guess Lua might be preferable so we can make use of the sandbox, for the
reasons described at the top of this page:

https://hekad.readthedocs.org/en/latest/sandbox/

I'm out of the office for a week from today but I have passed this on to my
colleagues and they'll be in touch if they have any follow-up questions.

Thanks again,
Matt

--

Matt Bostock
Web Operations
Government Digital Service

e: [email protected]
t: +44 (0) 7917 173 573
a: 6th Floor, Aviation House, 125 Kingsway, London, WC2B 6NH

On 17 March 2015 at 22:31, Rob Miller <[email protected]> wrote:

> Oof. It's never fun to hear about folks having problems using your
> software, and it's even worse when the headaches are enough to make you
> want to roll back your deployment. Even so, I appreciate you taking the
> time to write up this post-mortem; we're very motivated to resolve the
> issues that have been causing you pain. More on the details in-line below:
>
> On 03/17/2015 11:36 AM, Matt Bostock wrote:
>
>> *## Memory leak from dashboard plugin*
>>
>> We initially saw Heka use a significant amount of memory over time, more
>> than 3GB in some cases. You can see this is the attached graph which
>> shows Heka's resident memory usage over the past 6 weeks across
>> approximately 100 servers. The drops show when we either restarted Heka
>> or when the OOM killer killed the process.
>>
> Okay, we'll take a look and will resolve this leak. I've opened an issue
> for tracking: https://github.com/mozilla-services/heka/issues/1422.
>
>> *## Idle packs; logs and metrics not received for some machines*
>>
>> While metrics and logs are successfully received by our Graphite and
>> Elasticsearch machines for our application servers, *logs and metrics are
>> not being received for certain high-traffic servers in our stack* (i.e.
>> cache/router machines). This is the issue that has caused us the most
>> pain.
>>
>> We see this warning for idle packs recurring in our logs:
>>
>> https://gist.github.com/mattbostock/241ae690a4c09acd8145
>>
>> In a bid to try to alleviate some of the pressure on the plugins, we
>> tried increasing plugin max chan size from 30 to 300 (as you can see in
>> the configuration linked to above). Just for reference, this did not
>> have any significant impact on Heka's memory usage.
>>
> No, it wouldn't. You're changing the size of the input channels for all of
> the plugins, which doesn't really change memory consumption. Changing the
> pool size would make a difference, since that changes the number of message
> structs that are allocated.
>
>> This is the output from Heka's reporting mechanism `sudo kill
>> -SIGUSR1 $(pgrep hekad)` on one of our high-traffic cache nodes and the
>> central logging box which receives the logs and pushes them into
>> Elasticsearch:
>>
>> https://gist.github.com/mattbostock/ed5213b0607effc6813d
>>
>> The Heka logs on the central logging node are otherwise mostly empty
>> apart from the occasional Elasticsearch connection error (i.e. once a
>> day or less).
>>
>> As you can see from the reports linked, inputRecycleChan
>> and injectRecycleChan are both at full capacity (100). Could this be due
>> to a failure in one of the plugins?
>>
> The "capacity" of a channel is the number of slots that a channel has,
> that remains the same for the entire life of the Heka process. The "length"
> of a channel is how many items are in the channel at a given time, and
> that's constantly changing, obviously. So when I look at your report from
> the cache node I can see that the inputRecycleChan is completely empty.
> Also, looking a bit further down, pretty much all of the packs are sitting
> on the nginx_access_stat filter's input channel, which has a length of 99.
> So clearly the StatFilter is what's gumming up the works.
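The capacity/length distinction Rob describes is the standard behaviour of Go
buffered channels, which is what Heka's pack channels are. A minimal sketch
(plain Go, not Heka code):

```go
package main

import "fmt"

func main() {
	// A buffered channel's capacity is fixed when it is created;
	// its length is how many items are queued at this moment.
	ch := make(chan int, 100) // capacity 100, like the default pool_size
	for i := 0; i < 99; i++ {
		ch <- i
	}
	fmt.Println(cap(ch)) // prints 100: total slots, constant for the channel's life
	fmt.Println(len(ch)) // prints 99: items currently waiting, like the 99 queued packs
}
```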
>
> You've got a deadlock happening. Because your pool_size is the default of
> 100, that means you have 100 packs (i.e. messages) for all of the inputs to
> use. But you've cranked your plugin channel size all the way up to 300.
> Currently 99 of your 100 packs are sitting on the StatFilter's InChan. The
> other pack is inside the stat filter itself, being currently processed. The
> StatFilter is extracting data from it, and it's trying to hand that data to
> the StatAccumInput to be tallied up.
>
> The StatAccumInput, however, is trying to flush the last interval of
> aggregated data. To do that, it needs a pack from the inputRecycleChan.
> Which is empty. Because all of the packs are tied up in the StatFilter's
> input channel. Which can't move until the StatFilter can give a stat to the
> StatAccumInput. Rinse, lather, repeat.
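The deadlock pattern Rob walks through can be sketched in plain Go. This is a
simplified stand-in, not Heka's actual types, with the pool shrunk from 100 to
3 for brevity:

```go
package main

import "fmt"

func main() {
	const poolSize = 3 // stand-in for pool_size (Heka's default is 100)

	// Every message pack comes from a fixed pool, recycled through
	// this channel when a plugin is finished with it.
	recycle := make(chan int, poolSize)
	for i := 0; i < poolSize; i++ {
		recycle <- i
	}

	// A plugin input channel sized larger than the pool (like the
	// 300-slot channel vs the 100-pack pool above) can absorb every
	// pack in existence.
	filterIn := make(chan int, poolSize*3)
	for len(recycle) > 0 {
		filterIn <- <-recycle
	}

	// Now a component that needs a fresh pack to make progress (as the
	// StatAccumInput does when flushing an interval) finds the recycle
	// channel empty. A plain `<-recycle` here would block forever;
	// select with a default branch shows the stuck state instead.
	select {
	case p := <-recycle:
		fmt.Println("got pack", p)
	default:
		fmt.Println("recycle channel empty: a flush here would block forever")
	}
}
```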
>
> So you're probably seeing by now that cranking the chan size up to 300
> wasn't the right choice. You don't really want a situation where it's easy
> for all of your packs to be tied up in a single plugin's channel like that.
> There are two possible ways I can see to resolve this issue. The first is
> trivially easy, but it might not be sufficient. The second takes a bit more
> work, but I can help if it comes to that, and you're willing to give it
> another go.
>
> The easy solution is to remove the plugin_chansize change, and to bump
> your global pool_size setting up to about 130. This will mean that even if
> the match chan and the in chan for your StatFilter and your TcpOutput are
> completely full, there will still be some packs left for the StatAccumInput
> to use. If you try this and data continues to flow w/o getting gummed up,
> great.
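In hekad's TOML config, the change Rob suggests would presumably look
something like the fragment below. I believe the global option is spelled
`poolsize` (and the channel setting `plugin_chansize`), but check the docs
for your Heka version:

```toml
[hekad]
# Raise the global pack pool from the default 100 so some packs remain
# free for the StatAccumInput even when the StatFilter's and TcpOutput's
# channels are full. Leave plugin_chansize at its default (30) rather
# than the 300 we had set.
poolsize = 130
```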
>
> Unfortunately, it might be the case that you're producing data more
> quickly than the StatFilter can process it. The StatFilter works, but it's
> trying to handle the general case and it's not very efficient, alas. If it
> can't handle the volume of messages it's receiving, then there's gonna be
> backpressure, and tweaking the pool and channel sizes isn't going to help.
>
> If that's the case, the other option is to stop using the StatFilter and
> StatAccumInput altogether. Instead we can write some Lua in a SandboxFilter
> that will catch the same messages the StatFilter is currently catching,
> much more efficiently extract only the data you care about from those
> messages, and periodically emit the accumulated tallies in the same format
> that is currently being generated by the StatAccumInput. This would
> probably be about 30 minutes of work; if you're interested, I'd be happy
> to whip something up for you to try.
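As a rough illustration of the kind of SandboxFilter Rob is proposing, here
is a hypothetical sketch using Heka's Lua sandbox API (`process_message`,
`timer_event`, `read_message`, `inject_payload`); the message field name
`Fields[status]` and the `stats.nginx` prefix are made-up placeholders, and
the sketch is untested:

```lua
local counts = {}

-- Called once per matched message: pull out only the field we care
-- about and bump a counter. Returning 0 signals success.
function process_message()
    local status = read_message("Fields[status]") or "unknown"
    counts[status] = (counts[status] or 0) + 1
    return 0
end

-- Called on the configured ticker interval: emit the accumulated
-- tallies as a graphite-style payload, then reset for the next window.
function timer_event(ns)
    local lines = {}
    local now = math.floor(ns / 1e9)
    for k, v in pairs(counts) do
        lines[#lines + 1] = string.format("stats.nginx.%s %d %d", k, v, now)
    end
    inject_payload("txt", "nginx_stats", table.concat(lines, "\n"))
    counts = {}
end
```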
>
> Anyway, sorry for the headaches, and thanks again for letting us know
> about them.
>
> Cheers!
>
> -r
>
>
_______________________________________________
Heka mailing list
[email protected]
https://mail.mozilla.org/listinfo/heka
