Thanks for sharing this Rob, I'll make a note so we know to check these
options should we try Heka again in future. And thanks again for your help.

--

Matt Bostock
Web Operations
Government Digital Service

e: [email protected]
t: +44 (0) 7917 173 573
a: 6th Floor, Aviation House, 125 Kingsway, London, WC2B 6NH

On 10 April 2015 at 21:21, Rob Miller <[email protected]> wrote:

> I'm sorry to hear that you've still been having problems, and while I wish
> it weren't so, I definitely understand the need to table your efforts and
> revert.
>
> Even so, for posterity's sake, and in case you ever free up enough cycles
> to give it another go, there are a few settings I should mention that might
> have been helpful.
>
> The first is LogstreamerInput's `oldest_duration` setting (see
> http://hekad.readthedocs.org/en/v0.9.1/config/inputs/logstreamer.html).
> That lets you skip processing for files that have a modified time of older
> than the specified duration. It's not exactly the same as what you were
> accomplishing with your bootstrapJournals setting, but if you set it to
> something very low (e.g. a few seconds) the first time you start Heka
> against a machine with a large backlog you could probably achieve a similar
> result.
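>
> A minimal sketch of what that could look like (the section name, paths, and
> match pattern are placeholders; only `oldest_duration` is the setting in
> question):
>
> ```toml
> [syslog_input]
> type = "LogstreamerInput"
> log_directory = "/var/log"
> file_match = 'syslog'
> # On first start against a large backlog, skip any file whose
> # modification time is older than this duration.
> oldest_duration = "5s"
> ```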
>
> The second is the pair of `queue_max_buffer_size` and `queue_full_action`
> settings that were added to the TcpOutput in 0.9 (see
> http://hekad.readthedocs.org/en/v0.9.1/config/outputs/tcp.html). With
> these you can protect yourself
> from having the queue grow past a given size, as well as specify what
> should happen if the queue hits the high water mark. In your case, it seems
> like the log processing was happening more quickly than the TcpOutput could
> ship the data. If the `queue_full_action` was set to `block` then back
> pressure would apply and the log processing would pause while the TcpOutput
> had a chance to drain the queue for a while. This would likely be okay, as
> long as the logs weren't growing so quickly that Heka would never catch up.
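>
> For posterity, a sketch of how those two settings might be combined (the
> section name and address are placeholders):
>
> ```toml
> [tcp_aggregator]
> type = "TcpOutput"
> address = "aggregator.example.com:5565"
> # Cap the on-disk buffer at 1 GiB instead of letting it fill the partition.
> queue_max_buffer_size = 1073741824
> # When the cap is hit, apply back pressure (pause upstream processing)
> # rather than dropping messages or shutting down.
> queue_full_action = "block"
> ```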
>
> Unfortunately I don't have enough info to be able to figure out what's
> going on with the memory usage. We've not seen runaway memory usage in our
> deployments at Mozilla, and I haven't yet been able to reproduce the
> memory leak that you were seeing with the DashboardOutput. In a perfect
> world I'd be able to turn on memory profiling (see
> http://hekad.readthedocs.org/en/v0.9.1/config/index.html#global-configuration-options)
> and run Heka until the RAM usage got out of
> hand to see if that shed any light on the matter, but, alas, it's not a
> perfect world, and your production environment is hardly the right place to
> be performing such experiments.
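>
> Assuming the `memprof` global option documented at that link, enabling it
> would be a one-line addition to the global section (the output path is just
> an example):
>
> ```toml
> [hekad]
> # Write memory profiling data to this file so the allocation
> # pattern can be inspected after the fact.
> memprof = "/var/cache/hekad/memprof"
> ```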
>
> Anyway, thanks for the post-mortem, and sorry again that you hit these
> issues.
>
> -r
>
>
>
> On 04/10/2015 09:08 AM, Matt Bostock wrote:
>
>> Following on from:
>> https://mail.mozilla.org/pipermail/heka/2015-April/000451.html
>>
>> Thanks for taking the time to write the Lua plugin. We tried it in our
>> Staging environment and we started receiving metrics again for
>> high-traffic machines.
>>
>> We saw that the plugin was exceeding the 8MB sandbox memory limit on some
>> machines, but as you mentioned that can easily be tweaked in the
>> configuration.
>>
>> Unfortunately, we saw hekad use significant amounts of memory and also
>> buffer heavily to disk for the TCP remote plugin such that it filled the
>> partition on many of our boxes in that environment (easily fixed as it's
>> a staging environment).
>>
>> Please see the attached graph of hekad memory usage for our Staging
>> environment. The spike occurred when Heka was restarted to use the new
>> Lua plugin; the drop-off occurs when we stopped Heka on all machines.
>>
>> I wonder if the back pressure issue was related to the fact that Heka,
>> by default, seems to try to read in all previous log entries. We
>> initially worked around that issue by bootstrapping the journals to
>> point to the end of each log file:
>> https://github.com/mozilla-services/heka/compare/dev...dcarley:logstreamer_bootstrap_journals
>>
>> Due to current priorities and the amount of time we've gone without
>> centralised metrics or logging for some machines, we have had to remove
>> Heka and revert to our previous solution.
>>
>> For anyone reading this who is considering Heka, I don't think we
>> would discount looking at it again in future. I personally hope we can
>> use it again at some point. In our case, I think we may have adopted it
>> slightly too early and given our current priorities we are unable to
>> spare the resources to troubleshoot and resolve the issues we saw.
>>
>> Huge thanks to Rob and the Mozilla team who have been extremely
>> responsive and helpful.
>>
>> Thanks,
>> Matt
>>
>> --
>>
>> Matt Bostock
>> Web Operations
>> Government Digital Service
>>
>> e: [email protected]
>> t: +44 (0) 7917 173 573
>> a: 6th Floor, Aviation House, 125 Kingsway, London, WC2B 6NH
>>
>>
_______________________________________________
Heka mailing list
[email protected]
https://mail.mozilla.org/listinfo/heka
