Re: [heka] Issues experienced with Heka in Production (continued)

Rob Miller Fri, 10 Apr 2015 13:22:09 -0700

I'm sorry to hear that you've still been having problems, and while I wish it 
weren't so, I definitely understand the need to table your efforts and revert.


Even so, for posterity's sake, and in case you ever free up enough cycles to 
give it another go, there are a few settings I should mention that might have 
been helpful.

The first is LogstreamerInput's `oldest_duration` setting (see 
http://hekad.readthedocs.org/en/v0.9.1/config/inputs/logstreamer.html). That 
lets you skip processing for files whose modification time is older than the 
specified duration. It's not exactly the same as what you were accomplishing 
with your bootstrapJournals setting, but if you set it to something very low 
(e.g. a few seconds) the first time you start Heka against a machine with a 
large backlog, you could probably achieve a similar result.
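
As a rough sketch of that first-start config: the section name, 
`log_directory` and `file_match` values here are made-up placeholders, not 
taken from your setup; `oldest_duration` is the only setting that matters for 
this purpose.

```toml
# Hypothetical input section; directory and pattern are placeholders.
[app_logs]
type = "LogstreamerInput"
log_directory = "/var/log/myapp"
file_match = 'app\.log'
# Skip any file whose modified time is more than 10 seconds old.
oldest_duration = "10s"
```

Presumably you'd want to raise the value again after that first run, so that 
a log file which simply goes quiet for a while isn't skipped later on.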

The second is the pair of `queue_max_buffer_size` and `queue_full_action` 
settings that were added to the TcpOutput in 0.9 (see 
http://hekad.readthedocs.org/en/v0.9.1/config/outputs/tcp.html). With these you 
can protect yourself from having the queue grow past a given size, as well as 
specify what should happen if the queue hits the high water mark. In your case, 
it seems like the log processing was happening more quickly than the TcpOutput 
could ship the data. If `queue_full_action` were set to `block`, back 
pressure would apply and the log processing would pause while the TcpOutput had 
a chance to drain the queue for a while. This would likely be okay, as long as 
the logs weren't growing so quickly that Heka would never catch up.
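
A minimal sketch of such an output, assuming a 1 GiB cap is acceptable: the 
section name, address, matcher and size are placeholders I've picked for 
illustration, so adjust them to your environment.

```toml
# Hypothetical output section; address, matcher and size are placeholders.
[aggregator_output]
type = "TcpOutput"
address = "aggregator.example.com:5565"
message_matcher = "TRUE"
# Cap the disk queue at roughly 1 GiB (the value is in bytes).
queue_max_buffer_size = 1073741824
# At the cap, apply back pressure rather than dropping data or shutting down.
queue_full_action = "block"
```

As I read the docs, the other possible actions are `shutdown` and `drop`, so 
`block` is the one to pick if losing log data is worse than falling behind.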

Unfortunately I don't have enough info to be able to figure out what's going on 
with the memory usage. We've not seen runaway memory usage in our deployments 
at Mozilla, and I haven't yet been able to reproduce the memory leak that 
you were seeing with the DashboardOutput. In a perfect world I'd be able to 
turn on memory profiling (see 
http://hekad.readthedocs.org/en/v0.9.1/config/index.html#global-configuration-options)
 and run Heka until the RAM usage got out of hand to see if that shed any light 
on the matter, but, alas, it's not a perfect world, and your production 
environment is hardly the right place to be performing such experiments.
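
For reference, and only as a sketch for a non-production box: if I recall the 
option name correctly it is `memprof` in the global `[hekad]` section, so 
please check it against the page linked above. The output path is a 
placeholder.

```toml
[hekad]
# Enable memory profiling; output goes to this file (placeholder path).
memprof = "/var/tmp/hekad.memprof"
```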

Anyway, thanks for the post-mortem, and sorry again that you hit these issues.

-r


On 04/10/2015 09:08 AM, Matt Bostock wrote:

Following on from:
https://mail.mozilla.org/pipermail/heka/2015-April/000451.html

Thanks for taking the time to write the Lua plugin. We tried it in our
Staging environment and we started receiving metrics again for
high-traffic machines.

We saw that the plugin exceeded the 8MB sandbox memory limit on some
machines, but as you mentioned that can easily be tweaked in the
configuration.

Unfortunately, we saw hekad use significant amounts of memory and also
buffer heavily to disk for the TCP remote plugin such that it filled the
partition on many of our boxes in that environment (easily fixed as it's
a staging environment).

Please see the attached graph of hekad memory usage for our Staging
environment. The spike occurred when Heka was restarted to use the new
Lua plugin; the drop-off occurs when we stopped Heka on all machines.

I wonder if the back pressure issue was related to the fact that Heka,
by default, seems to try to read in all previous log entries. We
initially worked around that issue by bootstrapping the journals to
point to the end of each log file:
https://github.com/mozilla-services/heka/compare/dev...dcarley:logstreamer_bootstrap_journals

Due to current priorities and the amount of time we've gone without
centralised metrics or logging for some machines, we have had to remove
Heka and revert to our previous solution.

For anyone who, reading this, is considering Heka, I don't think we
would discount looking at it again in future. I personally hope we can
use it again at some point. In our case, I think we may have adopted it
slightly too early and given our current priorities we are unable to
spare the resources to troubleshoot and resolve the issues we saw.

Huge thanks to Rob and the Mozilla team who have been extremely
responsive and helpful.

Thanks,
Matt

--

Matt Bostock
Web Operations
Government Digital Service

e: [email protected]
<mailto:[email protected]>
t: +44 (0) 7917 173 573
a: 6th Floor, Aviation House, 125 Kingsway, London, WC2B 6NH


_______________________________________________
Heka mailing list
[email protected]
https://mail.mozilla.org/listinfo/heka

