I'm sorry to hear that you've still been having problems, and while I wish it weren't so, I definitely understand the need to table your efforts and revert.
Even so, for posterity's sake, and in case you ever free up enough cycles to give it another go, there are a few settings I should mention that might have been helpful.

The first is LogstreamerInput's `oldest_duration` setting (see http://hekad.readthedocs.org/en/v0.9.1/config/inputs/logstreamer.html). That lets you skip processing for files whose modified time is older than the specified duration. It's not exactly the same as what you were accomplishing with your bootstrapJournals setting, but if you set it to something very low (e.g. a few seconds) the first time you start Heka against a machine with a large backlog, you could probably achieve a similar result.

The others are the `queue_max_buffer_size` and `queue_full_action` settings that were added to the TcpOutput in 0.9 (see http://hekad.readthedocs.org/en/v0.9.1/config/outputs/tcp.html). With these you can protect yourself from having the queue grow past a given size, as well as specify what should happen if the queue hits the high water mark. In your case, it seems like the log processing was happening more quickly than the TcpOutput could ship the data. If `queue_full_action` were set to `block`, then back pressure would apply and the log processing would pause while the TcpOutput had a chance to drain the queue for a while. This would likely be okay, as long as the logs weren't growing so quickly that Heka would never catch up.

Unfortunately I don't have enough info to be able to figure out what's going on with the memory usage. We've not seen runaway memory usage in our deployments at Mozilla, and I haven't yet been able to reproduce the memory leak that you were seeing with the DashboardOutput.
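As an aside, for anyone unfamiliar with the idea, the back pressure that `queue_full_action = block` is meant to provide can be illustrated with a small Python sketch. This is only an analogy under stated assumptions, not Heka's implementation: the names `run_pipeline`, `max_queue` and `send_delay` are made up for the example, and an in-memory bounded queue stands in for the TcpOutput's on-disk buffer.

```python
import queue
import threading
import time


def run_pipeline(n_records, max_queue, send_delay):
    """Feed n_records through a bounded buffer to a deliberately slow shipper.

    Hypothetical stand-ins: the loop below plays the log reader, the
    shipper thread plays the output, and max_queue plays the high water mark.
    """
    buf = queue.Queue(maxsize=max_queue)  # bounded buffer
    shipped = []
    high_water = 0

    def shipper():
        # Slow consumer: drains the buffer one record at a time.
        while True:
            item = buf.get()
            if item is None:  # sentinel: no more records
                return
            time.sleep(send_delay)  # simulate a slow network send
            shipped.append(item)

    worker = threading.Thread(target=shipper)
    worker.start()

    for record in range(n_records):
        # put() blocks while the buffer is full, so the fast reader is
        # paused until the shipper has drained some records.
        buf.put(record)
        high_water = max(high_water, buf.qsize())

    buf.put(None)
    worker.join()
    return shipped, high_water


if __name__ == "__main__":
    shipped, high_water = run_pipeline(n_records=50, max_queue=5, send_delay=0.001)
    print(len(shipped), "records shipped; buffer never held more than", high_water)
```

The point of the sketch is that nothing is lost and the buffer stays bounded; the cost is that the reader runs only as fast as the shipper, which is why this works only if the logs are not growing faster than they can be shipped.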
In a perfect world I'd be able to turn on memory profiling (see http://hekad.readthedocs.org/en/v0.9.1/config/index.html#global-configuration-options) and run Heka until the RAM usage got out of hand to see if that shed any light on the matter, but, alas, it's not a perfect world, and your production environment is hardly the right place to be performing such experiments.

Anyway, thanks for the post-mortem, and sorry again that you hit these issues.

-r

On 04/10/2015 09:08 AM, Matt Bostock wrote:
Following on from:
https://mail.mozilla.org/pipermail/heka/2015-April/000451.html

Thanks for taking the time to write the Lua plugin. We tried it in our Staging environment and we started receiving metrics again for high-traffic machines. We saw that the plugin exceeded the 8MB sandbox memory limit on some machines, but as you mentioned, that can easily be tweaked in the configuration.

Unfortunately, we saw hekad use significant amounts of memory and also buffer heavily to disk for the TCP remote plugin, such that it filled the partition on many of our boxes in that environment (easily fixed, as it's a staging environment).

Please see the attached graph of hekad memory usage for our Staging environment. The spike occurred when Heka was restarted to use the new Lua plugin; the drop-off occurred when we stopped Heka on all machines.

I wonder if the back pressure issue was related to the fact that Heka, by default, seems to try to read in all previous log entries. We initially worked around that issue by bootstrapping the journals to point to the end of each log file:
https://github.com/mozilla-services/heka/compare/dev...dcarley:logstreamer_bootstrap_journals

Due to current priorities and the amount of time we've gone without centralised metrics or logging for some machines, we have had to remove Heka and revert to our previous solution.

For anyone reading this who is considering Heka: I don't think we would discount looking at it again in future, and I personally hope we can use it again at some point. In our case, I think we may have adopted it slightly too early, and given our current priorities we are unable to spare the resources to troubleshoot and resolve the issues we saw.

Huge thanks to Rob and the Mozilla team, who have been extremely responsive and helpful.
Thanks,
Matt

--
Matt Bostock
Web Operations
Government Digital Service
e: [email protected]
t: +44 (0) 7917 173 573
a: 6th Floor, Aviation House, 125 Kingsway, London, WC2B 6NH

_______________________________________________
Heka mailing list
[email protected]
https://mail.mozilla.org/listinfo/heka

