Thanks for sharing this Rob, I'll make a note so we know to check these options should we try Heka again in future. And thanks again for your help.
--
Matt Bostock
Web Operations
Government Digital Service

e: [email protected]
t: +44 (0) 7917 173 573
a: 6th Floor, Aviation House, 125 Kingsway, London, WC2B 6NH


On 10 April 2015 at 21:21, Rob Miller <[email protected]> wrote:

> I'm sorry to hear that you've still been having problems, and while I wish
> it weren't so, I definitely understand the need to table your efforts and
> revert.
>
> Even so, for posterity's sake, and in case you ever free up enough cycles
> to give it another go, there are a few settings I should mention that
> might have been helpful.
>
> The first is LogstreamerInput's `oldest_duration` setting (see
> http://hekad.readthedocs.org/en/v0.9.1/config/inputs/logstreamer.html).
> It lets you skip processing for files whose modified time is older than
> the specified duration. It's not exactly the same as what you were
> accomplishing with your bootstrapJournals setting, but if you set it to
> something very low (e.g. a few seconds) the first time you start Heka on
> a machine with a large backlog, you could probably achieve a similar
> result.
>
> The second is the pair of `queue_max_buffer_size` and `queue_full_action`
> settings added to the TcpOutput in 0.9 (see
> http://hekad.readthedocs.org/en/v0.9.1/config/outputs/tcp.html). With
> these you can stop the queue from growing past a given size, and specify
> what should happen when it hits that high-water mark. In your case, it
> seems the log processing was happening more quickly than the TcpOutput
> could ship the data. With `queue_full_action` set to `block`, back
> pressure would apply and log processing would pause while the TcpOutput
> drained the queue. This would likely be fine, as long as the logs weren't
> growing so quickly that Heka could never catch up.
>
> Unfortunately I don't have enough info to figure out what's going on with
> the memory usage.
> We've not seen runaway memory usage in our deployments at Mozilla, and I
> haven't yet been able to reproduce the memory leak you were seeing with
> the DashboardOutput. In a perfect world I'd turn on memory profiling (see
> http://hekad.readthedocs.org/en/v0.9.1/config/index.html#global-configuration-options)
> and run Heka until the RAM usage got out of hand, to see if that shed any
> light on the matter, but, alas, it's not a perfect world, and your
> production environment is hardly the right place to be performing such
> experiments.
>
> Anyway, thanks for the post-mortem, and sorry again that you hit these
> issues.
>
> -r
>
>
> On 04/10/2015 09:08 AM, Matt Bostock wrote:
>
>> Following on from:
>> https://mail.mozilla.org/pipermail/heka/2015-April/000451.html
>>
>> Thanks for taking the time to write the Lua plugin. We tried it in our
>> Staging environment and we started receiving metrics again for
>> high-traffic machines.
>>
>> We saw that the plugin exceeded the 8MB sandbox memory limit on some
>> machines, but as you mentioned, that can easily be tweaked in the
>> configuration.
>>
>> Unfortunately, we saw hekad use significant amounts of memory and also
>> buffer heavily to disk for the TCP remote plugin, such that it filled
>> the partition on many of our boxes in that environment (easily fixed,
>> as it's a staging environment).
>>
>> Please see the attached graph of hekad memory usage for our Staging
>> environment. The spike occurred when Heka was restarted to use the new
>> Lua plugin; the drop-off occurred when we stopped Heka on all machines.
>>
>> I wonder if the back-pressure issue was related to the fact that Heka,
>> by default, seems to try to read in all previous log entries. We
>> initially worked around that issue by bootstrapping the journals to
>> point to the end of each log file:
>> https://github.com/mozilla-services/heka/compare/dev...dcarley:logstreamer_bootstrap_journals
>>
>> Due to current priorities and the amount of time we've gone without
>> centralised metrics or logging for some machines, we have had to remove
>> Heka and revert to our previous solution.
>>
>> For anyone reading this who is considering Heka, I don't think we would
>> discount looking at it again in future. I personally hope we can use it
>> again at some point. In our case, I think we adopted it slightly too
>> early, and given our current priorities we are unable to spare the
>> resources to troubleshoot and resolve the issues we saw.
>>
>> Huge thanks to Rob and the Mozilla team, who have been extremely
>> responsive and helpful.
>>
>> Thanks,
>> Matt
>>
>> --
>>
>> Matt Bostock
>> Web Operations
>> Government Digital Service
>>
>> e: [email protected]
>> t: +44 (0) 7917 173 573
>> a: 6th Floor, Aviation House, 125 Kingsway, London, WC2B 6NH
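[Editor's note] For posterity, the two suggestions above (LogstreamerInput's `oldest_duration`, and the TcpOutput queue settings added in 0.9) might look roughly like this in a hekad TOML config. This is a sketch, not a tested recommendation: the section names, log paths, regex, address, and values are illustrative; check them against the linked v0.9.1 docs before use.

```toml
[syslog_input]
type = "LogstreamerInput"
log_directory = "/var/log"
file_match = 'syslog\.?(?P<Index>\d+)?'
# On first start against a machine with a large backlog, skip any file
# whose modified time is older than this, rather than replaying history.
oldest_duration = "5s"

[remote_tcp]
type = "TcpOutput"
address = "logs.example.com:5565"
# Cap the on-disk queue (bytes; 1 GiB here) so it can't fill a partition...
queue_max_buffer_size = 1073741824
# ...and apply back pressure (pause upstream processing) when the queue
# hits that high-water mark, instead of letting it grow unbounded.
queue_full_action = "block"
```

As Rob notes, `block` trades disk growth for pipeline stalls, which is only safe if the shipped rate can eventually catch up with the log growth rate.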
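[Editor's note] Likewise, the memory profiling Rob mentions and the 8MB sandbox memory limit Matt hit are both plain config knobs. A sketch, again with illustrative names: the `memprof` global option and the sandbox `memory_limit` setting are recalled from the 0.9-era docs, and the filter filename is hypothetical — verify both against the linked configuration pages.

```toml
[hekad]
# Write a pprof-style heap profile here, to inspect if RAM usage
# gets out of hand.
memprof = "/var/cache/hekad/memprof"

[lua_stats_filter]
type = "SandboxFilter"
filename = "lua_filters/example_filter.lua"
# Raise the sandbox memory ceiling (bytes) above the 8MB default
# for high-traffic machines.
memory_limit = 16777216
```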
_______________________________________________ Heka mailing list [email protected] https://mail.mozilla.org/listinfo/heka

