On 18 May 2015 at 20:10, Rob Miller <[email protected]> wrote: > On 05/18/2015 09:46 AM, bruno binet wrote: > >> Hi all, >> >> I'm using Heka to decode, filter, aggregate, and encode metrics to file, >> mainly using lua sandboxes: at the beginning heka works perfectly, >> consuming around 60% cpu cycles, but after ~40min the heka process >> become idle (0-1% cpu) and the following diagnostic messages starts to >> appear: >> >> 2015/05/18 16:20:07 Diagnostics: 38 packs have been idle more than 120 >> seconds. >> 2015/05/18 16:20:07 Diagnostics: (input) Plugin names and quantities >> found on idle packs: >> 2015/05/18 16:20:07 Diagnostics: >> Trwebclient1sLastAggregateMetricFilter: 38 >> 2015/05/18 16:20:07 Diagnostics: MainRegexDispatchFilter: 32 >> 2015/05/18 16:20:07 >> 2015/05/18 16:20:07 Diagnostics: 25 packs have been idle more than 120 >> seconds. >> 2015/05/18 16:20:07 Diagnostics: (inject) Plugin names and quantities >> found on idle packs: >> 2015/05/18 16:20:07 Diagnostics: 3sLastAggregateMetricFilter: 23 >> 2015/05/18 16:20:07 Diagnostics: 60sMinAggregateMetricFilter: 1 >> 2015/05/18 16:20:07 Diagnostics: 60sLastAggregateMetricFilter: 24 >> 2015/05/18 16:20:07 >> > This message means that some messages are being delivered to one or more > of your output / filter plugins, but they're never being recycled and freed > up for reuse. There are a few reasons this could be happening. A poorly > coded plugin that doesn't correctly recycle packs is one possible cause, > but if you're mainly using SandboxFilters and aren't using any custom Go > plugins then that's pretty unlikely. > > More likely is that you've got a deadlock happening, where possibly all of > the packs from one of your pools are in use. Sometimes you'll get a > situation where a plugin is blocked trying to get a new pack to populate, > but most of the packs are tied up in the input channels for that very > plugin, so everything comes to a halt. >
I see, so this is likely a deadlock issue. > >> Hekad is configured with >> [hekad] >> max_message_loops = 10 >> maxprocs = 4 >> >> Here is a heka report after heka was started, when everything was ok: >> https://gist.github.com/bbinet/becc0ef3a92ad5883230 >> >> And here is a heka report after ~40min, when heka was complaining about >> the idle packs: >> https://gist.github.com/bbinet/031ea86838ae501368c4 >> >> Any idea of what is going on? >> > Here's what I can see from the second report: > > - The router input channel is completely full, which verifies that the > router is backed up, so no new messages can get to the router. > > - The `injectRecycleChan` is completely empty. That means that all 100 of > those packs are in use, and any filter that is trying to inject a message > will be blocked waiting for one of the packs in the injection pool to be > freed up. > Ok. For my information, what is the difference between `inputRecycleChan` and `injectRecycleChan`? > - Both the input and matcher channel for the `MainRegexDispatchFilter` are > completely full, which means that this is the filter that is blocking > everything. Probably this filter is blocked trying to inject a message, > which it can't do because the injectRecycleChan is empty. Both channels > being full accounts for 60 messages being tied up, which I'm guessing are > 60 out of the 100 in our injection pool. > For the MainRegexDispatchFilter, what is the difference between InChanCapacity and MatchChanCapacity? > This doesn't get us all the way there, though. I'm guessing that the log > output about '38 packs have been idle more than 120 seconds' is a clue > about where most of the other 40 packs are. I'd need to see the code in > those filters, and the config that's wiring it all together, to dig further. > Here is the repository for the filters and configuration in use: https://github.com/bbinet/heka-hl-sandboxes And here is the MainRegexDispatchFilter code and config: https://github.com/bbinet/heka-hl-sandboxes/blob/master/filters/regex_dispatch_metric.lua https://github.com/bbinet/heka-hl-sandboxes/blob/master/toml/filter_dispatch.toml#L33-74 Do you see anything wrong here? > Why does it work during ~40min before it starts to fail? >> > There are a couple of possibilities. If it's packs not being recycled > that's the problem, then this would mean there's a slow drip. First you > lose one pack, then another, then another, until after 40 minutes enough of > them are unavailable that pack exhaustion becomes a problem and you hit the > deadlock. > I think it's not the case here because it can fail sometimes much sooner or longer than 40min. > Another possibility is that things are working fine until the Go scheduler > happens to land you in an unlucky place. Imagine that you have a filter > (filter A) that is injecting a lot of messages, all of which will be picked > up by another filter (filter B) that will in turn inject more messages. If > both filters are given about equal time by the scheduler then everything > will flow smoothly. But if the first filter gets a longer chunk of time, > then it might consume all of the messages in the injection pool, most of > which will be backed up in filter B's input channels. Then the scheduler > switches to filter B, but it's now blocked because it needs a pack from the > injection pool, and there aren't any. > > My guess is that second scenario is what you're hitting. If I'm right, > then your problem might go away if you just increase the poolsize setting > (see > https://hekad.readthedocs.org/en/latest/config/index.html#global-configuration-options). > Try cranking it up to 200 and see if that has any impact, and let us know > how it goes. > Yes, it seems I'm facing this second scenario. I've increased the poolsize to 200 yesterday evening and Heka is now running smoothly for more than 17 hours. One more question: if I don't want to increase the poolsize value (to minimize the memory that heka consumes), is it equivalent to decrease the `plugin_chansize` value down to 15? Thanks a lot for your help! Bruno
_______________________________________________ Heka mailing list [email protected] https://mail.mozilla.org/listinfo/heka

