On 05/18/2015 09:46 AM, bruno binet wrote:
Hi all,

I'm using Heka to decode, filter, aggregate, and encode metrics to file,
mainly using lua sandboxes: at the beginning heka works perfectly,
consuming around 60% cpu cycles, but after ~40min the heka process
become idle (0-1% cpu) and the following diagnostic messages starts to
appear:

2015/05/18 16:20:07 Diagnostics: 38 packs have been idle more than 120
seconds.
2015/05/18 16:20:07 Diagnostics: (input) Plugin names and quantities
found on idle packs:
2015/05/18 16:20:07 Diagnostics:
Trwebclient1sLastAggregateMetricFilter: 38
2015/05/18 16:20:07 Diagnostics:        MainRegexDispatchFilter: 32
2015/05/18 16:20:07
2015/05/18 16:20:07 Diagnostics: 25 packs have been idle more than 120
seconds.
2015/05/18 16:20:07 Diagnostics: (inject) Plugin names and quantities
found on idle packs:
2015/05/18 16:20:07 Diagnostics:        3sLastAggregateMetricFilter: 23
2015/05/18 16:20:07 Diagnostics:        60sMinAggregateMetricFilter: 1
2015/05/18 16:20:07 Diagnostics:        60sLastAggregateMetricFilter: 24
2015/05/18 16:20:07
This message means that some messages are being delivered to one or more of 
your output / filter plugins, but they're never being recycled and freed up for 
reuse. There are a few reasons this could be happening. A poorly coded plugin 
that doesn't correctly recycle packs is one possible cause, but if you're 
mainly using SandboxFilters and aren't using any custom Go plugins then that's 
pretty unlikely.

More likely is that you've got a deadlock happening, where possibly all of the 
packs from one of your pools are in use. Sometimes you'll get a situation where 
a plugin is blocked trying to get a new pack to populate, but most of the packs 
are tied up in the input channels for that very plugin, so everything comes to 
a halt.

Hekad is configured with
[hekad]
max_message_loops = 10
maxprocs = 4

Here is a heka report after heka was started, when everything was ok:
https://gist.github.com/bbinet/becc0ef3a92ad5883230

And here is a heka report after ~40min, when heka was complaining about
the idle packs:
https://gist.github.com/bbinet/031ea86838ae501368c4

Any idea of what is going on?
Here's what I can see from the second report:

- The router input channel is completely full, which verifies that the router 
is backed up, so no new messages can get to the router.

- The `injectRecycleChan` is completely empty. That means that all 100 of those 
packs are in use, and any filter that is trying to inject a message will be 
blocked waiting for one of the packs in the injection pool to be freed up.

- Both the input and matcher channel for the `MainRegexDispatchFilter` are 
completely full, which means that this is the filter that is blocking 
everything. Probably this filter is blocked trying to inject a message, which 
it can't do because the injectRecycleChan is empty. Both channels being full 
accounts for 60 messages being tied up, which I'm guessing are 60 out of the 
100 in our injection pool.

This doesn't get us all the way there, though. I'm guessing that the log output 
about '38 packs have been idle more than 120 seconds' is a clue about where 
most of the other 40 packs are. I'd need to see the code in those filters, and 
the config that's wiring it all together, to dig further.
Why does it work during ~40min before it starts to fail?
There are a couple of possibilities. If it's packs not being recycled that's 
the problem, then this would mean there's a slow drip. First you lose one pack, 
then another, then another, until after 40 minutes enough of them are 
unavailable that pack exhaustion becomes a problem and you hit the deadlock.

Another possibility is that things are working fine until the Go scheduler 
happens to land you in an unlucky place. Imagine that you have a filter (filter 
A) that is injecting a lot of messages, all of which will be picked up by 
another filter (filter B) that will in turn inject more messages. If both 
filters are given about equal time by the scheduler then everything will flow 
smoothly. But if the first filter gets a longer chunk of time, then it might 
consume all of the messages in the injection pool, most of which will be backed 
up in filter B's input channels. Then the scheduler switches to filter B, but 
it's now blocked because it needs a pack from the injection pool, and there 
aren't any.

My guess is that second scenario is what you're hitting. If I'm right, then 
your problem might go away if you just increase the poolsize setting (see 
https://hekad.readthedocs.org/en/latest/config/index.html#global-configuration-options).
 Try cranking it up to 200 and see if that has any impact, and let us know how 
it goes.

-r

Thanks,
Bruno


_______________________________________________
Heka mailing list
[email protected]
https://mail.mozilla.org/listinfo/heka


_______________________________________________
Heka mailing list
[email protected]
https://mail.mozilla.org/listinfo/heka

Reply via email to