On 05/18/2015 09:46 AM, bruno binet wrote:
Hi all,
I'm using Heka to decode, filter, aggregate, and encode metrics to file,
mainly using lua sandboxes: at the beginning heka works perfectly,
consuming around 60% cpu cycles, but after ~40min the heka process
become idle (0-1% cpu) and the following diagnostic messages starts to
appear:
2015/05/18 16:20:07 Diagnostics: 38 packs have been idle more than 120
seconds.
2015/05/18 16:20:07 Diagnostics: (input) Plugin names and quantities
found on idle packs:
2015/05/18 16:20:07 Diagnostics:
Trwebclient1sLastAggregateMetricFilter: 38
2015/05/18 16:20:07 Diagnostics: MainRegexDispatchFilter: 32
2015/05/18 16:20:07
2015/05/18 16:20:07 Diagnostics: 25 packs have been idle more than 120
seconds.
2015/05/18 16:20:07 Diagnostics: (inject) Plugin names and quantities
found on idle packs:
2015/05/18 16:20:07 Diagnostics: 3sLastAggregateMetricFilter: 23
2015/05/18 16:20:07 Diagnostics: 60sMinAggregateMetricFilter: 1
2015/05/18 16:20:07 Diagnostics: 60sLastAggregateMetricFilter: 24
2015/05/18 16:20:07
This message means that some messages are being delivered to one or more of
your output / filter plugins, but they're never being recycled and freed up for
reuse. There are a few reasons this could be happening. A poorly coded plugin
that doesn't correctly recycle packs is one possible cause, but if you're
mainly using SandboxFilters and aren't using any custom Go plugins then that's
pretty unlikely.
More likely is that you've got a deadlock happening, where possibly all of the
packs from one of your pools are in use. Sometimes you'll get a situation where
a plugin is blocked trying to get a new pack to populate, but most of the packs
are tied up in the input channels for that very plugin, so everything comes to
a halt.
Hekad is configured with
[hekad]
max_message_loops = 10
maxprocs = 4
Here is a heka report after heka was started, when everything was ok:
https://gist.github.com/bbinet/becc0ef3a92ad5883230
And here is a heka report after ~40min, when heka was complaining about
the idle packs:
https://gist.github.com/bbinet/031ea86838ae501368c4
Any idea of what is going on?
Here's what I can see from the second report:
- The router input channel is completely full, which verifies that the router
is backed up, so no new messages can get to the router.
- The `injectRecycleChan` is completely empty. That means that all 100 of those
packs are in use, and any filter that is trying to inject a message will be
blocked waiting for one of the packs in the injection pool to be freed up.
- Both the input and matcher channel for the `MainRegexDispatchFilter` are
completely full, which means that this is the filter that is blocking
everything. Probably this filter is blocked trying to inject a message, which
it can't do because the injectRecycleChan is empty. Both channels being full
accounts for 60 messages being tied up, which I'm guessing are 60 out of the
100 in our injection pool.
This doesn't get us all the way there, though. I'm guessing that the log output
about '38 packs have been idle more than 120 seconds' is a clue about where
most of the other 40 packs are. I'd need to see the code in those filters, and
the config that's wiring it all together, to dig further.
Why does it work during ~40min before it starts to fail?
There are a couple of possibilities. If it's packs not being recycled that's
the problem, then this would mean there's a slow drip. First you lose one pack,
then another, then another, until after 40 minutes enough of them are
unavailable that pack exhaustion becomes a problem and you hit the deadlock.
Another possibility is that things are working fine until the Go scheduler
happens to land you in an unlucky place. Imagine that you have a filter (filter
A) that is injecting a lot of messages, all of which will be picked up by
another filter (filter B) that will in turn inject more messages. If both
filters are given about equal time by the scheduler then everything will flow
smoothly. But if the first filter gets a longer chunk of time, then it might
consume all of the messages in the injection pool, most of which will be backed
up in filter B's input channels. Then the scheduler switches to filter B, but
it's now blocked because it needs a pack from the injection pool, and there
aren't any.
My guess is that second scenario is what you're hitting. If I'm right, then
your problem might go away if you just increase the poolsize setting (see
https://hekad.readthedocs.org/en/latest/config/index.html#global-configuration-options).
Try cranking it up to 200 and see if that has any impact, and let us know how
it goes.
-r
Thanks,
Bruno
_______________________________________________
Heka mailing list
[email protected]
https://mail.mozilla.org/listinfo/heka
_______________________________________________
Heka mailing list
[email protected]
https://mail.mozilla.org/listinfo/heka