On 05/19/2015 03:16 AM, bruno binet wrote:
On 18 May 2015 at 20:10, Rob Miller <[email protected]> wrote:
- The `injectRecycleChan` is completely empty. That means that all
100 of those packs are in use, and any filter that is trying to
inject a message will be blocked waiting for one of the packs in the
injection pool to be freed up.
Ok. For my information, what is the difference between
`inputRecycleChan` and `injectRecycleChan`?
Input plugins get their unpopulated message packs from the inputRecycleChan.
Filter plugins get theirs from the injectRecycleChan. When a message has been
recycled enough times that its reference count hits zero, its data is cleared
out and the pack is returned to the recycle channel from which it originated.
- Both the input and matcher channel for the
`MainRegexDispatchFilter` are completely full, which means that this
is the filter that is blocking everything. Probably this filter is
blocked trying to inject a message, which it can't do because the
injectRecycleChan is empty. Both channels being full accounts for 60
messages being tied up, which I'm guessing are 60 out of the 100 in
our injection pool.
For the MainRegexDispatchFilter, what is the difference between
InChanCapacity and MatchChanCapacity?
Every message matcher has a channel onto which the router places each message,
incrementing the message's reference count. The matchers pull from their
channels and check each message for a match. If there's no match, the message
is recycled, which decrements the ref count. If there is a match, the message
is placed on the corresponding filter or output plugin's channel. So the
MatchChan* values configure the channel leading up to the message matcher,
while the InChan* values configure the channel between the matcher and the
plugin.
This doesn't get us all the way there, though. I'm guessing that the
log output about '38 packs have been idle more than 120 seconds' is
a clue about where most of the other 40 packs are. I'd need to see
the code in those filters, and the config that's wiring it all
together, to dig further.
Here is the repository for the filters and configuration in use:
https://github.com/bbinet/heka-hl-sandboxes
And here is the MainRegexDispatchFilter code and config:
https://github.com/bbinet/heka-hl-sandboxes/blob/master/filters/regex_dispatch_metric.lua
https://github.com/bbinet/heka-hl-sandboxes/blob/master/toml/filter_dispatch.toml#L33-74
Do you see anything wrong here?
A superficial look is all I have time for, but at first glance I don't see
anything wrong, per se. But I do see some stuff that supports the deadlock
hypothesis. It looks like you have some filters emitting messages that other
filters are then consuming. Also, you have filters that are potentially
emitting several output messages for every single input message.
Those are absolutely reasonable things to be doing, but if the pool size is too
small they can lead to the situation I described. One filter might get a big
slice of scheduler time, during which it emits a bunch of messages to the
router. Then the router and the matchers get time, so the router and match
channel fill up. Then finally the second filter gets time, and while doing its
job it calls inject_message ten times, which means it's trying to pull ten
messages off the injectRecycleChan. If there aren't ten messages available, it
blocks, waiting for more to be freed. But they never will be freed, because
most of them are sitting on the match chan and input chan of that very filter.
Hence the deadlock.
The solution is to tweak the pool size and/or chan size so the number of
available messages is high enough that your use case won't consume them all and
get stuck.
Why does it work for ~40 min before it starts to fail?
There are a couple of possibilities. If the problem is packs not being
recycled, then there's a slow drip: first you lose one pack, then
another, then another, until after 40 minutes enough of them are
unavailable that pack exhaustion becomes a problem and you hit the
deadlock.
I don't think that's the case here, because it sometimes fails much sooner,
or much later, than 40 min.
Another possibility is that things are working fine until the Go
scheduler happens to land you in an unlucky place. Imagine that you
have a filter (filter A) that is injecting a lot of messages, all of
which will be picked up by another filter (filter B) that will in
turn inject more messages. If both filters are given about equal
time by the scheduler then everything will flow smoothly. But if the
first filter gets a longer chunk of time, then it might consume all
of the messages in the injection pool, most of which will be backed
up in filter B's input channels. Then the scheduler switches to
filter B, but it's now blocked because it needs a pack from the
injection pool, and there aren't any.
My guess is that second scenario is what you're hitting. If I'm
right, then your problem might go away if you just increase the
poolsize setting (see
https://hekad.readthedocs.org/en/latest/config/index.html#global-configuration-options).
Try cranking it up to 200 and see if that has any impact, and let us
know how it goes.
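For reference, those knobs live in the global [hekad] section of the config file; something like the following (poolsize raised as suggested; plugin_chansize shown at what I believe is its default of 30):

```toml
[hekad]
# Number of message packs in each recycle pool.
poolsize = 200
# Buffer size of the input channel in front of each plugin.
plugin_chansize = 30
```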
Yes, it seems I'm facing this second scenario. I increased the poolsize to
200 yesterday evening, and Heka has now been running smoothly for more than
17 hours.
Great!
One more question: if I don't want to increase the poolsize value (to
minimize the memory that Heka consumes), is it equivalent to decrease
the `plugin_chansize` value to 15?
Roughly, yes. The one thing to pay attention to is that reducing the chan size
too much will cause more scheduler churn, which might impact performance. In my
testing things seem to flow smoothly all the way down to a chan size of about
25, but below that you can start to see throughput drop a bit since context
switching happens more often. I'd try tweaking the pool size and chan size
settings until you find what works best for you.
Thanks a lot for your help!
Bruno
You're welcome!
-r
_______________________________________________
Heka mailing list
[email protected]
https://mail.mozilla.org/listinfo/heka