On 18 May 2015 at 20:10, Rob Miller <[email protected]> wrote:

> On 05/18/2015 09:46 AM, bruno binet wrote:
>
>> Hi all,
>>
>> I'm using Heka to decode, filter, aggregate, and encode metrics to file,
>> mainly using lua sandboxes: at the beginning heka works perfectly,
>> consuming around 60% cpu cycles, but after ~40min the heka process
>> become idle (0-1% cpu) and the following diagnostic messages starts to
>> appear:
>>
>> 2015/05/18 16:20:07 Diagnostics: 38 packs have been idle more than 120
>> seconds.
>> 2015/05/18 16:20:07 Diagnostics: (input) Plugin names and quantities
>> found on idle packs:
>> 2015/05/18 16:20:07 Diagnostics:
>> Trwebclient1sLastAggregateMetricFilter: 38
>> 2015/05/18 16:20:07 Diagnostics:        MainRegexDispatchFilter: 32
>> 2015/05/18 16:20:07
>> 2015/05/18 16:20:07 Diagnostics: 25 packs have been idle more than 120
>> seconds.
>> 2015/05/18 16:20:07 Diagnostics: (inject) Plugin names and quantities
>> found on idle packs:
>> 2015/05/18 16:20:07 Diagnostics:        3sLastAggregateMetricFilter: 23
>> 2015/05/18 16:20:07 Diagnostics:        60sMinAggregateMetricFilter: 1
>> 2015/05/18 16:20:07 Diagnostics:        60sLastAggregateMetricFilter: 24
>> 2015/05/18 16:20:07
>>
> This message means that some messages are being delivered to one or more
> of your output / filter plugins, but they're never being recycled and freed
> up for reuse. There are a few reasons this could be happening. A poorly
> coded plugin that doesn't correctly recycle packs is one possible cause,
> but if you're mainly using SandboxFilters and aren't using any custom Go
> plugins then that's pretty unlikely.
>
> More likely is that you've got a deadlock happening, where possibly all of
> the packs from one of your pools are in use. Sometimes you'll get a
> situation where a plugin is blocked trying to get a new pack to populate,
> but most of the packs are tied up in the input channels for that very
> plugin, so everything comes to a halt.
>

I see, so this is likely a deadlock issue.


>
>> Hekad is configured with
>> [hekad]
>> max_message_loops = 10
>> maxprocs = 4
>>
>> Here is a heka report after heka was started, when everything was ok:
>> https://gist.github.com/bbinet/becc0ef3a92ad5883230
>>
>> And here is a heka report after ~40min, when heka was complaining about
>> the idle packs:
>> https://gist.github.com/bbinet/031ea86838ae501368c4
>>
>> Any idea of what is going on?
>>
> Here's what I can see from the second report:
>
> - The router input channel is completely full, which verifies that the
> router is backed up, so no new messages can get to the router.
>
> - The `injectRecycleChan` is completely empty. That means that all 100 of
> those packs are in use, and any filter that is trying to inject a message
> will be blocked waiting for one of the packs in the injection pool to be
> freed up.
>

Ok. For my information, what is the difference between `inputRecycleChan`
and `injectRecycleChan`?


> - Both the input and matcher channel for the `MainRegexDispatchFilter` are
> completely full, which means that this is the filter that is blocking
> everything. Probably this filter is blocked trying to inject a message,
> which it can't do because the injectRecycleChan is empty. Both channels
> being full accounts for 60 messages being tied up, which I'm guessing are
> 60 out of the 100 in our injection pool.
>

For the MainRegexDispatchFilter, what is the difference between
InChanCapacity and MatchChanCapacity?


> This doesn't get us all the way there, though. I'm guessing that the log
> output about '38 packs have been idle more than 120 seconds' is a clue
> about where most of the other 40 packs are. I'd need to see the code in
> those filters, and the config that's wiring it all together, to dig further.
>

Here is the repository for the filters and configuration in use:
https://github.com/bbinet/heka-hl-sandboxes

And here is the MainRegexDispatchFilter code and config:
https://github.com/bbinet/heka-hl-sandboxes/blob/master/filters/regex_dispatch_metric.lua
https://github.com/bbinet/heka-hl-sandboxes/blob/master/toml/filter_dispatch.toml#L33-74

Do you see anything wrong here?


>  Why does it work during ~40min before it starts to fail?
>>
> There are a couple of possibilities. If it's packs not being recycled
> that's the problem, then this would mean there's a slow drip. First you
> lose one pack, then another, then another, until after 40 minutes enough of
> them are unavailable that pack exhaustion becomes a problem and you hit the
> deadlock.
>

I think it's not the case here because it can fail sometimes much sooner or
longer than 40min.


> Another possibility is that things are working fine until the Go scheduler
> happens to land you in an unlucky place. Imagine that you have a filter
> (filter A) that is injecting a lot of messages, all of which will be picked
> up by another filter (filter B) that will in turn inject more messages. If
> both filters are given about equal time by the scheduler then everything
> will flow smoothly. But if the first filter gets a longer chunk of time,
> then it might consume all of the messages in the injection pool, most of
> which will be backed up in filter B's input channels. Then the scheduler
> switches to filter B, but it's now blocked because it needs a pack from the
> injection pool, and there aren't any.
>
> My guess is that second scenario is what you're hitting. If I'm right,
> then your problem might go away if you just increase the poolsize setting
> (see
> https://hekad.readthedocs.org/en/latest/config/index.html#global-configuration-options).
> Try cranking it up to 200 and see if that has any impact, and let us know
> how it goes.
>

Yes, it seems I'm facing this second scenario. I've increased the poolsize
to 200 yesterday evening and Heka is now running smoothly for more than 17
hours.

One more question: if I don't want to increase the poolsize value (to
minimize the memory that heka consumes), is it equivalent to decrease the
`plugin_chansize` value down to 15?

Thanks a lot for your help!
Bruno
_______________________________________________
Heka mailing list
[email protected]
https://mail.mozilla.org/listinfo/heka

Reply via email to