On Nov 12, 2013, at 1:02 PM, Leggett, Torrance I. <[email protected]> wrote:
> So indeed it is logstash causing the backup. If I do UDP, the backup doesn't
> happen - I haven't verified logstash is getting every single message, but it
> does seem to be getting the vast majority. If I do TCP in an action queue,
> that action queue starts backing up as the main queue did previously - it's
> getting some messages to logstash, but most seem to just be backing up in the
> queue. Restarting logstash doesn't seem to help at all - those queue files
> always seem to stay on disk once they're there.
Looking at your config, two things stand out to me:
1) There doesn't appear to be a ${MainMsg/Action}ResumeRetryCount setting.
Setting this to -1 will cause rsyslog to keep retrying delivery to a failed
[remote] action destination forever, and is probably what you want in this case.
2) The ${MainMsg/Action}QueueLowWatermark setting may be too low.
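For reference, those two settings would look something like this in the legacy directive syntax for a disk-assisted forwarding action (the queue file name, host, and port below are placeholders -- adjust to match your setup):

    $ActionResumeRetryCount -1         # retry forever instead of giving up after one failure
    $ActionQueueType LinkedList        # in-memory queue, disk-assisted via the FileName below
    $ActionQueueFileName fwd_logstash  # enables DA (disk-assisted) spooling for this action
    $ActionQueueHighWatermark 8000     # start writing to disk at 80% of the 10k default size
    $ActionQueueLowWatermark 2000      # stay in disk mode until back down to 20%
    *.* @@logstash.example.com:5544    # @@ = forward via plain TCP

Note the directives must appear before the action line they apply to; they reset to defaults afterward.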
When I was playing with rsyslog -> logstash -> elasticsearch I was running into
what might be a similar situation. First of all, I was regularly breaking
ElasticSearch (filling the local disk, running out of memory, etc.), so every
time that happened logstash delivery would hang, and the problem would
propagate back to the sending rsyslog and leave a bunch of DA files on disk.
But outside of that, "something" was causing the logstash feed to periodically
stop. When it did this, rsyslog would buffer to disk as expected, and the
incoming log rate visible at elasticsearch (via kibana) would appear to drop
dramatically (but be non-zero -- some logs were trickling in at a much lower
rate).
After some period of time, things would "unstick," the rsyslog backlog would
finally get delivered (and on-disk files removed except for the last one), and
elasticsearch would insert the new records in their correct chronological
location, back-filling the data in kibana.
This wound up being a consequence of rsyslog's high- and low- watermark
settings for this queue. When the number of messages in the in-memory queue
surged above the 80% (8000-message default) high water mark, rsyslog would
start writing the queue to disk and keep doing so until the in-memory queue
size dropped down to the 20%-full (2000-message default) low water mark. So
even though the short-term spike in message volume was over, it would continue
to send messages from mem->disk->mem->net until it got the queue all the way
back down to 20% full. This extra use of disk I/O dramatically reduced
throughput and explained the apparent drop in log volume seen by elasticsearch
and kibana in my case, at least.
Your high- and low-watermark numbers also appear to be 80% and 20% of the queue
size. So perhaps 1) rsyslog gives up trying to send after one failure and zero
retries (the default!), or 2) it really is delivering the backlog, just very
slowly. The disk-constrained outbound throughput may not be fast enough to
empty the queue before more incoming traffic gets appended to it, eventually
filling the filesystem.
If it's the latter case, then you could either A) put the DA queue files on
faster disk, or B) increase the LowWatermark to something more like 50-60% so
it gets out of write-to-disk mode faster.
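Concretely, with the default 10,000-message queue, option B is a one-line change
(the 60% figure here is illustrative, not a recommendation I've benchmarked):

    $ActionQueueHighWatermark 8000   # unchanged: spill to disk at 80% full
    $ActionQueueLowWatermark 6000    # leave disk mode at 60% full instead of 20%

That way rsyslog only has to drain 2,000 messages through the slower
mem->disk->mem->net path before reverting to the fast in-memory path.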
Some pstats output from the time leading up to and well into the problem would
be really helpful to get an idea of the message volume hitting the main message
queue and action queue, so you can tell where the messages are going.
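If you don't already have impstats loaded, something along these lines should
produce periodic counters (legacy directive syntax; the output path and
interval are just examples, and the severity filter will also catch any other
debug-level syslog messages):

    $ModLoad impstats
    $PStatInterval 60    # emit queue/action counters every 60 seconds
    $PStatSeverity 7     # log them at debug severity
    syslog.=debug /var/log/rsyslog-stats.log

The "enqueued", "full", and "discarded" counters per queue are the interesting
ones for this kind of backlog problem.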
- Dave
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE
THAT.