On Nov 12, 2013, at 1:02 PM, Leggett, Torrance I. <[email protected]> wrote:
> So indeed it is logstash causing the backup. If I do UDP, the backup doesn't 
> happen - I haven't verified logstash is getting every single message, but it 
> does seem to be getting the vast majority. If I do TCP in an action queue, 
> that action queue starts backing up as the main queue did previously - it's 
> getting some messages to logstash, but most seem to just be backing up in the 
> queue. Restarting logstash doesn't seem to help at all - those queue files 
> always seem to stay on disk once they're there.

Looking at your config, two things stand out to me:

1) There doesn't appear to be a ${MainMsg/Action}ResumeRetryCount setting.  
Setting this to -1 will cause rsyslog to keep retrying delivery to a failed 
[remote] action destination forever, and is probably what you want in this case.

2) The ${MainMsg/Action}QueueLowWatermark setting may be too low.
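For point 1), in legacy directive syntax the setting sits just before the action it applies to. A minimal sketch (the host/port in the forwarding line are placeholders, not from your config):

    # keep retrying a suspended action indefinitely instead of
    # discarding after the default number of retries
    $ActionResumeRetryCount -1
    *.* @@logstash-host:5544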

When I was playing with rsyslog -> logstash -> elasticsearch I was running into 
what might be a similar situation.  First of all, I was regularly breaking 
Elasticsearch (filling the local disk, running out of memory, etc.), so every 
time that happened logstash delivery would hang, and the problem would 
propagate back to the sending rsyslog and leave a bunch of DA files on disk.

But outside of that, "something" was causing the logstash feed to periodically 
stop. When it did this, rsyslog would buffer to disk as expected, and the 
incoming log rate visible at elasticsearch (via kibana) would appear to drop 
dramatically (but be non-zero -- some logs were trickling in at a much lower 
rate).

After some period of time, things would "unstick," the rsyslog backlog would 
finally get delivered (and on-disk files removed except for the last one), and 
elasticsearch would insert the new records in their correct chronological 
location, back-filling the data in kibana.
        
This wound up being a consequence of rsyslog's high- and low-watermark 
settings for this queue. When the number of messages in the in-memory queue 
surged above the 80% (8000-message default) high watermark, rsyslog would 
start writing the queue to disk and keep doing so until the in-memory queue 
size dropped down to the 20%-full (2000-message default) low watermark. So 
even though the short-term spike in message volume was over, it would continue 
to send messages from mem->disk->mem->net until it got the queue all the way 
back down to 20% full.  This extra disk I/O dramatically reduced throughput, 
which explained the apparent drop in log volume seen by elasticsearch and 
kibana, at least in my case.

Your high- and low-watermark numbers also appear to be 80% and 20% of the queue 
size.  So perhaps either 1) rsyslog gives up trying to send after one failure 
and zero retries (the default!), or 2) it really is delivering the backlog, 
just very slowly.  In the second case, the disk-constrained outbound 
throughput may not be fast enough to empty the queue before more incoming 
traffic is appended to it, eventually filling the filesystem.

If it's the latter case, then you could either A) put the DA queue files on 
faster disk, or B) increase the LowWatermark to something more like 50-60% so 
the queue gets out of write-to-disk mode sooner.
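Option B) in legacy directive form might look like the following; the numbers 
are only an illustration for a 10000-message action queue, not a 
recommendation for your volume:

    # stop disk-assisted draining once the queue is back to ~55% full,
    # rather than the 20% default
    $ActionQueueHighWatermark 8000
    $ActionQueueLowWatermark 5500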

Some pstats output from the time leading up to and well into the problem would 
be really helpful to get an idea of the message volume hitting the main message 
queue and action queue, so you can tell where the messages are going.
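If you don't already have pstats enabled, a minimal impstats setup in legacy 
syntax is something like this (the interval and severity values here are just 
examples):

    $ModLoad impstats
    # emit queue/action counters every 60 seconds at debug severity
    $PStatInterval 60
    $PStatSeverity 7

The periodic counters (enqueued, full, discarded, size) for the main and 
action queues are what would show where messages are piling up.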

- Dave
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of 
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE 
THAT.