Hi guys,

I hope someone here has some good ideas, because I've run out of things to tweak.
We have a few rsyslog servers in the company. The basic setup is a couple of collectors/relays that forward syslog to two archive servers. We have run this setup for years, but recently we started pushing considerably more syslog (3-4 times as much) towards the servers, and it appears they can't quite handle the extra load. Every now and then the servers stop accepting TCP connections for a minute or two and then work fine again, without any pattern that I can see. I'm trying to figure out why and what I can do about it.

Each relay server receives a few thousand messages per second, about 30-40 Mbit/s of syslog, and forwards that to two other rsyslog servers, for a total of 60-80 Mbit/s. The relay servers are hovering around 5-7% CPU usage with a low load of around 0.3, so I don't think it's a hardware limitation.

Notable parts of the config include:

$MaxOpenFiles 81920 #there are usually only a few thousand open files, but currently this is what it sits at. I tried raising it to a MUCH higher number, didn't change anything

#we accept both UDP and TCP but it appears to be only TCP that is acting up so I'll concentrate on that.
module(
    load="imptcp"
    Threads="6" #was 2 when we had 4 CPU cores, now we have 8 CPU cores
)

input(
    type="imptcp"
    port="514"
    KeepAlive="on" #was running without probes, I have enabled them to try to keep the open (idle) connections down
    KeepAlive.Probes="3"
    KeepAlive.Interval="60"
)

#we forward to two servers, here's the config for one of them. We did have zip activated but I have disabled that, in an attempt to fix our issues and it didn't compress much anyways.
action(
    type="omfwd"
    target="REDACTED"
    port="6514"
    protocol="tcp"
    template="format_forward"
    queue.type="linkedlist"
    queue.filename="forward_to_archive"
    action.resumeRetryCount="-1"
    queue.saveOnShutdown="on"
    #compress syslog message during transfer to save bandwidth
    # ziplevel="1"
    # compression.mode="single"
    #encrypt syslog traffic
    StreamDriverMode="1" # run driver in TLS-only mode
    StreamDriverAuthMode="x509/name"
    StreamDriverPermittedPeers="REDACTED" #only permit the certificate from the verified syslog server
)

#I have added the impstats module but don't quiiiite know how to read the output, maybe it tells you guys something?
module(load="impstats"
    ResetCounters="on"
    interval="60"
    severity="7"
    log.syslog="off"
    log.file="/var/log/rsyslog-stats.log"
)

Tue Jul 16 09:23:13 2024: global: origin=dynstats
Tue Jul 16 09:23:13 2024: imuxsock: origin=imuxsock submitted=29 ratelimit.discarded=0 ratelimit.numratelimiters=0
Tue Jul 16 09:23:13 2024: dynafile cache dynaFileDebug: origin=omfile requests=0 level0=0 missed=0 evicted=0 maxused=0 closetimeouts=0
Tue Jul 16 09:23:13 2024: action-0-builtin:omfile: origin=core.action processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-1-builtin:omfwd: origin=core.action processed=248342 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-2-builtin:omfwd: origin=core.action processed=248342 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-3-builtin:omfile: origin=core.action processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-4-builtin:omfile: origin=core.action processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-5-builtin:omfile: origin=core.action processed=47 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-6-builtin:omfile: origin=core.action processed=74 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-7-builtin:omfile: origin=core.action processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-8-builtin:omfile: origin=core.action processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-9-builtin:omfile: origin=core.action processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-10-builtin:omusrmsg: origin=core.action processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: imudp(*:514): origin=imudp submitted=20084 disallowed=0
Tue Jul 16 09:23:13 2024: imudp(*:514): origin=imudp submitted=0 disallowed=0
Tue Jul 16 09:23:13 2024: imptcp(*/514/IPv4): origin=imptcp submitted=180195 sessions.opened=138 sessions.openfailed=130 sessions.closed=45 bytes.received=38520208 bytes.decompressed=0
Tue Jul 16 09:23:13 2024: imptcp(*/514/IPv6): origin=imptcp submitted=0 sessions.opened=0 sessions.openfailed=0 sessions.closed=0 bytes.received=0 bytes.decompressed=0
Tue Jul 16 09:23:13 2024: imtcp(6514): origin=imtcp submitted=0
Tue Jul 16 09:23:13 2024: imjournal: origin=imjournal submitted=58 read=57 discarded=0 failed=0 poll_failed=0 rotations=0 recovery_attempts=0 ratelimit_discarded_in_interval=0 disk_usage_bytes=1543561216
Tue Jul 16 09:23:13 2024: resource-usage: origin=impstats utime=37793520044 stime=49266959335 maxrss=308628 minflt=90210820 majflt=4472 inblock=4641624 oublock=497968144 nvcsw=4271503687 nivcsw=272265814 openfiles=3150
Tue Jul 16 09:23:13 2024: action-1-builtin:omfwd queue[DA]: origin=core.queue size=0 enqueued=0 full=0 discarded.full=0 discarded.nf=0 maxqsize=287009
Tue Jul 16 09:23:13 2024: action-1-builtin:omfwd queue: origin=core.queue size=0 enqueued=248342 full=0 discarded.full=0 discarded.nf=0 maxqsize=1000
Tue Jul 16 09:23:13 2024: action-2-builtin:omfwd queue[DA]: origin=core.queue size=158847 enqueued=143360 full=0 discarded.full=0 discarded.nf=0 maxqsize=1620216
Tue Jul 16 09:23:13 2024: action-2-builtin:omfwd queue: origin=core.queue size=1000 enqueued=248342 full=702 discarded.full=2 discarded.nf=0 maxqsize=1000
Tue Jul 16 09:23:13 2024: main Q: origin=core.queue size=54384 enqueued=202693 full=0 discarded.full=2 discarded.nf=0 maxqsize=100000
Tue Jul 16 09:23:13 2024: io-work-q: origin=imptcp enqueued=723 maxqsize=7
Tue Jul 16 09:23:13 2024: imudp(w3): origin=imudp called.recvmmsg=3256 called.recvmsg=0 msgs.received=3803
Tue Jul 16 09:23:13 2024: imudp(w0): origin=imudp called.recvmmsg=3855 called.recvmsg=0 msgs.received=4109
Tue Jul 16 09:23:13 2024: imudp(w2): origin=imudp called.recvmmsg=4418 called.recvmsg=0 msgs.received=4799
Tue Jul 16 09:23:13 2024: imudp(w1): origin=imudp called.recvmmsg=6247 called.recvmsg=0 msgs.received=7373

Without knowing "exactly" how to read the logs, I would think that sessions.openfailed=130 is something bad? Anything else to keep a lookout for?

I have also tried to manually change ulimit for the rsyslogd process, to as high/unlimited numbers as I could, but that didn't change anything.

Best regards
Jesper Skou Jensen
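
Since the impstats output is hard to read by eye, here is a minimal Python sketch of one way to scan it for non-zero trouble counters. It is not part of rsyslog. The names parse_line, flag_problems and WATCHED are made up for illustration, and the choice of which counters to watch is an assumption rather than an official list. The sample lines are copied from the stats in this post.

```python
import re

# Timestamp prefix as it appears in the log.file output quoted in this post,
# e.g. "Tue Jul 16 09:23:13 2024: ".
PREFIX = re.compile(r"^\w{3} \w{3} [ \d]\d \d\d:\d\d:\d\d \d{4}: ")

# Counters treated as suspicious when non-zero. This selection is an
# assumption for illustration, not an official rsyslog list.
WATCHED = ("sessions.openfailed", "full", "discarded.full",
           "discarded.nf", "failed", "suspended")


def parse_line(line):
    """Split one impstats line into (object name, {counter: int})."""
    body = PREFIX.sub("", line.strip())
    name, sep, rest = body.partition(": origin=")
    if not sep:
        return None  # not a stats line in the expected shape
    counters = {}
    for token in rest.split()[1:]:  # first token is the origin value
        key, eq, value = token.partition("=")
        if eq and value.isdigit():
            counters[key] = int(value)
    return name, counters


def flag_problems(lines):
    """Return (name, counter, value) for every watched counter above zero."""
    problems = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        name, counters = parsed
        for key in WATCHED:
            if counters.get(key, 0) > 0:
                problems.append((name, key, counters[key]))
    return problems


SAMPLE = [
    "Tue Jul 16 09:23:13 2024: imptcp(*/514/IPv4): origin=imptcp "
    "submitted=180195 sessions.opened=138 sessions.openfailed=130 "
    "sessions.closed=45 bytes.received=38520208 bytes.decompressed=0",
    "Tue Jul 16 09:23:13 2024: action-2-builtin:omfwd queue: origin=core.queue "
    "size=1000 enqueued=248342 full=702 discarded.full=2 discarded.nf=0 "
    "maxqsize=1000",
    "Tue Jul 16 09:23:13 2024: action-1-builtin:omfwd: origin=core.action "
    "processed=248342 failed=0 suspended=0 suspended.duration=0 resumed=0",
]

if __name__ == "__main__":
    for name, key, value in flag_problems(SAMPLE):
        print(f"{name}: {key}={value}")
```

On the three sample lines this flags sessions.openfailed on the imptcp input plus full and discarded.full on the action-2 omfwd queue, and nothing for action-1. To run it against the real file, pass the lines of /var/log/rsyslog-stats.log to flag_problems instead of SAMPLE.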