Hi guys

I hope that someone here have some good ideas because I've run out of things to 
tweak.

We have a few rsyslog servers in the company, the basic setup is a couple of 
collectors/relays that forwards syslog to two archive servers.

We have run this setup for years but recently we have started pushing 
considerable (3-4 times as much) more syslog towards the servers and it appears 
they can't quite handle the extra load. The servers stop accepting TCP 
connections every now and then, for a minute or two and then works fine again 
without any pattern that I can see. I'm trying to figure out why and what I can 
do about it.

Each relay server receives a few thousand message per second at about 
30-40Mbit/s syslog and forwards that to two other rsyslog servers with a total 
of 60-80Mbit/s.

The relay servers are hovering around 5-7% CPU usage and little load around 0.3 
so I don't think it's a hardware limitation.


Notable parts of the config includes:

$MaxOpenFiles 81920 #there are usually only a few thousand open files, but 
currently this is what it sits at. I tried raising it to a MUCH higher number, 
didn't change anything

#we accept both UDP and TCP but it appears to be only TCP that is acting up so 
I'll concentrate on that.
module(
        load="imptcp"
        Threads="6" #was 2 when we had 4 CPU cores, now we have 8 CPU cores
)
input(
        type="imptcp"
        port="514"
        KeepAlive="on" #was running without probes, I have enabled them to try 
to keep the open (idle) connections down
        KeepAlive.Probes="3"
        KeepAlive.Interval="60"
)

#we forward to two severs, here's the config for one of them. We did have zip 
activated but I have disabled that, in an attempt to fix our issues and it 
didn't compress much anyways.
action(
        type="omfwd"
        target="REDACTED"
        port="6514"
        protocol="tcp"

        template="format_forward"

        queue.type="linkedlist"
        queue.filename="forward_to_archive"
        action.resumeRetryCount="-1"
        queue.saveOnShutdown="on"

        #compress syslog message during transfer to save bandwidth
#        ziplevel="1"
#        compression.mode="single"

        #encrypt syslog traffic
        StreamDriverMode="1" # run driver in TLS-only mode
        StreamDriverAuthMode="x509/name"
        StreamDriverPermittedPeers="REDACTED" #only permit the certificate from 
the verified syslog server
)


#I have added the impstats module but don't quiiiite know how to read the 
output, maybe it tells you guys something?

module(load="impstats"
        ResetCounters="on"
        interval="60"
        severity="7"
        log.syslog="off"
        log.file="/var/log/rsyslog-stats.log"
)

Tue Jul 16 09:23:13 2024: global: origin=dynstats
Tue Jul 16 09:23:13 2024: imuxsock: origin=imuxsock submitted=29 
ratelimit.discarded=0 ratelimit.numratelimiters=0
Tue Jul 16 09:23:13 2024: dynafile cache dynaFileDebug: origin=omfile 
requests=0 level0=0 missed=0 evicted=0 maxused=0 closetimeouts=0
Tue Jul 16 09:23:13 2024: action-0-builtin:omfile: origin=core.action 
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-1-builtin:omfwd: origin=core.action 
processed=248342 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-2-builtin:omfwd: origin=core.action 
processed=248342 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-3-builtin:omfile: origin=core.action 
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-4-builtin:omfile: origin=core.action 
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-5-builtin:omfile: origin=core.action 
processed=47 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-6-builtin:omfile: origin=core.action 
processed=74 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-7-builtin:omfile: origin=core.action 
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-8-builtin:omfile: origin=core.action 
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-9-builtin:omfile: origin=core.action 
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-10-builtin:omusrmsg: origin=core.action 
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: imudp(*:514): origin=imudp submitted=20084 
disallowed=0
Tue Jul 16 09:23:13 2024: imudp(*:514): origin=imudp submitted=0 disallowed=0
Tue Jul 16 09:23:13 2024: imptcp(*/514/IPv4): origin=imptcp submitted=180195 
sessions.opened=138 sessions.openfailed=130 sessions.closed=45 
bytes.received=38520208 bytes.decompressed=0
Tue Jul 16 09:23:13 2024: imptcp(*/514/IPv6): origin=imptcp submitted=0 
sessions.opened=0 sessions.openfailed=0 sessions.closed=0 bytes.received=0 
bytes.decompressed=0
Tue Jul 16 09:23:13 2024: imtcp(6514): origin=imtcp submitted=0
Tue Jul 16 09:23:13 2024: imjournal: origin=imjournal submitted=58 read=57 
discarded=0 failed=0 poll_failed=0 rotations=0 recovery_attempts=0 
ratelimit_discarded_in_interval=0 disk_usage_bytes=1543561216
Tue Jul 16 09:23:13 2024: resource-usage: origin=impstats utime=37793520044 
stime=49266959335 maxrss=308628 minflt=90210820 majflt=4472 inblock=4641624 
oublock=497968144 nvcsw=4271503687 nivcsw=272265814 openfiles=3150
Tue Jul 16 09:23:13 2024: action-1-builtin:omfwd queue[DA]: origin=core.queue 
size=0 enqueued=0 full=0 discarded.full=0 discarded.nf=0 maxqsize=287009
Tue Jul 16 09:23:13 2024: action-1-builtin:omfwd queue: origin=core.queue 
size=0 enqueued=248342 full=0 discarded.full=0 discarded.nf=0 maxqsize=1000
Tue Jul 16 09:23:13 2024: action-2-builtin:omfwd queue[DA]: origin=core.queue 
size=158847 enqueued=143360 full=0 discarded.full=0 discarded.nf=0 
maxqsize=1620216
Tue Jul 16 09:23:13 2024: action-2-builtin:omfwd queue: origin=core.queue 
size=1000 enqueued=248342 full=702 discarded.full=2 discarded.nf=0 maxqsize=1000
Tue Jul 16 09:23:13 2024: main Q: origin=core.queue size=54384 enqueued=202693 
full=0 discarded.full=2 discarded.nf=0 maxqsize=100000
Tue Jul 16 09:23:13 2024: io-work-q: origin=imptcp enqueued=723 maxqsize=7
Tue Jul 16 09:23:13 2024: imudp(w3): origin=imudp called.recvmmsg=3256 
called.recvmsg=0 msgs.received=3803
Tue Jul 16 09:23:13 2024: imudp(w0): origin=imudp called.recvmmsg=3855 
called.recvmsg=0 msgs.received=4109
Tue Jul 16 09:23:13 2024: imudp(w2): origin=imudp called.recvmmsg=4418 
called.recvmsg=0 msgs.received=4799
Tue Jul 16 09:23:13 2024: imudp(w1): origin=imudp called.recvmmsg=6247 
called.recvmsg=0 msgs.received=7373


Without knowing "exactly" how to read the logs, I would think that 
sessions.openfailed=130 is something bad? Anything else to keep a lookout for?


I have also tried to manually change ulimit for the rsyslogd process, to as 
high/unlimited numbers as I could, but that didn't change anything.



Best regards

Jesper Skou Jensen
_______________________________________________
rsyslog mailing list
https://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of 
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE 
THAT.

Reply via email to