Hello, I check ChangeLog and found 5.8.4 fixed some bug related to race condition as discussed in http://bugzilla.adiscon.com/show_bug.cgi?id=275
with patch protecting modification in the getMSGID(): http://git.adiscon.com/?p=rsyslog.git;a=commitdiff;h=d66f7b95e081ecb356cc230685c5b7c90f8590b4 I guess the situation is almost identical to what in my case: corruption is always at the APPNAME field and there is, I believe, a typo in line 1955 from getAPPNAME() effectively ruins the mutex proctection: 1947 /* rgerhards, 2005-11-24 1948 */ 1949 char *getAPPNAME(msg_t *pM, sbool bLockMutex) 1950 { 1951 uchar *pszRet; 1952 1953 assert(pM != NULL); 1954 if(bLockMutex == LOCK_MUTEX) 1955 MsgUnlock(pM); /* should be MsgLock(pM) */ 1956 prepareAPPNAME(pM, MUTEX_ALREADY_LOCKED); 1957 if(pM->pCSAPPNAME == NULL) 1958 pszRet = UCHAR_CONSTANT(""); 1959 else 1960 pszRet = rsCStrGetSzStrNoNULL(pM->pCSAPPNAME); 1961 if(bLockMutex == LOCK_MUTEX) 1962 MsgUnlock(pM); 1963 return (char*)pszRet; 1964 } Similar cases were spotted in getPROCID(), getStructuredData() and getProgramName(). Hopefully the following patch against 5.8.6 tarball should solve the problem, I will report back later. --- rsyslog-5.8.6/runtime/msg.c.orig 2011-10-21 17:53:02.000000000 +0800 +++ rsyslog-5.8.6/runtime/msg.c 2012-01-05 21:54:26.594666510 +0800 @@ -1609,7 +1609,7 @@ ISOBJ_TYPE_assert(pM, msg); if(bLockMutex == LOCK_MUTEX) - MsgUnlock(pM); + MsgLock(pM); preparePROCID(pM, MUTEX_ALREADY_LOCKED); if(pM->pCSPROCID == NULL) pszRet = UCHAR_CONSTANT(""); @@ -1846,7 +1846,7 @@ { uchar *pszRet; - MsgUnlock(pM); + MsgLock(pM); if(pM->pCSStrucData == NULL) pszRet = UCHAR_CONSTANT("-"); else @@ -1894,7 +1894,7 @@ uchar *pszRet; if(bLockMutex == LOCK_MUTEX) - MsgUnlock(pM); + MsgLock(pM); prepareProgramName(pM, MUTEX_ALREADY_LOCKED); if(pM->pCSProgName == NULL) pszRet = UCHAR_CONSTANT(""); @@ -1952,7 +1952,7 @@ assert(pM != NULL); if(bLockMutex == LOCK_MUTEX) - MsgUnlock(pM); + MsgLock(pM); prepareAPPNAME(pM, MUTEX_ALREADY_LOCKED); if(pM->pCSAPPNAME == NULL) pszRet = UCHAR_CONSTANT(""); By the way, many crashes led to the convergence, for example //// db1 core.26581 signal 11, Segmentation fault. (gdb) p *pThis $1 = { pBuf = 0x7f7f5c000cd0 "sshd\177\177", pszBuf = 0x7f7f5c000dd0 "sshd", iBufSize = 4, iStrLen = 4 } (gdb) where #0 0x000000000041e6a9 in rsCStrGetSzStr (pThis=0x131cb10) at stringbuf.c:334 #1 0x000000000041b1bf in getAPPNAME (pM=0x131c140, bLockMutex=1 '\001') at msg.c:1960 #2 0x000000000041c0f8 in MsgGetProp (pMsg=0x131c140, pTpe=0x12ef350, propID=115 's', pPropLen=0x43fb8f00, pbMustBeFreed=0xfefefefefefefeff) at msg.c:2429 #3 0x000000000043be88 in tplToString (pTpl=0x12eeca0, pMsg=0x131c140, ppBuf=0x12f7910, pLenBuf=0x12f7930) at ../template.c:119 #4 0x0000000000438498 in prepareBatch (pAction=0x12f72d0, pBatch=0x12f7898, pbShutdownImmediate=<value optimized out>) at ../action.c:738 #5 processBatchMain (pAction=0x12f72d0, pBatch=0x12f7898, pbShutdownImmediate=<value optimized out>) at ../action.c:1163 #6 0x000000000043190d in ConsumerReg (pThis=0x12f73f0, pWti=0x12f7870) at queue.c:1705 #7 0x000000000042b0bd in wtiWorker (pThis=0x12f7870) at wti.c:315 #8 0x000000000042abd8 in wtpWorker (arg=0x12f7870) at wtp.c:387 #9 0x00000031d5a06367 in __nptl_setxid () from /lib64/libpthread.so.0 #10 0x0000000000000000 in ?? () Thanks, Kaiwang 2011/12/31 Kaiwang Chen <[email protected]>: > Even worse case > <6>1 2011-12-31T12:04:05.036812+08:00 squid6 kernel - - imklog 5.8.6, > log source = /proc/kmsg started. > ...several SYN flooding entries in the same second... > *** glibc detected *** /sbin/rsyslogd: free(): invalid next size > (fast): 0x00007fafb8000e00 *** > *** glibc detected *** /sbin/rsyslogd: malloc(): memory corruption: > 0x00007fafb8000e20 *** > > And possibly hours later the host was not responsive to ssh > connection requests, and even TTY login attempts failed... > Interestingly the debug log in another file continues a little beyond: > > Debug line with all properties: > FROMHOST: 'squid6', fromhost-ip: '127.0.0.1', HOSTNAME: 'squid6', PRI: 47, > syslogtag 'rsyslogd-pstats:', programname: 'rsyslogd-pstats', > APP-NAME: 'rsyslogd-pstats', PROCID: '', MSGID: '-', > TIMESTAMP: 'Dec 31 12:24:05', STRUCTURED-DATA: '-', > msg: 'main Q[DA]: size=0 enqueued=0 full=0 maxqsize=0 ' > escaped msg: 'main Q[DA]: size=0 enqueued=0 full=0 maxqsize=0 ' > inputname: impstats rawmsg: 'main Q[DA]: size=0 enqueued=0 full=0 maxqsize=0 ' > > Debug line with all properties: > FROMHOST: 'squid6', fromhost-ip: '127.0.0.1', HOSTNAME: 'squid6', PRI: 47, > syslogtag 'rsyslogd-pstats:', programname: 'rsyslogd-pstats', > APP-NAME: 'rsyslogd-pstats', PROCID: '', MSGID: '-', > TIMESTAMP: 'Dec 31 12:24:05', STRUCTURED-DATA: '-', > msg: 'main Q: size=4 enqueued=30 full=0 maxqsize=5 ' > escaped msg: 'main Q: size=4 enqueued=30 full=0 maxqsize=5 ' > inputname: impstats rawmsg: 'main Q: size=4 enqueued=30 full=0 maxqsize=5 ' > > Fortunately the squid service on that host was not badly affected. So > I did graceful failover and reboot the host, everything became OK. > Although there was no direct timestamp associated, I guess the problem > occured soon after rsyslogd started. I also spotted several corrupted > messages on other hosts when rsyslogd started. They are same version: > > # rsyslogd -v > rsyslogd 5.8.6, compiled with: > FEATURE_REGEXP: Yes > FEATURE_LARGEFILE: No > GSSAPI Kerberos 5 support: Yes > FEATURE_DEBUG (debug build, slow code): No > 32bit Atomic operations supported: Yes > 64bit Atomic operations supported: Yes > Runtime Instrumentation (slow code): No > > > Thanks, > Kaiwang > > 2011/12/31 Kaiwang Chen <[email protected]>: >> Also spotted corrupted messages from imuxsock: >> >> Debug line with all properties: >> FROMHOST: 'squid9', fromhost-ip: '127.0.0.1', HOSTNAME: 'squid9', PRI: 47, >> syslogtag 'rsyslogd-pstats:', programname: '<88>+?<99>2 >> <88>+?<99>rssyslogd-pstats', APP-NAME: '', PROCID: '', MSGID: '-', >> TIMESTAMP: 'Dec 30 21:46:14', STRUCTURED-DATA: '-', >> msg: 'imuxsock: submitted=429 ratelimit.discarded=0 >> ratelimit.numratelimiters=213 ' >> escaped msg: 'imuxsock: submitted=429 ratelimit.discarded=0 >> ratelimit.numratelimiters=213 ' >> inputname: impstats rawmsg: 'imuxsock: submitted=429 >> ratelimit.discarded=0 ratelimit.numratelimiters=213 ' >> >> >> Thanks, >> Kaiwang >> >> 2011/12/27 Kaiwang Chen <[email protected]>: >>> 2011/12/25 <[email protected]>: >>>> On Fri, 23 Dec 2011, Kaiwang Chen wrote: >>>> >>>>> Hi all, >>>>> >>>>> I found rsyslogd occasionally produced corrupted log entries like >>>>> >>>>> <6>1 2011-12-23T23:03:18.089938+08:00 db1 <D0>#001 D^kernel - - >>>>> device eth0 entered promiscuous mode >>>>> >>>>> I believe that problem appeared in earlier versions including 5.8.2. >>>>> Looks like corruption never occur before hostname field. Is it a >>>>> reported bug? Any clue? >>>> >>>> >>>> the question is if this problem is in the raw message being sent to you, or >>>> is it something added by the rsyslog processing. >>>> >>>> I would suggest adding something along the following line. >>>> >>>> :rawmesg, :contains, "#001" /var/log/badmessages:RSYSLOG_DEBUG >>>> >>>> this will look for any messages with the hex 01 character in them and spit >>>> out all the info that you can use about the message in to the file >>>> /var/log/badmessages >>>> >>>> run this for a little while and look to see what the raw message that is >>>> being received over the wire looks like. If it's bad, then you need to look >>>> at the sender. If the raw message looks sane, but rsyslog isn't handling it >>>> right, then we can troubleshoot from there. >>> >>> Looks like it's related to tcp transmission. On the original server >>> that get input from unix socket, it is OK; while on the centeral log >>> server, the raw message is already corrupted. >>> >>> ===== on central log server >>> <30>1 2011-12-27T14:06:56+08:00 gw71 snmpd 24254 - - Connection from >>> UDP: [172.25.0.230]:53547 >>> <30>1 2011-12-27T14:06:56+08:00 gw71 snmpd 24254 - - Received SNMP >>> packet(s) from UDP: [172.25.0.230]:53547 >>> <30>1 2011-12-27T14:06:56+08:00 gw71 snmpd 2#032#025<D5>1 - - >>> Connection from UDP: [172.25.0.230]:53547 >>> >>> Debug line with all properties: >>> FROMHOST: '172.25.0.71', fromhost-ip: '172.25.0.71', HOSTNAME: 'gw71', PRI: >>> 30, >>> syslogtag 'snmpd[24254]', programname: 'snmpd', APP-NAME: 'snmpd', >>> PROCID: '24254', MSGID: '-', >>> TIMESTAMP: 'Dec 27 14:06:56', STRUCTURED-DATA: '-', >>> msg: ' Connection from UDP: [172.25.0.230]:53547' >>> escaped msg: ' Connection from UDP: [172.25.0.230]:53547' >>> inputname: imptcp rawmsg: '<30>1 2011-12-27T14:06:56+08:00 gw71 snmpd >>> 24254 - - Connection from UDP: [172.25.0.230]:53547' >>> >>> Debug line with all properties: >>> FROMHOST: '172.25.0.71', fromhost-ip: '172.25.0.71', HOSTNAME: 'gw71', PRI: >>> 30, >>> syslogtag 'snmpd[24254]', programname: 'snmpd', APP-NAME: 'snmpd', >>> PROCID: '24254', MSGID: '-', >>> TIMESTAMP: 'Dec 27 14:06:56', STRUCTURED-DATA: '-', >>> msg: ' Received SNMP packet(s) from UDP: [172.25.0.230]:53547' >>> escaped msg: ' Received SNMP packet(s) from UDP: [172.25.0.230]:53547' >>> inputname: imptcp rawmsg: '<30>1 2011-12-27T14:06:56+08:00 gw71 snmpd >>> 24254 - - Received SNMP packet(s) from UDP: [172.25.0.230]:53547' >>> >>> Debug line with all properties: >>> FROMHOST: '172.25.0.71', fromhost-ip: '172.25.0.71', HOSTNAME: 'gw71', PRI: >>> 30, >>> syslogtag 'snmpd[2#032#025<D5>1]', programname: 'snmpd', APP-NAME: >>> 'snmpd', PROCID: '2#032#025<D5>1', MSGID: '-', >>> TIMESTAMP: 'Dec 27 14:06:56', STRUCTURED-DATA: '-', >>> msg: ' Connection from UDP: [172.25.0.230]:53547' >>> escaped msg: ' Connection from UDP: [172.25.0.230]:53547' >>> inputname: imptcp rawmsg: '<30>1 2011-12-27T14:06:56+08:00 gw71 snmpd >>> 2#032#025<D5>1 - - Connection from UDP: [172.25.0.230]:53547' >>> >>> ===== on the orginal host (*.* @@(o)172.25.0.230:514) >>> <30>1 2011-12-27T14:06:56+08:00 gw71 snmpd 24254 - - Connection from >>> UDP: [172.25.0.230]:53547 >>> <30>1 2011-12-27T14:06:56+08:00 gw71 snmpd 24254 - - Received SNMP >>> packet(s) from UDP: [172.25.0.230]:53547 >>> <30>1 2011-12-27T14:06:56+08:00 gw71 snmpd 24254 - - Connection from >>> UDP: [172.25.0.230]:53547 >>> >>> Debug line with all properties: >>> FROMHOST: 'gw71', fromhost-ip: '127.0.0.1', HOSTNAME: 'gw71', PRI: 30, >>> syslogtag 'snmpd[24254]:', programname: 'snmpd', APP-NAME: 'snmpd', >>> PROCID: '24254', MSGID: '-', >>> TIMESTAMP: 'Dec 27 14:06:56', STRUCTURED-DATA: '-', >>> msg: ' Connection from UDP: [172.25.0.230]:53547' >>> escaped msg: ' Connection from UDP: [172.25.0.230]:53547' >>> inputname: imuxsock rawmsg: '<30>Dec 27 14:06:56 snmpd[24254]: >>> Connection from UDP: [172.25.0.230]:53547' >>> >>> Debug line with all properties: >>> FROMHOST: 'gw71', fromhost-ip: '127.0.0.1', HOSTNAME: 'gw71', PRI: 30, >>> syslogtag 'snmpd[24254]:', programname: 'snmpd', APP-NAME: 'snmpd', >>> PROCID: '24254', MSGID: '-', >>> TIMESTAMP: 'Dec 27 14:06:56', STRUCTURED-DATA: '-', >>> msg: ' Received SNMP packet(s) from UDP: [172.25.0.230]:53547' >>> escaped msg: ' Received SNMP packet(s) from UDP: [172.25.0.230]:53547' >>> inputname: imuxsock rawmsg: '<30>Dec 27 14:06:56 snmpd[24254]: >>> Received SNMP packet(s) from UDP: [172.25.0.230]:53547' >>> >>> Debug line with all properties: >>> FROMHOST: 'gw71', fromhost-ip: '127.0.0.1', HOSTNAME: 'gw71', PRI: 30, >>> syslogtag 'snmpd[24254]:', programname: 'snmpd', APP-NAME: 'snmpd', >>> PROCID: '24254', MSGID: '-', >>> TIMESTAMP: 'Dec 27 14:06:56', STRUCTURED-DATA: '-', >>> msg: ' Connection from UDP: [172.25.0.230]:53547' >>> escaped msg: ' Connection from UDP: [172.25.0.230]:53547' >>> inputname: imuxsock rawmsg: '<30>Dec 27 14:06:56 snmpd[24254]: >>> Connection from UDP: [172.25.0.230]:53547' >>> >>> >>> Thanks, >>> Kaiwang >>>> >>>> David Lang >>>> >>>> > rsyslogd 5.8.6, compiled with: >>>>> >>>>> FEATURE_REGEXP: Yes >>>>> FEATURE_LARGEFILE: No >>>>> GSSAPI Kerberos 5 support: Yes >>>>> FEATURE_DEBUG (debug build, slow code): No >>>>> 32bit Atomic operations supported: Yes >>>>> 64bit Atomic operations supported: Yes >>>>> Runtime Instrumentation (slow code): No >>>>> >>>>> /etc/rsyslog.conf >>>>> $ActionFileDefaultTemplate RSYSLOG_SyslogProtocol23Format >>>>> $ActionForwardDefaultTemplate RSYSLOG_SyslogProtocol23Format >>>>> >>>>> $ModLoad imklog >>>>> $ModLoad imuxsock >>>>> $ModLoad impstats >>>>> >>>>> $SystemLogSocketIgnoreMsgTimestamp off >>>>> $SystemLogUsePIDFromSystem on >>>>> >>>>> $PStatInterval 600 >>>>> $PStatSeverity 7 >>>>> >>>>> $WorkDirectory /var/spool/rsyslog >>>>> >>>>> $MainMsgQueueSaveOnShutdown on >>>>> $MainMsgQueueFileName mq >>>>> $MainMsgQueueMaxFileSize 5m >>>>> >>>>> $ActionQueueType LinkedList >>>>> $ActionQueueSaveOnShutdown on >>>>> $ActionQueueFileName dbq >>>>> $ActionQueueMaxFileSize 10m >>>>> $ActionResumeRetryCount -1 >>>>> *.* @@(o)10.2.3.4 >>>>> >>>>> >>>>> # Log all kernel messages to the console. >>>>> # Logging much else clutters up the screen. >>>>> #kern.* /dev/console >>>>> >>>>> # Log anything (except mail) of level info or higher. >>>>> # Don't log private authentication messages! >>>>> *.info;mail.none;authpriv.none;cron.none /var/log/messages >>>>> >>>>> # The authpriv file has restricted access. >>>>> authpriv.* /var/log/secure >>>>> >>>>> # Log all the mail messages in one place. >>>>> mail.* -/var/log/maillog >>>>> >>>>> >>>>> # Log cron stuff >>>>> cron.* /var/log/cron >>>>> >>>>> # Everybody gets emergency messages >>>>> *.emerg * >>>>> >>>>> # Save news errors of level crit and higher in a special file. >>>>> uucp,news.crit /var/log/spooler >>>>> >>>>> # Save boot messages also to boot.log >>>>> local7.* /var/log/boot.log >>>>> >>>>> >>>>> Thanks, >>>>> Kaiwang >>>>> _______________________________________________ >>>>> rsyslog mailing list >>>>> http://lists.adiscon.net/mailman/listinfo/rsyslog >>>>> http://www.rsyslog.com/professional-services/ >>>>> >>>> _______________________________________________ >>>> rsyslog mailing list >>>> http://lists.adiscon.net/mailman/listinfo/rsyslog >>>> http://www.rsyslog.com/professional-services/ _______________________________________________ rsyslog mailing list http://lists.adiscon.net/mailman/listinfo/rsyslog http://www.rsyslog.com/professional-services/

