Args.. Step 3 is "gdb" not "bdb"... ----- Ursprüngliche Nachricht ----- Von: "Rainer Gerhards" <[email protected]> An: "rsyslog-users" <[email protected]> Gesendet: 25.08.09 18:30 Betreff: Re: [rsyslog] abort in 4.2.1
First shot at it: 1. Make sure core dump is written (ulimit -c 999999999) 2. Have it abort 3. bdb /path/to/binary/tsyslogd 4. Core name-of-corefile (usually /core.SOMENBR) 5. Enter: bt (for backtrace) 6. Enter: info thread (displays threads) 7. For each thread: 7a. Thread number 7b. Bt 8. You are done (ctl-d) Step 7 is necessary because the default bt does not necessarily point to the abort thread (some times it does, some times not...) rainer ----- Ursprüngliche Nachricht ----- Von: "[email protected]" <[email protected]> An: "rsyslog-users" <[email protected]> Gesendet: 25.08.09 17:59 Betreff: Re: [rsyslog] abort in 4.2.1 On Tue, 25 Aug 2009, Rainer Gerhards wrote: > Mmhhh... Unfortunately, this does not show anything immediately obvious. > Could you provide me with a gdb backtrace of the abort? Knowing where it > aborted often helps... I don't know how to do this. David Lang > rainer > > ----- Urspr?ngliche Nachricht ----- > Von: "[email protected]" <[email protected]> > An: "rsyslog-users" <[email protected]> > Gesendet: 25.08.09 17:16 > Betreff: Re: [rsyslog] abort in 4.2.1 > > On Tue, 25 Aug 2009, Rainer Gerhards wrote: > >> Date: Tue, 25 Aug 2009 16:44:26 +0200 >> From: Rainer Gerhards <[email protected]> >> Reply-To: rsyslog-users <[email protected]> >> To: rsyslog-users <[email protected]> >> Subject: Re: [rsyslog] abort in 4.2.1 >> >> Ok that is good info. I'll still standby for the debug log, but if that >> doesn't show anything I'll probably look into crafting some small tools >> to create a similiar environment. Do the malformed messages theselv come >> in in burts (potentially without wellformed in between)? > > the ones from the cron job definantly come in bursts, but even after I had > them modify that script to make those messages well-formed I still had it > die (at the moment I had them revert that script to assist in this > debugging > > here is the tail of the debug log (with the messages themselves lightly > sanitized) > > note that the debug log was _very_ large > > -rw-r--r-- 1 root root 2010546482 Aug 24 21:32 rsyslog.debug > > like the prior debugs, this dies on one of the malformed messages > > 9570.652786352:418d6950: msg parser: flags 30, from '192.168.242.15', msg > '<5>iaalog[143336]: AIB|AAAAA|2009/08/24 17:12:48|mfa > challenge|XXXXXXXXX|XXX.XX.XX.XXX|Challenge Question(s)|Challenge > Presented|None|N/A|N/A|N/A' > 9570.652794351:418d6950: Message has legacy syslog format. > 9570.652803191:418d6950: Called action, logging to builtin-file > 9570.652811270:418d6950: XXXX: ENTER tryDoAction elt 0 state 0 > 9570.652820109:418d6950: submitBatch: i:0, batch size 1, to process 1, pMsg: > 0xc87970, state 0 > 9570.652828309:418d6950: Action 0xc4e130 transitioned to state: itx > 9570.652836228:418d6950: entering actionCalldoAction(), state: itx > 9570.652845667:418d6950: file to log to: /var/log/messages > 9570.652854067:418d6950: doWrite, pData->pStrm 0xc4f150, lenBuf 174 > 9570.652862546:418d6950: strm 0xc4f150: file 6 flush, buflen 174 > 9570.652875305:418d6950: strm 0xc4f150: file 6 write wrote 174 bytes > 9570.652885664:418d6950: Action 0xc4e130 transitioned to state: rdy > 9570.652893624:418d6950: action call returned 0 > 9570.652901623:418d6950: XXXX: done tryDoAction elt 0 state 0, iret 0 > 9570.652909382:418d6950: XXXX: submitBatch got state 0 > 9570.652917182:418d6950: XXXX: submitBatch got state 0 > 9570.652924941:418d6950: XXXX: submitBatch pre while state 0 > 9570.652932941:418d6950: XXXX: END submitBatch elt 0 state 0, iRet 0 > 9570.652941060:418d6950: XXXX: qAddDirect returns 0 > 9570.652948899:418d6950: XXXX: queueEnqObj returns 0 > 9570.652956699:418d6950: XXXX: queueEnqObj returned 0 > 9570.652964498:418d6950: XXXX: processMsgDoActions returns 0 > 9570.652972338:418d6950: XXXX: rule.processMsg returns 0 > 9570.652980017:418d6950: XXXX: pcoessMsgDoRules returns 0 > 9570.652988096:418d6950: Called action, logging to builtin-fwd > 9570.652996056:418d6950: XXXX: ENTER tryDoAction elt 0 state 0 > 9570.653004895:418d6950: submitBatch: i:0, batch size 1, to process 1, pMsg: > 0xc87970, state 0 > 9570.653013055:418d6950: Action 0xc4e680 transitioned to state: itx > 9570.653021014:418d6950: entering actionCalldoAction(), state: itx > 9570.653030533:418d6950: 192.168.210.8:514/udp > 9570.653045972:418d6950: Action 0xc4e680 transitioned to state: rdy > 9570.653054811:418d6950: action call returned 0 > 9570.653063051:418d6950: XXXX: done tryDoAction elt 0 state 0, iret 0 > 9570.653071050:418d6950: XXXX: submitBatch got state 0 > 9570.653079010:418d6950: XXXX: submitBatch got state 0 > 9570.653087009:418d6950: XXXX: submitBatch pre while state 0 > 9570.653095888:418d6950: XXXX: END submitBatch elt 0 state 0, iRet 0 > 9570.653104368:418d6950: XXXX: qAddDirect returns 0 > 9570.653112367:418d6950: XXXX: queueEnqObj returns 0 > 9570.653120446:418d6950: XXXX: queueEnqObj returned 0 > 9570.653128446:418d6950: XXXX: processMsgDoActions returns 0 > 9570.653136525:418d6950: XXXX: rule.processMsg returns 0 > 9570.653144445:418d6950: XXXX: pcoessMsgDoRules returns 0 > 9570.653152484:418d6950: XXXX: processMsg got return state 0 > 9570.653160723:418d6950: msgConsumer processes msg 28/32 > 9570.653168803:418d6950: dropped NUL at very end of message > 9570.653352789:430d9950: > recv(4,76)/192.168.242.15,acl:1,msg:<5>iaalog[143336]: AIB|AAAA|2009/08/24 > 17:17:07|account summary|XXXXXXXXX > > 9570.653367348:430d9950: main Q: entry added, size now log 186, phys 218 > entries > 9570.653386266:430d9950: XXXX: queueEnqObj returns 0 > 9570.653394706:430d9950: main Q: EnqueueMsg advised worker start > 9570.653407625:430d9950: Listening on UDP syslogd socket 4 (IPv4/port 514). > 9570.653416024:430d9950: --------imUDP calling select, active file > descriptors (max 4): 4 > >> rainer >> >> ----- Urspr?ngliche Nachricht ----- >> Von: "[email protected]" <[email protected]> >> An: "rsyslog-users" <[email protected]> >> Gesendet: 25.08.09 16:20 >> Betreff: Re: [rsyslog] abort in 4.2.1 >> >> On Tue, 25 Aug 2009, Rainer Gerhards wrote: >> >>> On Mon, 2009-08-24 at 14:06 -0700, [email protected] wrote: >>>>> I'm testing to see if it has the problem I reported with 4.2.1 where it >>>>> dies >>>>> under load from malformed messages. >>>> >>>> It finally died just like 4.2.1 did. It took a _lot_ longer (which may >>>> just be that the race condition to cause the crash is smaller, 5.x is >>>> _significantly_ more efficiant than 4.x is. processing ~1800 messages/sec, >>>> writing them locally and relaying them to another machine eats up <2% cpu >>>> according to top) >>>> >>>> I restarted it in debug mode (this takes more cpu, almost 10% of a cpu) >>> >>> The bad thing about debug mode is that not only it is slower, but it >>> introduces some synchronization. So race bugs frequently disappear when >>> debug mode is turned on. Anyhow, sometimes they persist and then the >>> debug log often provides good information (aka "definitely worth a >>> try" ;)). >>> >>> I did some basic testing with the malformed message you provided in an >>> earlier message, but I unfortunately did not see anything that is not >>> clean. I am still a bit of the assumption that the malformednes of the >>> message is not a necessary condition for the segfault - but that needs >>> to be seen. No abort happened (yet) in my lab. >> >> I did finally get it to die, as soon as I get into the office I'll look at >> the end of the debug log >> >> the box I am duplicating this problem on relays all the logs it recieves >> up to another central box. the logs that come through this box are about a >> tenth of the total logs that the central box gets, and that central box >> has had no problems. >> >> the things that I see as being different are >> >> 1. the central box doesn't see the malformed messages (one of the relay >> boxes would fix that before forwarding it) >> >> 2. there are fewer systems sending simultaniously to the central box >> (there are ~100 boxes sending to the relay that dies, but only a half >> dozen relay boxes sending to the central box) >> >> two of the other relays handle a _far_ higher rate of logs, but from fewer >> sources (one has one source that spews ~15G of logs/day, the other >> recieves ~100m logs/day from 6 machines). a third relay has more machines >> sending it logs, but at a lower rate than those two (but still >> significantly higher than the one that fails). if there was a problem with >> load or the number of messages being recieved simultaniously I would >> expect one of these other three to have more problems than the one that >> fails on me. >> >> 3. a noticable fraction of the logs sent through this relay box are sent >> by a cron job running on each of ~60 machines that wakes up every min and >> scrapes a local file, sending all the pending messages, so the incoming >> messages are a bit burstier than normal, the relaying is still bursty, but >> it is only one bursty box, not many >> >> note that even if this cron job is stopped I still had 4.2.1 die on this >> relay box, so I don't think that it's the bursty nature of the traffic >> >> this is why I'm suspicious of the malformed message handling >> >> David Lang >> _______________________________________________ >> rsyslog mailing list >> http://lists.adiscon.net/mailman/listinfo/rsyslog >> http://www.rsyslog.com >> _______________________________________________ >> rsyslog mailing list >> http://lists.adiscon.net/mailman/listinfo/rsyslog >> http://www.rsyslog.com > _______________________________________________ > rsyslog mailing list > http://lists.adiscon.net/mailman/listinfo/rsyslog > http://www.rsyslog.com > _______________________________________________ > rsyslog mailing list > http://lists.adiscon.net/mailman/listinfo/rsyslog > http://www.rsyslog.com _______________________________________________ rsyslog mailing list http://lists.adiscon.net/mailman/listinfo/rsyslog http://www.rsyslog.com _______________________________________________ rsyslog mailing list http://lists.adiscon.net/mailman/listinfo/rsyslog http://www.rsyslog.com _______________________________________________ rsyslog mailing list http://lists.adiscon.net/mailman/listinfo/rsyslog http://www.rsyslog.com

