On Tue, 25 Aug 2009, Rainer Gerhards wrote:

> Mmhhh... Unfortunately, this does not show anything immediately obvious. 
> Could you provide me with a gdb backtrace of the abort? Knowing where it 
> aborted often helps...

I don't know how to do this.

David Lang
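
[For reference, a backtrace from an aborting daemon can usually be captured along these lines. This is a generic sketch; the rsyslogd path and flags are assumptions for illustration, not taken from this thread.]

```shell
# 1. Allow core dumps in the shell that will start the daemon
ulimit -c unlimited

# 2. Run rsyslogd in the foreground and wait for the abort
/usr/sbin/rsyslogd -n

# 3. Load the resulting core file into gdb and print the backtrace
gdb /usr/sbin/rsyslogd core
# (gdb) bt

# Alternatively, run the daemon under gdb directly; gdb stops on the
# fatal signal and 'bt' then shows where it aborted:
gdb --args /usr/sbin/rsyslogd -n
# (gdb) run
# (gdb) bt
```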

> rainer
>
> ----- Original Message -----
> From: "[email protected]" <[email protected]>
> To: "rsyslog-users" <[email protected]>
> Sent: 25.08.09 17:16
> Subject: Re: [rsyslog] abort in 4.2.1
>
> On Tue, 25 Aug 2009, Rainer Gerhards wrote:
>
>> Date: Tue, 25 Aug 2009 16:44:26 +0200
>> From: Rainer Gerhards <[email protected]>
>> Reply-To: rsyslog-users <[email protected]>
>> To: rsyslog-users <[email protected]>
>> Subject: Re: [rsyslog] abort in 4.2.1
>> 
>> Ok, that is good info. I'll still stand by for the debug log, but if that 
>> doesn't show anything I'll probably look into crafting some small tools 
>> to create a similar environment. Do the malformed messages themselves come 
>> in in bursts (potentially without well-formed ones in between)?
>
> the ones from the cron job definitely come in bursts, but even after I had 
> them modify that script to make those messages well-formed I still had it 
> die (at the moment I have had them revert that script to assist in this 
> debugging)
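
[One way to craft the "similar environment" mentioned above: the sketch below sends a burst of malformed messages, with an occasional well-formed one mixed in, over UDP. The target address, port, and message text are made-up placeholders, not details from this thread; it uses bash's /dev/udp device and the util-linux logger tool.]

```shell
#!/bin/bash
# Hypothetical burst generator: flood a TEST relay with messages that
# have a valid <PRI> field but no timestamp/hostname header, mimicking
# the cron-job traffic described in this thread.
TARGET=127.0.0.1   # placeholder: point at a test relay, never production
for i in $(seq 1 1000); do
    # malformed: PRI immediately followed by free-form payload
    printf '<5>iaalog[%d]: AIB|test|burst message %d' "$$" "$i" \
        > "/dev/udp/$TARGET/514"
    # every 10th message, send a well-formed one via logger for contrast
    [ $((i % 10)) -eq 0 ] && logger -n "$TARGET" -P 514 -d "well-formed $i"
done
```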
>
> here is the tail of the debug log (with the messages themselves lightly 
> sanitized)
>
> note that the debug log was _very_ large
>
> -rw-r--r-- 1 root root 2010546482 Aug 24 21:32 rsyslog.debug
>
> like the prior debugs, this dies on one of the malformed messages
>
> 9570.652786352:418d6950: msg parser: flags 30, from '192.168.242.15', msg 
> '<5>iaalog[143336]: AIB|AAAAA|2009/08/24 17:12:48|mfa 
> challenge|XXXXXXXXX|XXX.XX.XX.XXX|Challenge Question(s)|Challenge 
> Presented|None|N/A|N/A|N/A'
> 9570.652794351:418d6950: Message has legacy syslog format.
> 9570.652803191:418d6950: Called action, logging to builtin-file
> 9570.652811270:418d6950: XXXX: ENTER tryDoAction elt 0 state 0
> 9570.652820109:418d6950: submitBatch: i:0, batch size 1, to process 1, pMsg: 
> 0xc87970, state 0
> 9570.652828309:418d6950: Action 0xc4e130 transitioned to state: itx
> 9570.652836228:418d6950: entering actionCalldoAction(), state: itx
> 9570.652845667:418d6950: file to log to: /var/log/messages
> 9570.652854067:418d6950: doWrite, pData->pStrm 0xc4f150, lenBuf 174
> 9570.652862546:418d6950: strm 0xc4f150: file 6 flush, buflen 174
> 9570.652875305:418d6950: strm 0xc4f150: file 6 write wrote 174 bytes
> 9570.652885664:418d6950: Action 0xc4e130 transitioned to state: rdy
> 9570.652893624:418d6950: action call returned 0
> 9570.652901623:418d6950: XXXX: done tryDoAction elt 0 state 0, iret 0
> 9570.652909382:418d6950: XXXX: submitBatch got state 0
> 9570.652917182:418d6950: XXXX: submitBatch got state 0
> 9570.652924941:418d6950: XXXX: submitBatch pre while state 0
> 9570.652932941:418d6950: XXXX: END submitBatch elt 0 state 0, iRet 0
> 9570.652941060:418d6950: XXXX: qAddDirect returns 0
> 9570.652948899:418d6950: XXXX: queueEnqObj returns  0
> 9570.652956699:418d6950: XXXX: queueEnqObj returned 0
> 9570.652964498:418d6950: XXXX: processMsgDoActions returns 0
> 9570.652972338:418d6950: XXXX: rule.processMsg returns 0
> 9570.652980017:418d6950: XXXX: pcoessMsgDoRules returns 0
> 9570.652988096:418d6950: Called action, logging to builtin-fwd
> 9570.652996056:418d6950: XXXX: ENTER tryDoAction elt 0 state 0
> 9570.653004895:418d6950: submitBatch: i:0, batch size 1, to process 1, pMsg: 
> 0xc87970, state 0
> 9570.653013055:418d6950: Action 0xc4e680 transitioned to state: itx
> 9570.653021014:418d6950: entering actionCalldoAction(), state: itx
> 9570.653030533:418d6950:  192.168.210.8:514/udp
> 9570.653045972:418d6950: Action 0xc4e680 transitioned to state: rdy
> 9570.653054811:418d6950: action call returned 0
> 9570.653063051:418d6950: XXXX: done tryDoAction elt 0 state 0, iret 0
> 9570.653071050:418d6950: XXXX: submitBatch got state 0
> 9570.653079010:418d6950: XXXX: submitBatch got state 0
> 9570.653087009:418d6950: XXXX: submitBatch pre while state 0
> 9570.653095888:418d6950: XXXX: END submitBatch elt 0 state 0, iRet 0
> 9570.653104368:418d6950: XXXX: qAddDirect returns 0
> 9570.653112367:418d6950: XXXX: queueEnqObj returns  0
> 9570.653120446:418d6950: XXXX: queueEnqObj returned 0
> 9570.653128446:418d6950: XXXX: processMsgDoActions returns 0
> 9570.653136525:418d6950: XXXX: rule.processMsg returns 0
> 9570.653144445:418d6950: XXXX: pcoessMsgDoRules returns 0
> 9570.653152484:418d6950: XXXX: processMsg got return state 0
> 9570.653160723:418d6950: msgConsumer processes msg 28/32
> 9570.653168803:418d6950: dropped NUL at very end of message
> 9570.653352789:430d9950: 
> recv(4,76)/192.168.242.15,acl:1,msg:<5>iaalog[143336]: AIB|AAAA|2009/08/24 
> 17:17:07|account summary|XXXXXXXXX
>
> 9570.653367348:430d9950: main Q: entry added, size now log 186, phys 218 
> entries
> 9570.653386266:430d9950: XXXX: queueEnqObj returns  0
> 9570.653394706:430d9950: main Q: EnqueueMsg advised worker start
> 9570.653407625:430d9950: Listening on UDP syslogd socket 4 (IPv4/port 514).
> 9570.653416024:430d9950: --------imUDP calling select, active file 
> descriptors (max 4): 4
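>
[The "Message has legacy syslog format" line in the trace above refers to RFC 3164-style detection: a leading '<PRI>' field is found and the rest of the line is carried as free text, which is why the non-standard 'iaalog' payload still passes. A rough sketch of that check, as assumed logic for illustration and not rsyslog's actual parser:]

```shell
#!/bin/bash
# Split a legacy (RFC 3164-style) syslog line into PRI and payload.
re='^<([0-9]{1,3})>(.*)$'
msg='<5>iaalog[143336]: AIB|AAAAA|2009/08/24 17:12:48|mfa challenge|...'
if [[ $msg =~ $re ]]; then
    pri=${BASH_REMATCH[1]}
    # PRI encodes facility*8 + severity; <5> -> facility 0, severity 5
    echo "facility=$((pri / 8)) severity=$((pri % 8))"
    echo "payload=${BASH_REMATCH[2]}"
else
    echo "no PRI field: not parseable as legacy syslog"
fi
```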
>
>> rainer
>>
>> ----- Original Message -----
>> From: "[email protected]" <[email protected]>
>> To: "rsyslog-users" <[email protected]>
>> Sent: 25.08.09 16:20
>> Subject: Re: [rsyslog] abort in 4.2.1
>>
>> On Tue, 25 Aug 2009, Rainer Gerhards wrote:
>>
>>> On Mon, 2009-08-24 at 14:06 -0700, [email protected] wrote:
>>>>> I'm testing to see if it has the problem I reported with 4.2.1 where it 
>>>>> dies
>>>>> under load from malformed messages.
>>>>
>>>> It finally died just like 4.2.1 did. It took a _lot_ longer (which may
>>>> just mean the race window that causes the crash is smaller; 5.x is
>>>> _significantly_ more efficient than 4.x. Processing ~1800 messages/sec,
>>>> writing them locally and relaying them to another machine eats up <2% CPU
>>>> according to top.)
>>>>
>>>> I restarted it in debug mode (this takes more cpu, almost 10% of a cpu)
>>>
>>> The bad thing about debug mode is that not only is it slower, but it
>>> also introduces some synchronization. So race bugs frequently disappear
>>> when debug mode is turned on. Anyhow, sometimes they persist, and then
>>> the debug log often provides good information (aka "definitely worth a
>>> try" ;)).
>>>
>>> I did some basic testing with the malformed message you provided in an
>>> earlier message, but unfortunately I did not see anything that is not
>>> clean. I still tend to assume that the malformedness of the message is
>>> not a necessary condition for the segfault - but that remains to be
>>> seen. No abort has happened (yet) in my lab.
>>
>> I did finally get it to die; as soon as I get into the office I'll look at 
>> the end of the debug log.
>>
>> the box I am duplicating this problem on relays all the logs it receives 
>> up to another central box. the logs that come through this box are about a 
>> tenth of the total logs that the central box gets, and that central box 
>> has had no problems.
>>
>> the things that I see as being different are
>>
>> 1. the central box doesn't see the malformed messages (one of the relay 
>> boxes would fix that before forwarding it)
>>
>> 2. there are fewer systems sending simultaneously to the central box 
>> (there are ~100 boxes sending to the relay that dies, but only a half 
>> dozen relay boxes sending to the central box)
>>
>> two of the other relays handle a _far_ higher rate of logs, but from fewer 
>> sources (one has a single source that spews ~15G of logs/day, the other 
>> receives ~100m logs/day from 6 machines). a third relay has more machines 
>> sending it logs, but at a lower rate than those two (though still 
>> significantly higher than the one that fails). if there were a problem with 
>> load or the number of messages being received simultaneously, I would 
>> expect one of these other three to have more problems than the one that 
>> fails on me.
>>
>> 3. a noticeable fraction of the logs sent through this relay box are sent 
>> by a cron job running on each of ~60 machines that wakes up every minute 
>> and scrapes a local file, sending all the pending messages. so the incoming 
>> messages are a bit burstier than normal. the relaying is still bursty, but 
>> it is only one bursty box, not many
>>
>> note that even if this cron job is stopped I still had 4.2.1 die on this 
>> relay box, so I don't think that it's the bursty nature of the traffic
>>
>> this is why I'm suspicious of the malformed message handling
>>
>> David Lang
>> _______________________________________________
>> rsyslog mailing list
>> http://lists.adiscon.net/mailman/listinfo/rsyslog
>> http://www.rsyslog.com
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com