On Mon, 31 Aug 2009, Rainer Gerhards wrote:

> On Fri, 2009-08-28 at 14:55 -0700, [email protected] wrote:
>> On Fri, 28 Aug 2009, Rainer Gerhards wrote:
>>> Also, it would be good if you could build with --enable-rtinst
>>> --enable-debug and try out that version on your machine. I am a bit
>>> concerned about the speed of the resulting executable; it may be too
>>> slow. You do not need to run it in debug mode itself. These options
>>> (especially --enable-debug) will activate in-depth runtime checks
>>> (asserts that abort when something goes wrong), and my hope is that
>>> they will catch the bug closer to the root cause. If so, I would need
>>> the gdb abort info (actually enabling debug output would be an option
>>> some time later).
>>>
>>> Please let me know what would be OK with you.
>>
>> I will give this a try.
>>
>> I was going to suggest that, since the message is getting corrupted, it
>> may make sense to make a temporary branch that has multiple message
>> buffers and, at various points in the message processing, copies the
>> message into one of those buffers. When the system crashes I will be able
>> to look at the core and see where the message is getting corrupted.
>
> David, I fear it is even more complicated than that. It looks like not
> only the message got corrupted but the message object itself. There are
> already two copies of some of the message elements, and they also look
> inconsistent - except if we really had a null message, that is, one with
> no content at all (and generating a message object from a null message,
> I think, would be a bug in itself - but I am sure there are no such
> messages in your actual traffic). If you think there could be a real
> null message, I'd follow that path (will probably do so in any case...).

I know that in some places on my network I am seeing malformed messages
that look like they are overflowing one packet and spilling into a
second packet. The result is that 20 or so characters make up the entire
contents of the message, showing up as the system name with no actual
system tag or message following it.

It's possible that there are packets with nothing in them, but I am not
aware of any.
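If a null message did arrive, it should never become a message object in the first place. A minimal sketch of such a guard, assuming the receiver has the raw datagram as a byte buffer plus a length; `datagram_has_content()` is a hypothetical helper, not part of rsyslog:

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical guard (not an rsyslog API): returns true only if the
 * received datagram contains at least one byte that is neither NUL nor
 * whitespace, i.e. it is worth turning into a message object. */
static bool datagram_has_content(const unsigned char *buf, size_t len)
{
    if (buf == NULL || len == 0)
        return false;               /* truly empty packet */
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != '\0' && !isspace(buf[i]))
            return true;            /* found real payload */
    }
    return false;                   /* only NULs and/or whitespace */
}
```

Counting or logging the rejected datagrams, rather than silently dropping them, would also show whether such packets exist in the traffic at all.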

> I think that what really happens is that some part of the code runs
> wild, thus invalidating some random part of main memory. At times, it
> hits queue structures (or the message object that is held by
> them) and if so, we will see the abort you experience. With that
> scenario, duplicating the message buffer does not really help, because
> looking at the corrupted message object would not provide any additional
> information.

ouch
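If some code is writing over random memory, one cheap way to notice the damage closer to when it happens (similar in spirit to the assert-based runtime checks of the debug build) is to bracket each message object with magic values and verify them at every queue hand-off. A minimal sketch; the `guarded_msg` type and the marker values are hypothetical and do not reflect rsyslog's actual message structure:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical marker values bracketing the payload. */
#define MSG_MAGIC_HEAD 0x4D534748u
#define MSG_MAGIC_TAIL 0x54414C4Cu

/* Hypothetical message object (not rsyslog's): a stray write that lands
 * on it is likely to clobber at least one of the two magic words. */
typedef struct {
    uint32_t head;
    size_t   len;
    char     text[256];
    uint32_t tail;
} guarded_msg;

static void guarded_msg_init(guarded_msg *m)
{
    m->head = MSG_MAGIC_HEAD;
    m->len = 0;
    m->text[0] = '\0';
    m->tail = MSG_MAGIC_TAIL;
}

/* Intended to be called at queue boundaries (enqueue, dequeue, output):
 * checks both markers and that the length still fits the buffer. */
static bool guarded_msg_ok(const guarded_msg *m)
{
    return m->head == MSG_MAGIC_HEAD
        && m->tail == MSG_MAGIC_TAIL
        && m->len < sizeof m->text;
}
```

A failed check only says the object was damaged somewhere between the last successful check and this one; it does not name the code responsible, but it narrows the window.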

> However, if that's easy enough to reproduce, it would probably be good
> if you could send me the core analysis (the backtrace and the print
> statements) from a few (five maybe?) independent aborts. Maybe they show
> a pattern. It would probably be best to send them via private mail, as I am
> not sure whether they disclose more than they should.

I will see about doing that.

>>
>> I will see about doing a tcpdump at the time that I do this and send it to
>> you (I'll need to check with management, but since we have a contract in
>> place for other reasons I think we can do this).
>>
>
> That would probably be a good thing. I've made some progress with my
> testing tool, and I have created a basic version right now. Probably not
> good enough to mimic your traffic pattern, but closer. I have been doing a
> test run for quite some time now, unfortunately so far without an abort.
>
> Note that I ran into trouble with UDP: even though I've put some
> one-ms sleeps into the code, I lose a lot of messages, apparently even
> before they hit the wire. It's always really troublesome to test with
> UDP...

Interesting. I have been able to get very high transmission rates with UDP
without losing packets.

What I did was use syslog to generate sample messages, capture them
with tcpdump, and then use tcpreplay to send them at varying data rates.
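When generating traffic from a test tool rather than replaying a capture, a fixed sleep after each send lets the real rate drift below the target, because the send itself takes time. A minimal sketch that paces sends against absolute deadlines using POSIX `clock_nanosleep()` with `TIMER_ABSTIME`; the `send_paced()` helper and the message text are hypothetical, and this does not by itself explain or prevent the UDP losses mentioned above:

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

/* Add ns nanoseconds to *t, keeping tv_nsec in [0, 1e9). */
static void timespec_add_ns(struct timespec *t, long ns)
{
    t->tv_nsec += ns;
    while (t->tv_nsec >= 1000000000L) {
        t->tv_nsec -= 1000000000L;
        t->tv_sec += 1;
    }
}

/* Hypothetical helper: send `count` syslog-style lines over UDP to dest at
 * roughly `rate` messages per second (rate must be > 0). Each sleep targets
 * an absolute deadline, so time spent in sendto() does not accumulate as
 * drift. Returns the number of datagrams that sendto() fully accepted. */
static int send_paced(int fd, const struct sockaddr_in *dest, int count, long rate)
{
    struct timespec next;
    long interval_ns = 1000000000L / rate;
    char line[128];
    int sent = 0;

    clock_gettime(CLOCK_MONOTONIC, &next);
    for (int i = 0; i < count; i++) {
        int n = snprintf(line, sizeof line, "<13>testhost paced: msg %d", i);
        if (sendto(fd, line, (size_t)n, 0,
                   (const struct sockaddr *)dest, sizeof *dest) == n)
            sent++;
        timespec_add_ns(&next, interval_ns);
        /* Sleep until the next absolute deadline (EINTR ignored here). */
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
    }
    return sent;
}
```

Note that a successful sendto() only means the local stack accepted the datagram; it says nothing about whether the receiver got it.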

David Lang

> Rainer
>> I can't do this late on a Friday, but I should be able to do this Monday
>> afternoon.
>>
>> David Lang
>> _______________________________________________
>> rsyslog mailing list
>> http://lists.adiscon.net/mailman/listinfo/rsyslog
>> http://www.rsyslog.com
