On Fri, 2009-08-28 at 14:55 -0700, [email protected] wrote:
> On Fri, 28 Aug 2009, Rainer Gerhards wrote:
> > Also, it would be good if you could build with --enable-rtinst
> > --enable-debug and try out that version on your machine. I am a bit
> > concerned about the speed of the resulting executable, it may be too
> > slow. You do not need to run it in debug mode itself. These options
> > (especially --enable-debug) will activate in-depth runtime checks
> > (assert, will abort when something wrong happens) and my hope is that
> > they will catch the bug closer to the root cause. If so, I would need
> > the gdb abort info (actually enabling debug output would be an option
> > some time later).
> >
> > Please let me know what would be OK with you.
>
> I will give this a try.
>
> I was going to suggest that since we have the message getting corrupted it
> may make sense to make a temporary branch that has multiple message
> buffers and at various times through the message processing it makes a
> copy of the message to the buffer. When the system crashes I will be able
> to look at the core and see where the message is getting corrupted.

David,

I fear it is even more complicated than that. It looks like not only the
message got corrupted but also the message object itself. There are already
two copies of some of the message elements, and they also look inconsistent,
unless we really had a null message, that is, one with no content at all.
(Generating a message object from a null message would, I think, be a bug in
itself, but I am sure there are no such messages in your actual traffic.) If
you think there could be a real null message, I'd follow that path (will
probably do so in any case...).

I think that what really happens is that some part of the code runs wild,
thus invalidating some random part of main memory. Sometimes it hits the
queue structures (or the message object held by them), and when it does, we
see the abort you experience. In that scenario, duplicating the message
buffer does not really help, because looking at the corrupted message object
would not provide any additional information.

However, if the problem is easy enough to reproduce, it would probably be
good if you could send me the core analysis (the backtrace and the print
statements) from a few (five maybe?) independent aborts. Maybe they show a
pattern. It would probably be best to send them via private mail, as I am
not sure whether they disclose more than they should.

>
> I will see about doing a tcpdump at the time that I do this and send it to
> you (I'll need to check with management, but since we have a contract in
> place for other reasons I think we can do this)
>

That would probably be a good thing. I've made some progress with my testing
tool and now have a basic version. It is probably not good enough to mimic
your traffic pattern, but closer. I have been doing a test run for quite
some time now, unfortunately so far without an abort. Note that I ran into
trouble with UDP: even though I've put some one-ms sleeps into the code, I
lose a lot of messages, apparently even before they hit the wire.
It's always really troublesome to test with UDP...

Rainer

> I can't do this late on a Friday, but I should be able to do this Monday
> afternoon.
>
> David Lang
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com

