I'll jump in and mention that I noticed the same error while testing a couple of diffs I sent to tech@ a while back. In my case, since I was on a test box with no load whatsoever, the bug showed up when I ran "relayctl reload" several times in quick succession. It only happened when I had relays that used SSL in some way, and the stack trace ended somewhere in the bowels of OpenSSL, which was way over my head. If it's helpful, I can fire up my test bed and see if I can capture the stack trace again.
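For anyone who wants to try the same thing on a test box, here is a minimal sketch of that kind of reload burst. The function name reload_burst and the count of 20 are made up for illustration, not taken from the original test setup, and the loop only stops early if the command exits non-zero, which may or may not coincide with the crash.

```sh
#!/bin/sh
# Hypothetical helper (not part of relayd): run a command COUNT times
# back to back, with no delay between runs.
# Usage: reload_burst COUNT COMMAND [ARGS...]
reload_burst() {
    count=$1
    shift
    i=0
    while [ "$i" -lt "$count" ]; do
        # Stop at the first non-zero exit so the failing iteration is obvious.
        "$@" || { echo "command failed on iteration $((i + 1))" >&2; return 1; }
        i=$((i + 1))
    done
    echo "completed $count runs"
}

# Example (as root, on a test box only; 20 is an arbitrary count):
#   reload_burst 20 relayctl reload
```

Running relayd in the foreground with -dv in another terminal while the burst runs should show whether the free() error appears.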
It sounds similar to what you're seeing, Zack, but if it's not the same, I apologize for the unintentional thread-jacking. The error I logged a while back was "relayd in free(): error: chunk is already free 0x203490200"; however, I don't know whether that was with the stock relayd or after my attempts to fix it. I'll try again with a stock relayd later tonight and report my results. (Zack, your dmesg and relayd.conf didn't come through--send them inline, not as attachments.) I'm late to a meeting, but if a dmesg and/or relayd.conf are requested, I can and will provide them later.

-- Seth

On Wed, Jan 18, 2012 at 2:24 PM, Zack G. <posixb...@gmail.com> wrote:
> Here's what I can tell you:
>
> When the system is under high PPS load, relayd seems to restart
> (and frequently at that!) unless I significantly raise the check
> delays and timeouts. Otherwise, relayd functions normally (excepting
> the lost hce child) with the lower, more preferable values.
>
> This bug is elusive as hell and doesn't rear its head often. But,
> when it does, it usually does this repeatedly and continuously. I use
> a command to auto-restart relayd when it signal 6's and the output
> ends up looking like:
>
> Tue Jan 17 13:33:40 MST 2012
> restarted
> Tue Jan 17 13:34:04 MST 2012
> restarted
> Tue Jan 17 13:34:28 MST 2012
> restarted
> Tue Jan 17 13:34:56 MST 2012
> restarted
> Tue Jan 17 13:35:06 MST 2012
> restarted
> Tue Jan 17 13:35:24 MST 2012
> restarted
> Tue Jan 17 13:35:48 MST 2012
> restarted
> Tue Jan 17 13:35:55 MST 2012
> restarted
> Tue Jan 17 13:36:20 MST 2012
> restarted
>
> So, as you can see, this occurs rather frequently during high PPS
> load times.
>
> The error I see when running relayd with -dv is:
>
> relayd in free(): error: bogus pointer (double free?) 0x206ac8000
> lost child: hce terminated; signal 6
> pfe exiting, pid 12691
> relay exiting, pid 31468
> relay exiting, pid 5714
> relay exiting, pid 2319
> relay exiting, pid 19145
> relay exiting, pid 20233
> parent terminating, pid 5977
>
> dmesg.boot.bz2 as requested by the FAQ is attached.
>
> I've also included a copy of the relayd.conf.bz2.
>
> I wish I could provide you with more information, but, this is as much
> as I can provide at this point in time. Unfortunately, this problem
> is mostly an issue on our production router (as it's the only one
> that receives such high traffic at any given point in time). I can't
> tweak around with it enough to get further trace information and I
> don't have the time/resources to dig further into this issue at the
> moment.
>
> I hope this is enough to get started on the bug. If you need any more
> information from me on my environment, I will do my best to get it for
> you.
>
> Happy hacking and all the best,
>
> Zack
>
> [demime 1.01d removed an attachment of type application/x-bzip2 which had a name of relayd.conf.bz2]
>
> [demime 1.01d removed an attachment of type application/x-bzip2 which had a name of dmesg.boot.bz2]
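
The auto-restart command mentioned above isn't shown in the thread. As a rough sketch only, a loop along these lines would produce the same "date, then restarted" output. The function name keep_running and the restart cap are hypothetical. It restarts on any exit rather than checking for signal 6, since in the log above it is the hce child that aborts and the parent then terminates on its own.

```sh
#!/bin/sh
# Hypothetical watchdog (not the command from the thread): rerun a
# foreground command every time it exits, up to MAX_RESTARTS times.
# Usage: keep_running MAX_RESTARTS COMMAND [ARGS...]
keep_running() {
    max=$1
    shift
    restarts=0
    while :; do
        # Blocks here until the command exits for any reason.
        "$@"
        # Give up once the restart budget is used, so a crash loop ends.
        [ "$restarts" -lt "$max" ] || return 1
        restarts=$((restarts + 1))
        date
        echo restarted
    done
}

# Example (100 is an arbitrary cap; -d keeps relayd in the foreground):
#   keep_running 100 relayd -dv
```

Capping the restarts is a design choice for the sketch; an uncapped loop would match the nonstop restarts in the log but never returns.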