Re: More bgpd problems
On 05/30/2012 04:27 AM, Matt Hamilton wrote:
> James Shupe writes:
>
>> I've been running it to peer with 3 IPv4 peers and 3 IPv6 peers (full
>> views) and another partial IPv4 view with 12k routes (actually: varying
>> amounts of peers over the years, but that's the current setup) since 4.5
>> without needing any cron jobs to watch over it.
>
> It looks like the issue is likely to be bgpd's interaction with ospfd
> and/or CARP. I have CARP configured on two routers that act as gateways to
> one of our upstream providers. They speak OSPF and BGP to internal
> routers and routers that peer with other remote networks. So I think
> what happens is that a CARP failover occurs (they are quite regular for some
> reason, but it's never bothered me as it just works) and that causes
> OSPF to change its metrics, which in turn causes routing changes in BGP.
> It's this propagation of events that I think is causing issues.
>

We've always been running OSPFD and, since 4.7 or 4.8 or so, OSPF6D
(that's when it became usable for us), without issue. We also run CARP,
because these routers are installed in pairs and also act as default
gateways for machines directly behind them... so neither of those is
ruled out in our setup.

>> nrpe and ifstated run to verify the peers are up and react accordingly,
>> but they never trigger unless there is a physical or provider issue.
>> OpenBGPD has been rock solid for us.
>
> I'd be very interested to see your ifstated config and how you use
> that to verify peers being up as we could do with some better
> monitoring here.

I'll get something together when I'm at work later; I'm shooting this
email off real quick before I leave the house.

>
> -Matt
>

Thank you,
--
James Shupe
Re: More bgpd problems
On Wed, 30 May 2012 09:27:23 + (UTC), Matt Hamilton wrote:

Hello,

> I'd be very interested to see your ifstated config and how you use
> that to verify peers being up as we could do with some better
> monitoring here.

Here we use "bgpctl show summary terse" with a grep on the peer name and
"Established". Simple, but it does the job.

# bgpctl show summary terse
RenaterV6 2200 Established
RenaterV4 2200 Established

(We have never seen bgpd crash.)

Regards.
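The grep-based check described in this message could be wrapped in a small shell function, roughly as follows. This is only a sketch: the function names peer_established and check_peers are made up for illustration, and it assumes the terse output carries the peer name in the first column and the session state in the last one, as in the sample output quoted above.

```shell
#!/bin/sh
# peer_established NAME (hypothetical helper): reads the output of
# "bgpctl show summary terse" on stdin and succeeds only if a line whose
# first field is NAME has "Established" as its last field.
peer_established() {
    awk -v peer="$1" '$1 == peer && $NF == "Established" { ok = 1 }
                      END { exit !ok }'
}

# check_peers NAME... (hypothetical helper): reads the same output on stdin,
# prints one line per peer that is not Established, and returns non-zero
# if any peer failed the check.
check_peers() {
    summary=$(cat)
    rc=0
    for peer in "$@"; do
        if ! printf '%s\n' "$summary" | peer_established "$peer"; then
            echo "peer $peer is not Established"
            rc=1
        fi
    done
    return $rc
}
```

On a router this might be run as "bgpctl show summary terse | check_peers RenaterV4 RenaterV6", with the exit status fed to whatever monitoring is in use.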
Re: More bgpd problems
James Shupe writes:
> I've been running it to peer with 3 IPv4 peers and 3 IPv6 peers (full
> views) and another partial IPv4 view with 12k routes (actually: varying
> amounts of peers over the years, but that's the current setup) since 4.5
> without needing any cron jobs to watch over it.

It looks like the issue is likely to be bgpd's interaction with ospfd
and/or CARP. I have CARP configured on two routers that act as gateways to
one of our upstream providers. They speak OSPF and BGP to internal
routers and routers that peer with other remote networks. So I think
what happens is that a CARP failover occurs (they are quite regular for some
reason, but it's never bothered me as it just works) and that causes
OSPF to change its metrics, which in turn causes routing changes in BGP.
It's this propagation of events that I think is causing issues.

> nrpe and ifstated run to verify the peers are up and react accordingly,
> but they never trigger unless there is a physical or provider issue.
> OpenBGPD has been rock solid for us.

I'd be very interested to see your ifstated config and how you use
that to verify peers being up, as we could do with some better
monitoring here.

-Matt
Re: More bgpd problems
On 2012-05-29, Matt Hamilton wrote: > Otto Moerbeek drijf.net> writes: > >> >> On Tue, May 29, 2012 at 08:57:54AM +, Matt Hamilton wrote: >> >> > Hi all, >> > >> > More bgpd problems last night :( This happened last night on two of our >> > routers. One running an old version of OpenBSD (4.3) and one running >> > 5.1. Is there anyone out there actually using bpgd in production? How >> > do you deal with it quitting everytime something unexpected happens on >> > the network? >> >> Yes, lots of people run it in production. > > That is what I'd expect. I just don't understand how with it keep dropping > out when it has some transient problem. > >> > >> > The first message below seems to indicate unable to allocate >> > memory. I'm running these boxes pretty much stock having not tuned any >> > parameters at all. Both are just running routing daemons (bgpd, ospf) >> > and the 4.3 box is running OpenVPN. There are no applications running >> > and both boxes have plenty of RAM (4GB) and not using any swap or >> > anything. >> > >> > Is there something I should look at tuning in terms >> > of memory allocation in order to stop this happening? >> > >> > OpenBSD 4.3/amd64: >> > >> > May 29 05:53:43 firewall1 bgpd[5090]: imsg_create: buf_open: Cannot >> > allocate memory >> > May 29 05:53:43 firewall1 bgpd[5090]: fatal in RDE: imsg_compose >> > error: Cannot allocate memory >> > May 29 05:53:44 firewall1 bgpd[27053]: Lost child: route decision >> > engine exited >> > May 29 05:53:44 firewall1 bgpd[15204]: fatal in SE: pipe write error: >> > Broken pipe >> >> Only solution: upgrading. You are runing unsupported software, a >> foolish thing to do. > > Alas we don't all live in Utopia ;) This box is due to be upgraded soon, > but that upgrade is predicated on getting a stable routing environment > so that I can do so. At the moment we are mid-way through migrating > away from Cisco kit to OpenBSD routers. 
> Until I can be confident that it
> won't all just fall over I can't continue with the migration.

I would *not* want to be running ospfd from before 5.1 on a DFZ router.
First, RTM_DESYNC (route socket overflows) were not dealt with at all in
ospfd until 4.8, and from then until 5.1 they tended to result in lots of
kernel route table dumps in quick succession to get back into sync, which
is pretty hard on the machine. In 5.1 a holdoff timer was introduced for
these resyncs.

bgpd-wise, since 4.3 there have been fixes for crashes triggered by bad
updates (these affected most BGP implementations, not just OpenBSD) and
numerous other fixes.

If you are upgrading from that version then use bsd.rd to upgrade rather
than untarring sets on the live system, and read the upgrade notes for
the intermediate versions; I think that time period includes slightly
incompatible changes to bgpd.conf.

> So any insight on why I would be getting the same symptoms on the 5.1
> box? And was getting bgpd dying before under 5.0? I'm finding it hard
> to believe that this behaviour would have been tolerated by people
> running bgpd in production all the way from the time of 4.3 to now.
> Which leads to the only conclusion... I'm doing something stupid.
> The question is what. I have ospfd and bgpd running. On the 5.1 box
> there is also a CARP interface too (not an interface we are using ospfd on).
>
> -Matt
>

Not sure when I started seeing it, as I had various other problems on the
network and with hardware back in the 4.3 days (what's that, 4 years ago
or so?). Some people don't seem to hit it at all. One of the most common
uses of OpenBGPD is running as a route server with mostly LAN-based
connections, and I suspect this type of setup is less likely to hit this
problem. I usually only hit it on routers connected via WAN links
(redundant paths with OSPF which flap on occasion).

I usually hit the memory problem a few times in fairly quick succession,
then not again for sometimes as much as a couple of months or even
longer. Without a way to trigger it in the lab, and in my case without
much storage on the routers to save dumps, getting more information to
help track it down is challenging. And of course I am reliant on
out-of-band access and need to get the network back up at that point,
often not fully awake, having been woken by a text from icinga, so debug
opportunities are very limited.

If you're better able to try to get some debug information, then from
what we've worked out more recently I would suggest flapping the OSPF
links as a possible trigger.
Re: More bgpd problems
Philip Guenther writes:
> Roger. To paraphrase: in order for such a process to be able to dump
> core, do the following:
>
> Create /var/empty/var/crash/ and chown it to the user that the
> [chroot'ed priv-sep'ed process] runs
> as, then set the kern.nosuidcoredump sysctl to 2.

OK, great. I've done that on all 7 boxes:

4 x OpenBSD 5.1/amd64
2 x OpenBSD 5.0/i386
1 x OpenBSD 4.3/amd64

and tested it with SIGABRT and I get a core file. So now just to sit and
wait until it happens again.

Thanks!

-Matt
Re: More bgpd problems
On Tue, May 29, 2012 at 09:25:16PM +0200, Peter J. Philipp wrote:
> Recompile the bgpd with debugging symbols (CFLAGS+=-g, LDFLAGS+=-g). And
> install that.

I thought -current was already compiled with debug symbols, isn't it?

jirib
Re: More bgpd problems
On 05/29/2012 05:41 AM, Garry Dolley wrote:
> On Tue, May 29, 2012 at 08:57:54AM +, Matt Hamilton wrote:
>> Hi all,
>>
>> More bgpd problems last night :( This happened last night on two of our
>> routers. One running an old version of OpenBSD (4.3) and one running
>> 5.1. Is there anyone out there actually using bgpd in production? How
>
> Yes. For the record I run it on OpenBSD 4.4; IPv6 traffic only.
> While there have been some quirks over the years, I've never seen it
> quit.
>

I've been running it to peer with 3 IPv4 peers and 3 IPv6 peers (full
views) and another partial IPv4 view with 12k routes (actually: varying
amounts of peers over the years, but that's the current setup) since 4.5
without needing any cron jobs to watch over it.

nrpe and ifstated run to verify the peers are up and react accordingly,
but they never trigger unless there is a physical or provider issue.
OpenBGPD has been rock solid for us.

--
James Shupe
Re: More bgpd problems
On Tue, May 29, 2012 at 12:30 PM, Henning Brauer wrote:
> * Peter J. Philipp [2012-05-29 21:26]:
>> 1. Make BGPD dump core
>
> it doesn't work that way due to bgpd dropping privs and chrooting.
> the way involves setting kern.nosuidcoredump to 2, but since we have
> all that already written down in an email to a non-public list, it'll
> be easiest to make that available.

Roger. To paraphrase: in order for such a process to be able to dump
core, do the following:

Create /var/empty/var/crash/ and chown it to the user that the
[chroot'ed priv-sep'ed process] runs as, then set the
kern.nosuidcoredump sysctl to 2.

Philip Guenther
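As a rough illustration, the steps paraphrased in this message could be written down as a shell function that prints the commands rather than running them, so they can be reviewed first. The function name coredump_setup_cmds is hypothetical, and the sketch assumes the privilege-separated process runs as the _bgpd user and is chrooted to /var/empty (the latter inferred from the path given above).

```shell
#!/bin/sh
# coredump_setup_cmds USER [CHROOT_DIR] (hypothetical helper): print the
# commands that create the crash directory inside the chroot, hand it to the
# unprivileged user, and set the sysctl. Nothing is executed here.
coredump_setup_cmds() {
    user=$1
    chroot_dir=${2:-/var/empty}   # assumed chroot location
    printf 'mkdir -p %s/var/crash\n' "$chroot_dir"
    printf 'chown %s %s/var/crash\n' "$user" "$chroot_dir"
    printf 'sysctl kern.nosuidcoredump=2\n'
}
```

After checking the output, something like "coredump_setup_cmds _bgpd | sh" run as root would apply it. A sysctl set this way presumably does not survive a reboot unless it is also added to /etc/sysctl.conf.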
Re: More bgpd problems
* Peter J. Philipp [2012-05-29 21:26]:
> 1. Make BGPD dump core

it doesn't work that way due to bgpd dropping privs and chrooting.
the way involves setting kern.nosuidcoredump to 2, but since we have
all that already written down in an email to a non-public list, it'll
be easiest to make that available.

--
Henning Brauer, h...@bsws.de, henn...@openbsd.org
BS Web Services, http://bsws.de, Full-Service ISP
Secure Hosting, Mail and DNS Services. Dedicated Servers, Root to Fully Managed
Henning Brauer Consulting, http://henningbrauer.com/
Re: More bgpd problems
On Tue, May 29, 2012 at 04:21:12PM +, Matt Hamilton wrote:
> I will happily supply what I can. Just let me know how.

Hello,

I've never used bgpd personally, but perhaps I can help you get a
backtrace. There are possibly two ways to get a backtrace.

1. Make bgpd dump core

Recompile bgpd with debugging symbols (CFLAGS+=-g, LDFLAGS+=-g) and
install that. Check the directory of the _bgpd user and make the
directory writeable for the _bgpd user. If a bgpd.core file pops up after
another crash, you've got it. You can test this by sending bgpd a
SIGABRT; if it doesn't dump core, something is wrong, see #2. You then
type 'gdb /usr/sbin/bgpd bgpd.core' and type backtrace within gdb. Type
quit to exit gdb. Keep the bgpd.core file around by saving it to another
location, as it will be overwritten with each subsequent segfault.

2. Attach gdb to the process and wait

Recompile bgpd with debugging symbols (CFLAGS+=-g, LDFLAGS+=-g) and
install that. su to root, start a tmux session, and from within tmux
attach to the bgpd process with "gdb /usr/sbin/bgpd <pid>". Once you're
attached, bgpd will stop running temporarily; just type "continue" (make
sure you don't set any breakpoints). You can now wait until bgpd crashes
on signal 11. gdb will then break back to the debugger command line and
you can type backtrace within gdb. Type quit to exit gdb. To get back to
it after the crash, reattach to the tmux session with "tmux att -d" and
you will have the gdb command line before you.

Even better than just a backtrace is going up and down the stack to see
where the program crashed. Google for gdb commands.

3. Ask someone else who may have better ideas.

> Although as you said in another post
> it is hard to replicate. All I seem to be able to see is that this happens
> during some period of network instability. It seems that there is a
> ripple effect that something happens and that then causes a bgpd
> process to die which then propagates more changes to iBGP peers
> and they then sometimes die as well.
>
> -Matt

Cheers,

-peter
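For the core-file route (method 1), the interactive gdb steps could be scripted roughly like this. It is a sketch only: the function name backtrace_core and the GDB override variable are made up for illustration, and it relies on gdb's -batch and -x options to run a command file non-interactively.

```shell
#!/bin/sh
# backtrace_core PROGRAM COREFILE (hypothetical helper): run gdb in batch
# mode against a core file and print a backtrace. Set GDB to use a different
# debugger binary (an illustrative knob, not something from this thread).
backtrace_core() {
    prog=$1
    core=$2
    cmds=$(mktemp) || return 1          # temporary gdb command file
    printf 'backtrace\n' > "$cmds"
    "${GDB:-gdb}" -batch -x "$cmds" "$prog" "$core"
    rc=$?
    rm -f "$cmds"
    return $rc
}
```

A call such as "backtrace_core /usr/sbin/bgpd bgpd.core > bgpd-backtrace.txt" would then leave something that can be attached to a bug report. The backtrace is only useful if bgpd was built with debugging symbols as described above.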
Re: More bgpd problems
Henning Brauer writes:
> > OpenBSD 5.1/amd64:
> > May 29 05:55:09 fw1 bgpd[21316]: Lost child: route decision engine
> > terminated; signal 11
>
> now that is bad. sig11 = segfault, Must Not Happen (tm).
> can you get us a backtrace? stuart, can we document the steps to do so
> somewhere we can point people to?

I will happily supply what I can. Just let me know how. Although as you
said in another post it is hard to replicate. All I seem to be able to
see is that this happens during some period of network instability. It
seems that there is a ripple effect: something happens that causes a
bgpd process to die, which then propagates more changes to iBGP peers,
and they then sometimes die as well.

-Matt
Re: More bgpd problems
Otto Moerbeek writes:
> According to your previous message, you are getting a different
> behaviour on the 5.1 box. A segfault is not the same as running out of mem.

I agree. It seems strangely coincidental though that bgpd on both
versions of OpenBSD died within minutes of each other.

> As for the quitting problem: if a fatal error occurs, you don't have
> any other choice than to quit. A fatal error means the process cannot
> be trusted any more. This is unsatisfactory, but the only way.

True.

> > to believe that this behaviour would have been tolerated by people
> > running bgpd in production all the way from the time of 4.3 to now.
> > Which leads to the only conclusion... I'm doing something stupid.
> > The question is what. I have ospfd and bgpd running. On the 5.1 box
> > there is also a CARP interface too (not an interface we are using ospfd on).
> >
> > -Matt
>
> There have been earlier reports of bgpd running out of mem or getting
> segfaults. In some cases that led to fixing bugs. There might remain
> unsolved cases.
>
> Working with the developers is one way of getting problems resolved.
> Ranting about "I cannot believe this is happening" is not a
> constructive way to get closer to the solution.

Sorry if you misunderstood what I wrote. I was not ranting; I was
pointing out that, as I can't believe it would be tolerated, it means I
must be doing something stupid, or different, or wrong.

-Matt
Re: More bgpd problems
On 29/05/2012, at 6:08 PM, Matt Hamilton wrote:
> Stuart Henderson writes:
>
>> cron job to restart it, with a random delay to avoid two machines
>> coming back up at the same time when all the routers at a site
>> fail together...
>
> So you just check it every minute to see if it is alive?
>
> It seems to me to be a pretty fundamental design flaw in the software given
> its role. I would expect it to return sending a packet or something, not
> just exit.

I run it on five routers in production, balancing a couple of Internet
links and a connection to a peering point. ospfd and ospf6d handle the
internal routing.

I don't have a cron job to restart it because I wasn't aware this was
necessary; it's been running for a year now with no issues. There are,
however, a few redundant paths, so if we did lose a router it wouldn't
cause too many problems.

Installations are a mix of 5.0 and 4.7, IIRC. Hardware is Dell R610s and
R415s, plus an embedded Soekris board (at the peering point).

Cheers,

Patrick
Re: More bgpd problems
On Tue, May 29, 2012 at 10:06:37AM +, Matt Hamilton wrote: > Otto Moerbeek drijf.net> writes: > > > > > On Tue, May 29, 2012 at 08:57:54AM +, Matt Hamilton wrote: > > > > > Hi all, > > > > > > More bgpd problems last night :( This happened last night on two of our > > > routers. One running an old version of OpenBSD (4.3) and one running > > > 5.1. Is there anyone out there actually using bpgd in production? How > > > do you deal with it quitting everytime something unexpected happens on > > > the network? > > > > Yes, lots of people run it in production. > > That is what I'd expect. I just don't understand how with it keep dropping > out when it has some transient problem. > > > > > > > The first message below seems to indicate unable to allocate > > > memory. I'm running these boxes pretty much stock having not tuned any > > > parameters at all. Both are just running routing daemons (bgpd, ospf) > > > and the 4.3 box is running OpenVPN. There are no applications running > > > and both boxes have plenty of RAM (4GB) and not using any swap or > > > anything. > > > > > > Is there something I should look at tuning in terms > > > of memory allocation in order to stop this happening? > > > > > > OpenBSD 4.3/amd64: > > > > > > May 29 05:53:43 firewall1 bgpd[5090]: imsg_create: buf_open: Cannot > > > allocate memory > > > May 29 05:53:43 firewall1 bgpd[5090]: fatal in RDE: imsg_compose > > > error: Cannot allocate memory > > > May 29 05:53:44 firewall1 bgpd[27053]: Lost child: route decision > > > engine exited > > > May 29 05:53:44 firewall1 bgpd[15204]: fatal in SE: pipe write error: > > > Broken pipe > > > > Only solution: upgrading. You are runing unsupported software, a > > foolish thing to do. > > Alas we don't all live in Utopia ;) This box is due to be upgraded soon, > but that upgrade is predicated on getting a stable routing environment > so that I can do so. At the moment we are mid-way through migrating > away from Cisco kit to OpenBSD routers. 
> Until I can be confident that it
> won't all just fall over I can't continue with the migration.
>
> So any insight on why I would be getting the same symptoms on the 5.1
> box? And was getting bgpd dying before under 5.0? I'm finding it hard

According to your previous message, you are getting a different
behaviour on the 5.1 box. A segfault is not the same as running out of mem.

As for the quitting problem: if a fatal error occurs, you don't have
any other choice than to quit. A fatal error means the process cannot
be trusted any more. This is unsatisfactory, but the only way.

> to believe that this behaviour would have been tolerated by people
> running bgpd in production all the way from the time of 4.3 to now.
> Which leads to the only conclusion... I'm doing something stupid.
> The question is what. I have ospfd and bgpd running. On the 5.1 box
> there is also a CARP interface too (not an interface we are using ospfd on).
>
> -Matt

There have been earlier reports of bgpd running out of mem or getting
segfaults. In some cases that led to fixing bugs. There might remain
unsolved cases.

Working with the developers is one way of getting problems resolved.
Ranting about "I cannot believe this is happening" is not a
constructive way to get closer to the solution.

-Otto
Re: More bgpd problems
On Tue, May 29, 2012 at 08:57:54AM +, Matt Hamilton wrote:
> Hi all,
>
> More bgpd problems last night :( This happened last night on two of our
> routers. One running an old version of OpenBSD (4.3) and one running
> 5.1. Is there anyone out there actually using bgpd in production? How

Yes. For the record I run it on OpenBSD 4.4; IPv6 traffic only.
While there have been some quirks over the years, I've never seen it
quit.

--
Garry Dolley
ARP Networks, Inc. | http://www.arpnetworks.com | (818) 206-0181
Data center, VPS, and IP Transit solutions
Member Los Angeles County REACT, Unit 336 | WQGK336
Blog http://scie.nti.st
Re: More bgpd problems
* Matt Hamilton [2012-05-29 12:02]:
> Stuart Henderson writes:
> > cron job to restart it, with a random delay to avoid two machines
> > coming back up at the same time when all the routers at a site
> > fail together...
> So you just check it every minute to see if it is alive?
>
> It seems to me to be a pretty fundamental design flaw in the software given
> its role. I would expect it to return sending a packet or something, not
> just exit.

it doesn't exit under normal circumstances.

bgpd is used in a lot of places, some extremely large ones too. you'd
be surprised. and no, they don't deal with "bgpd exiting constantly" or
however you called it, not at all.

> > > The first message below seems to indicate unable to allocate
> > > memory. I'm running these boxes pretty much stock having not tuned any
> > > parameters at all. Both are just running routing daemons (bgpd, ospf)
> > > and the 4.3 box is running OpenVPN. There are no applications running
> > > and both boxes have plenty of RAM (4GB) and not using any swap or
> > > anything.
> > >
> > > Is there something I should look at tuning in terms
> > > of memory allocation in order to stop this happening?
> >
> > Make sure login.conf memory limits for the daemon class (or the
> > _bgpd class on a newer OS version using /etc/rc.d) are high enough.
> > If your limits are insufficient for the size of routing table then
> > obviously you will have a problem. But also there is a bug
> > somewhere, possibly to do with nexthop changes, which can result
> > in very rapidly increasing memory use.

this bug is hard to trigger and we have not been able to identify a
pattern here, except that it involves iBGP.

--
Henning Brauer, h...@bsws.de, henn...@openbsd.org
BS Web Services, http://bsws.de, Full-Service ISP
Secure Hosting, Mail and DNS Services. Dedicated Servers, Root to Fully Managed
Henning Brauer Consulting, http://henningbrauer.com/
Re: More bgpd problems
* Matt Hamilton [2012-05-29 10:59]:
> OpenBSD 4.3/amd64:
>
> May 29 05:53:43 firewall1 bgpd[5090]: imsg_create: buf_open: Cannot
> allocate memory

out of memory. others have said enuff about running 4.3.

> OpenBSD 5.1/amd64:
> May 29 05:55:09 fw1 bgpd[21316]: Lost child: route decision engine
> terminated; signal 11

now that is bad. sig11 = segfault, Must Not Happen (tm).
can you get us a backtrace? stuart, can we document the steps to do so
somewhere we can point people to?

--
Henning Brauer, h...@bsws.de, henn...@openbsd.org
BS Web Services, http://bsws.de, Full-Service ISP
Secure Hosting, Mail and DNS Services. Dedicated Servers, Root to Fully Managed
Henning Brauer Consulting, http://henningbrauer.com/
Re: More bgpd problems
On Tue, May 29, 2012 at 10:00:53AM +, Matt Hamilton wrote: > Stuart Henderson spacehopper.org> writes: > > > cron job to restart it, with a random delay to avoid two machines > > coming back up at the same time when all the routers at a site > > fail together... > > So you just check it every minute to see if it is alive? > > It seems to me to be a pretty fundamental design flaw in the software given > its role. I would expect it to return sending a packet or something, not > just exit. > > > > The first message below seems to indicate unable to allocate > > > memory. I'm running these boxes pretty much stock having not tuned any > > > parameters at all. Both are just running routing daemons (bgpd, ospf) > > > and the 4.3 box is running OpenVPN. There are no applications running > > > and both boxes have plenty of RAM (4GB) and not using any swap or > > > anything. > > > > > > Is there something I should look at tuning in terms > > > of memory allocation in order to stop this happening? > > > > Make sure login.conf memory limits for the daemon class (or the > > _bgpd class on a newer OS version using /etc/rc.d) are high enough. > > If your limits are insufficient for the size of routing table then > > obviously you will have a problem. But also there is a bug > > somewhere, possibly to do with nexthop changes, which can result > > in very rapidly increasing memory use. > > Currently my routing table is pretty small. Only something like 150 > routes. This will increase once we start taking full feeds. At the moment > we only have a few partial feeds from networks we peer with and everything > else goes out a default route. > > I don't think it is a memory issue with the process itself, but the error > message seems to be more related to memory available to send the packet. > This is why I'm wondering if there is some sysctl or similar somewhere > I should be tweaking. > > -Matt the 4.x error and the 5.1 error are unrelated. 
Your first task should be to upgrade the 4.x machine. -Otto
Re: More bgpd problems
Otto Moerbeek writes:
>
> On Tue, May 29, 2012 at 08:57:54AM +, Matt Hamilton wrote:
>
> > Hi all,
> >
> > More bgpd problems last night :( This happened last night on two of our
> > routers. One running an old version of OpenBSD (4.3) and one running
> > 5.1. Is there anyone out there actually using bgpd in production? How
> > do you deal with it quitting everytime something unexpected happens on
> > the network?
>
> Yes, lots of people run it in production.

That is what I'd expect. I just don't understand how, given that it
keeps dropping out when it has some transient problem.

> >
> > The first message below seems to indicate unable to allocate
> > memory. I'm running these boxes pretty much stock having not tuned any
> > parameters at all. Both are just running routing daemons (bgpd, ospf)
> > and the 4.3 box is running OpenVPN. There are no applications running
> > and both boxes have plenty of RAM (4GB) and not using any swap or
> > anything.
> >
> > Is there something I should look at tuning in terms
> > of memory allocation in order to stop this happening?
> >
> > OpenBSD 4.3/amd64:
> >
> > May 29 05:53:43 firewall1 bgpd[5090]: imsg_create: buf_open: Cannot
> > allocate memory
> > May 29 05:53:43 firewall1 bgpd[5090]: fatal in RDE: imsg_compose
> > error: Cannot allocate memory
> > May 29 05:53:44 firewall1 bgpd[27053]: Lost child: route decision
> > engine exited
> > May 29 05:53:44 firewall1 bgpd[15204]: fatal in SE: pipe write error:
> > Broken pipe
>
> Only solution: upgrading. You are running unsupported software, a
> foolish thing to do.

Alas we don't all live in Utopia ;) This box is due to be upgraded soon,
but that upgrade is predicated on getting a stable routing environment
so that I can do so. At the moment we are mid-way through migrating
away from Cisco kit to OpenBSD routers. Until I can be confident that it
won't all just fall over I can't continue with the migration.

So any insight on why I would be getting the same symptoms on the 5.1
box? And why I was getting bgpd dying before under 5.0? I'm finding it
hard to believe that this behaviour would have been tolerated by people
running bgpd in production all the way from the time of 4.3 to now.
Which leads to the only conclusion... I'm doing something stupid.
The question is what. I have ospfd and bgpd running. On the 5.1 box
there is also a CARP interface (not an interface we are using ospfd on).

-Matt
Re: More bgpd problems
Stuart Henderson writes:
> cron job to restart it, with a random delay to avoid two machines
> coming back up at the same time when all the routers at a site
> fail together...

So you just check it every minute to see if it is alive?

It seems to me to be a pretty fundamental design flaw in the software
given its role. I would expect it to return sending a packet or
something, not just exit.

> > The first message below seems to indicate unable to allocate
> > memory. I'm running these boxes pretty much stock having not tuned any
> > parameters at all. Both are just running routing daemons (bgpd, ospf)
> > and the 4.3 box is running OpenVPN. There are no applications running
> > and both boxes have plenty of RAM (4GB) and not using any swap or
> > anything.
> >
> > Is there something I should look at tuning in terms
> > of memory allocation in order to stop this happening?
>
> Make sure login.conf memory limits for the daemon class (or the
> _bgpd class on a newer OS version using /etc/rc.d) are high enough.
> If your limits are insufficient for the size of routing table then
> obviously you will have a problem. But also there is a bug
> somewhere, possibly to do with nexthop changes, which can result
> in very rapidly increasing memory use.

Currently my routing table is pretty small. Only something like 150
routes. This will increase once we start taking full feeds. At the moment
we only have a few partial feeds from networks we peer with and everything
else goes out a default route.

I don't think it is a memory issue with the process itself, but the error
message seems to be more related to memory available to send the packet.
This is why I'm wondering if there is some sysctl or similar somewhere
I should be tweaking.

-Matt
Re: More bgpd problems
On Tue, May 29, 2012 at 08:57:54AM +, Matt Hamilton wrote:

> Hi all,
>
> More bgpd problems last night :( This happened last night on two of our
> routers. One running an old version of OpenBSD (4.3) and one running
> 5.1. Is there anyone out there actually using bgpd in production? How
> do you deal with it quitting everytime something unexpected happens on
> the network?

Yes, lots of people run it in production.

>
> The first message below seems to indicate unable to allocate
> memory. I'm running these boxes pretty much stock having not tuned any
> parameters at all. Both are just running routing daemons (bgpd, ospf)
> and the 4.3 box is running OpenVPN. There are no applications running
> and both boxes have plenty of RAM (4GB) and not using any swap or
> anything.
>
> Is there something I should look at tuning in terms
> of memory allocation in order to stop this happening?
>
> OpenBSD 4.3/amd64:
>
> May 29 05:53:43 firewall1 bgpd[5090]: imsg_create: buf_open: Cannot
> allocate memory
> May 29 05:53:43 firewall1 bgpd[5090]: fatal in RDE: imsg_compose
> error: Cannot allocate memory
> May 29 05:53:44 firewall1 bgpd[27053]: Lost child: route decision
> engine exited
> May 29 05:53:44 firewall1 bgpd[15204]: fatal in SE: pipe write error:
> Broken pipe

Only solution: upgrading. You are running unsupported software, a
foolish thing to do.

>
> OpenBSD 5.1/amd64:
>
> May 29 05:55:09 fw1 bgpd[21316]: Lost child: route decision engine
> terminated; signal 11
> May 29 05:55:09 fw1 bgpd[21459]: fatal in SE: pipe write error: Broken
> pipe

This is a real issue. I'll leave this one to people more experienced
with running bgpd.

-Otto
Re: More bgpd problems
On 2012-05-29, Matt Hamilton wrote:
> More bgpd problems last night :( This happened last night on two of our
> routers. One running an old version of OpenBSD (4.3) and one running
> 5.1. Is there anyone out there actually using bgpd in production?

Yes.

> How
> do you deal with it quitting everytime something unexpected happens on
> the network?

cron job to restart it, with a random delay to avoid two machines
coming back up at the same time when all the routers at a site
fail together...

> The first message below seems to indicate unable to allocate
> memory. I'm running these boxes pretty much stock having not tuned any
> parameters at all. Both are just running routing daemons (bgpd, ospf)
> and the 4.3 box is running OpenVPN. There are no applications running
> and both boxes have plenty of RAM (4GB) and not using any swap or
> anything.
>
> Is there something I should look at tuning in terms
> of memory allocation in order to stop this happening?

Make sure login.conf memory limits for the daemon class (or the
_bgpd class on a newer OS version using /etc/rc.d) are high enough.
If your limits are insufficient for the size of routing table then
obviously you will have a problem. But also there is a bug
somewhere, possibly to do with nexthop changes, which can result
in very rapidly increasing memory use.
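A restart job with a random delay, as described above, might look something like the following. This is not the script actually in use; it is a sketch with made-up names (random_delay, restart_if_dead, and the IS_RUNNING and START_CMD override variables), and it assumes pgrep is available and that bgpd is started as /usr/sbin/bgpd.

```shell
#!/bin/sh
# random_delay MAX (hypothetical helper): print a pseudo-random whole number
# of seconds in the range 0 to MAX-1.
random_delay() {
    # srand() with no argument seeds from the time of day and returns the
    # previous seed, so the second call returns that time-based seed. Adding
    # the shell's PID makes it unlikely that two routers running this in the
    # same second pick the same delay.
    awk -v max="$1" -v pid="$$" \
        'BEGIN { srand(); t = srand(); srand(t + pid); print int(rand() * max) }'
}

# restart_if_dead MAX (hypothetical helper): do nothing if bgpd is running;
# otherwise sleep a random time below MAX seconds and start it again.
# IS_RUNNING and START_CMD are illustrative knobs for overriding the check
# and the start command.
restart_if_dead() {
    if ${IS_RUNNING:-pgrep -x bgpd} >/dev/null 2>&1; then
        return 0
    fi
    sleep "$(random_delay "$1")"
    ${START_CMD:-/usr/sbin/bgpd}
}
```

With a final line such as "restart_if_dead 60" the script could be run from root's crontab at whatever interval suits. The random sleep is what keeps routers that failed together from all bringing their sessions back at the same moment.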