Re: More bgpd problems

2012-05-30 Thread James Shupe
On 05/30/2012 04:27 AM, Matt Hamilton wrote:
> James Shupe  hermetek.com> writes:
>
>> I've been running it to peer with 3 IPv4 peers and 3 IPv6 peers (full
>> views) and another partial IPv4 view with 12k routes (actually: varying
>> amounts of peers over the years, but that's the current setup) since 4.5
>> without needing any cron jobs to watch over it.
>
> It looks like the issue is likely to be bgpd's interaction with ospfd
> and/or CARP. I have CARP configured on two routers that act as gateways to one
> of our upstream providers. They speak OSPF and BGP to internal
> routers and routers that peer with other remote networks. So I think
> what happens is a CARP failover happens (they are quite regular for some
> reason, but it's never bothered me as it just works) and that causes
> OSPF to change its metrics, which in turn causes routing changes in BGP.
> It's this propagation of events that I think is causing issues.
>

We've always been running ospfd and, since 4.7 or 4.8 or so, ospf6d
(that's when it became usable for us), without issue. We also run CARP,
because these routers are installed in pairs and also act as default
gateways for machines directly behind them... so neither of those is
ruled out in our setup.

>> nrpe and ifstated run to verify the peers are up and react accordingly,
>> but they never trigger unless there is a physical or provider issue.
>> OpenBGPD has been rock solid for us.
>
> I'd be very interested to see your ifstated config and how you use
> that to verify peers being up as we could do with some better
> monitoring here.

I'll get something together when I'm at work later; I'm shooting this
email off real quick before I leave the house.

>
> -Matt
>
>
>

Thank you,
--
James Shupe




Re: More bgpd problems

2012-05-30 Thread Patrick Lamaiziere
On Wed, 30 May 2012 09:27:23 + (UTC),
Matt Hamilton  wrote:

Hello,

> I'd be very interested to see your ifstated config and how you use
> that to verify peers being up as we could do with some better
> monitoring here.

Here we use "bgpctl show summary terse" with a grep on the
peer name and "Established". Simple but it does the job.

# bgpctl show summary terse
RenaterV6 2200 Established
RenaterV4 2200 Established

(never see bgpd crashes)
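
A minimal sketch of that check, assuming the three-column "name AS state"
layout shown in the output above and a peer name without spaces. The
function name peer_established is made up for illustration and is not part
of bgpctl; it reads the bgpctl output on stdin so it can be tried without a
running bgpd.

```shell
#!/bin/sh
# peer_established NAME
# Reads "bgpctl show summary terse" output on stdin and succeeds only if
# the peer NAME is listed with state "Established".
# Assumes the "name AS state" column layout shown above; the function
# name is hypothetical, not part of bgpctl.
peer_established() {
    awk -v peer="$1" '
        $1 == peer && $3 == "Established" { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}
```

It could then be used from a monitoring script along the lines of:
bgpctl show summary terse | peer_established RenaterV4 || echo "RenaterV4 is down"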

Regards.



Re: More bgpd problems

2012-05-30 Thread Matt Hamilton
James Shupe  hermetek.com> writes:

> I've been running it to peer with 3 IPv4 peers and 3 IPv6 peers (full
> views) and another partial IPv4 view with 12k routes (actually: varying
> amounts of peers over the years, but that's the current setup) since 4.5
> without needing any cron jobs to watch over it.

It looks like the issue is likely to be bgpd's interaction with ospfd and/or
CARP. I have CARP configured on two routers that act as gateways to one
of our upstream providers. They speak OSPF and BGP to internal
routers and routers that peer with other remote networks. So I think
what happens is a CARP failover happens (they are quite regular for some
reason, but it's never bothered me as it just works) and that causes
OSPF to change its metrics, which in turn causes routing changes in BGP.
It's this propagation of events that I think is causing issues.
 
> nrpe and ifstated run to verify the peers are up and react accordingly,
> but they never trigger unless there is a physical or provider issue.
> OpenBGPD has been rock solid for us.

I'd be very interested to see your ifstated config and how you use
that to verify peers being up as we could do with some better
monitoring here.

-Matt



Re: More bgpd problems

2012-05-30 Thread Stuart Henderson
On 2012-05-29, Matt Hamilton  wrote:
> Otto Moerbeek  drijf.net> writes:
>
>> 
>> On Tue, May 29, 2012 at 08:57:54AM +, Matt Hamilton wrote:
>> 
>> > Hi all,
>> > 
>> > More bgpd problems last night :( This happened last night on two of our
>> > routers. One running an old version of OpenBSD (4.3) and one running
>> > 5.1. Is there anyone out there actually using bgpd in production? How
>> > do you deal with it quitting every time something unexpected happens on
>> > the network?
>> 
>> Yes, lots of people run it in production. 
>
> That is what I'd expect. I just don't understand how, with it dropping
> out whenever it hits some transient problem.
>
>> > 
>> > The first message below seems to indicate unable to allocate
>> > memory. I'm running these boxes pretty much stock having not tuned any
>> > parameters at all. Both are just running routing daemons (bgpd, ospf)
>> > and the 4.3 box is running OpenVPN. There are no applications running
>> > and both boxes have plenty of RAM (4GB) and not using any swap or
>> > anything.
>> > 
>> > Is there something I should look at tuning in terms
>> > of memory allocation in order to stop this happening?
>> > 
>> > OpenBSD 4.3/amd64:
>> > 
>> > May 29 05:53:43 firewall1 bgpd[5090]: imsg_create: buf_open: Cannot
>> > allocate memory
>> > May 29 05:53:43 firewall1 bgpd[5090]: fatal in RDE: imsg_compose
>> > error: Cannot allocate memory
>> > May 29 05:53:44 firewall1 bgpd[27053]: Lost child: route decision
>> > engine exited
>> > May 29 05:53:44 firewall1 bgpd[15204]: fatal in SE: pipe write error:
>> > Broken pipe
>> 
>> Only solution: upgrading. You are running unsupported software, a
>> foolish thing to do.
>
> Alas we don't all live in Utopia ;) This box is due to be upgraded soon, 
> but that upgrade is predicated on getting a stable routing environment
> so that I can do so. At the moment we are mid-way through migrating
> away from Cisco kit to OpenBSD routers. Until I can be confident that it
> won't all just fall over I can't continue with the migration.

I would *not* want to be running ospfd from before 5.1 on a DFZ
router. First, RTM_DESYNC (route socket overflows) were not dealt with
at all in ospfd until 4.8, and from then until 5.1 they tended to
result in lots of kernel route table dumps in quick succession to
get back into sync, which is pretty hard on the machine. In 5.1
a holdoff timer was introduced for these resyncs. bgpd-wise, since
4.3 there have been fixes for crashes triggered by bad updates (these
affected most BGP implementations, not just OpenBSD) and numerous
other fixes. If you are upgrading from that version, use bsd.rd
to upgrade rather than untarring sets on the live system, and read
the upgrade notes for the intermediate versions; I think that time
period includes slightly incompatible changes to bgpd.conf.

> So any insight on why I would be getting the same symptoms on the 5.1
> box? And was getting bgpd dying before under 5.0? I'm finding it hard
> to believe that this behaviour would have been tolerated by people 
> running bgpd in production all the way from the time of 4.3 to now.
> Which leads to the only conclusion... I'm doing something stupid.
> The question is what. I have ospfd and bgpd running. On the 5.1 box
> there is also a CARP interface too (not an interface we are using ospfd on).
>
> -Matt
>
>

Not sure when I started seeing it as I had various other problems
on the network and with hardware back in the 4.3 days (what's that,
4 years ago or so?) 
 

Some people don't seem to hit it at all. One of the most common
uses of OpenBGPD is running as a route server with mostly LAN-based
connections, and I suspect this type of setup is less likely to hit
this problem. I usually only hit it on routers connected via WAN
links (redundant paths with OSPF which flap on occasion). I usually
hit the memory problem a few times in fairly quick succession,
then not again for sometimes as much as a couple of months or
even longer.

Without having had a way to trigger it in the lab, and in my case not
much storage on the routers to save dumps, getting more information to
help track it down is challenging. And of course I am reliant on
out-of-band access and need to get the network back up at that
point, and am often not fully awake having been woken by a text from
icinga, so debug opportunities are very limited.

If you're better placed to get some debug information: from what
we've worked out more recently, I would suggest flapping the OSPF links
as a possible way to trigger it.



Re: More bgpd problems

2012-05-29 Thread Matt Hamilton
Philip Guenther  gmail.com> writes:

> Roger.  To paraphrase: in order for such a process to be able to dump
> core, do the following:
> 
> Create /var/empty/var/crash/ and chown it to the user that the
> [chroot'ed priv-sep'ed process] runs as, then set the
> kern.nosuidcoredump sysctl to 2.

OK, great. I've done that on all 7 boxes:

4 x OpenBSD 5.1/amd64
2 x OpenBSD 5.0/i386
1 x OpenBSD 4.3/amd64

and tested it with SIGABRT and I get a core file. So now just to sit and 
wait until it happens again.

Thanks!

-Matt



Re: More bgpd problems

2012-05-29 Thread Jiri B
On Tue, May 29, 2012 at 09:25:16PM +0200, Peter J. Philipp wrote:
> Recompile the bgpd with debugging symbols (CFLAGS+=-g, LDFLAGS+=-g).  And
> install that.

I thought -current was compiled with debug symbols, isn't it?

jirib



Re: More bgpd problems

2012-05-29 Thread James Shupe
On 05/29/2012 05:41 AM, Garry Dolley wrote:
> On Tue, May 29, 2012 at 08:57:54AM +, Matt Hamilton wrote:
>> Hi all,
>>
>> More bgpd problems last night :( This happened last night on two of our
>> routers. One running an old version of OpenBSD (4.3) and one running
>> 5.1. Is there anyone out there actually using bgpd in production? How
>
> Yes.  For the record I run it on OpenBSD 4.4; IPv6 traffic only.
> While there have been some quirks over the years, I've never seen it
> quit.
>

I've been running it to peer with 3 IPv4 peers and 3 IPv6 peers (full
views) and another partial IPv4 view with 12k routes (actually: varying
amounts of peers over the years, but that's the current setup) since 4.5
without needing any cron jobs to watch over it.

nrpe and ifstated run to verify the peers are up and react accordingly,
but they never trigger unless there is a physical or provider issue.
OpenBGPD has been rock solid for us.

--
James Shupe




Re: More bgpd problems

2012-05-29 Thread Philip Guenther
On Tue, May 29, 2012 at 12:30 PM, Henning Brauer  wrote:
> * Peter J. Philipp  [2012-05-29 21:26]:
>> 1. Make BGPD dump core
>
> it doesn't work that way due to bgpd dropping privs and chrooting.
> the way involves setting kern.nosuidcoredump to 2, but since we have
> all that already written down in an email to a non-public list, it'll
> be easiest to make that available.

Roger.  To paraphrase: in order for such a process to be able to dump
core, do the following:

Create /var/empty/var/crash/ and chown it to the user that the
[chroot'ed priv-sep'ed process] runs as, then set the
kern.nosuidcoredump sysctl to 2.
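
Those steps could be scripted roughly as follows. This is only a sketch,
assuming the daemon runs as the _bgpd user and chroots to /var/empty (which
the path above implies); the helper name coredump_setup_cmds is
hypothetical. It prints the commands rather than running them, so they can
be reviewed before being executed as root.

```shell
#!/bin/sh
# coredump_setup_cmds USER CHROOT_DIR
# Prints (does not run) the commands needed to let a chroot'ed,
# privilege-separated daemon dump core: a crash directory inside the
# chroot owned by the daemon user, plus the kern.nosuidcoredump sysctl.
# Hypothetical helper; USER and CHROOT_DIR are supplied by the caller.
coredump_setup_cmds() {
    user=$1
    chroot_dir=$2
    crash_dir="$chroot_dir/var/crash"
    printf 'mkdir -p %s\n' "$crash_dir"
    printf 'chown %s %s\n' "$user" "$crash_dir"
    printf 'sysctl kern.nosuidcoredump=2\n'
}
```

For bgpd that would be something like
"coredump_setup_cmds _bgpd /var/empty", with the output piped to sh as root
once it looks right.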


Philip Guenther



Re: More bgpd problems

2012-05-29 Thread Henning Brauer
* Peter J. Philipp  [2012-05-29 21:26]:
> 1. Make BGPD dump core

it doesn't work that way due to bgpd dropping privs and chrooting.
the way involves setting kern.nosuidcoredump to 2, but since we have
all that already written down in an email to a non-public list, it'll
be easiest to make that available.

-- 
Henning Brauer, h...@bsws.de, henn...@openbsd.org
BS Web Services, http://bsws.de, Full-Service ISP
Secure Hosting, Mail and DNS Services. Dedicated Servers, Root to Fully Managed
Henning Brauer Consulting, http://henningbrauer.com/



Re: More bgpd problems

2012-05-29 Thread Peter J. Philipp
On Tue, May 29, 2012 at 04:21:12PM +, Matt Hamilton wrote:
> I will happily supply what I can. Just let me know how.

Hello, I've never used bgpd personally, but perhaps I can help you get a
backtrace.  There are quite possibly two ways to get a backtrace.

1. Make BGPD dump core

Recompile bgpd with debugging symbols (CFLAGS+=-g, LDFLAGS+=-g) and
install that.

Check the directory of the _bgpd user and make the directory writeable for
the _bgpd user.  If a bgpd.core file pops up after another crash, you've got it.

You can test this by sending bgpd a SIGABRT; if it doesn't dump core, something
is wrong, see #2.

You then type 'gdb /usr/sbin/bgpd bgpd.core' and type backtrace within gdb.
Type quit to exit gdb.  Keep the bgpd.core file around by saving it to another
location, as it will likely be overwritten by each subsequent segfault.

2. Attach gdb to the process and wait

Recompile bgpd with debugging symbols (CFLAGS+=-g, LDFLAGS+=-g) and
install that.

su to root, start a tmux session, and from within tmux attach to the bgpd
process with "gdb /usr/sbin/bgpd <pid>". Once you're attached, bgpd will cease
running temporarily; just type "continue" (make sure you don't set any
breakpoints).

You can now wait until bgpd crashes on signal 11.  gdb will break back to
the debugger command line and you can type backtrace within gdb.
Type quit to exit gdb.

When you get to it after the crash, you can attach to the tmux session with
"tmux att -d" and have the gdb command line before you.  Even better than
just a backtrace is going up and down the stack to see where the program
crashed.  Google for gdb commands.

3. Ask someone else who may have better ideas.
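
For option 1, the advice to move each core file out of the way can be
sketched as a small helper. The name save_core, the timestamped naming
scheme and the destination directory are assumptions for illustration;
nothing like this ships with bgpd.

```shell
#!/bin/sh
# save_core CORE DEST_DIR
# Copies CORE into DEST_DIR under a timestamped name so that a later
# crash cannot overwrite it, then prints the path of the copy.
# Hypothetical helper; the naming scheme is an assumption.
save_core() {
    core=$1
    dest_dir=$2
    # Nothing to do if there is no core file yet.
    [ -f "$core" ] || return 1
    mkdir -p "$dest_dir" || return 1
    stamp=$(date +%Y%m%d-%H%M%S)
    dest="$dest_dir/$(basename "$core").$stamp"
    cp -p "$core" "$dest" && printf '%s\n' "$dest"
}
```

The printed path is what would then be handed to gdb in place of bgpd.core,
as in 'gdb /usr/sbin/bgpd' followed by that path.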

> Although as you said in another post
> it is hard to replicate. All I seem to be able to see is that this happens
> during some period of network instability. It seems that there is a 
> ripple effect where something happens and that then causes a bgpd
> process to die which then propagates more changes to iBGP peers
> and they then sometimes die as well.
> 
> -Matt

Cheers,
-peter



Re: More bgpd problems

2012-05-29 Thread Matt Hamilton
Henning Brauer  bsws.de> writes:

> > OpenBSD 5.1/amd64:
> > May 29 05:55:09 fw1 bgpd[21316]: Lost child: route decision engine
> > terminated; signal 11
> 
> now that is bad. sig11 = segfault, Must Not Happen (tm).
> can you get us a backtrace? stuart, can we document the steps to do so
> somewhere we can point people to?

I will happily supply what I can. Just let me know how.
Although as you said in another post
it is hard to replicate. All I seem to be able to see is that this happens
during some period of network instability. It seems that there is a 
ripple effect where something happens and that then causes a bgpd
process to die which then propagates more changes to iBGP peers
and they then sometimes die as well.

-Matt



Re: More bgpd problems

2012-05-29 Thread Matt Hamilton
Otto Moerbeek  drijf.net> writes:

> According to your previous message, you are getting a different
> behaviour on the 5.1 box. A segfault is not the same as running out of mem.

I agree. It seems strangely coincidental though that bgpd on both versions
of OpenBSD died within minutes of each other.
 
> As for the quitting problem: if a fatal error occurs, you don't have
> any other choice than to quit. A fatal error means the process cannot
> be trusted any more. This is unsatisfactory, but the only way. 

true.

> > to believe that this behaviour would have been tolerated by people 
> > running bgpd in production all the way from the time of 4.3 to now.
> > Which leads to the only conclusion... I'm doing something stupid.
> > The question is what. I have ospfd and bgpd running. On the 5.1 box
> > there is also a CARP interface too (not an interface we are using ospfd on).
> > 
> > -Matt
> 
> There have been earlier reports of bgpd running out of mem or getting
> segfaults. In some cases that led to fixing bugs. There might remain
> unsolved cases. 
> 
> Working with the developers is one way of getting problems resolved.
> Ranting about "I cannot believe this is happening" is not a
> constructive way to get closer to the solution. 

Sorry if you misunderstood what I wrote. I was not ranting; I was pointing
out that since I can't believe it would be tolerated, it means I must be
doing something stupid, or different, or wrong.

-Matt



Re: More bgpd problems

2012-05-29 Thread Patrick Coleman
On 29/05/2012, at 6:08 PM, Matt Hamilton  wrote:

> Stuart Henderson  spacehopper.org> writes:
>
>> cron job to restart it, with a random delay to avoid two machines
>> coming back up at the same time when all the routers at a site
>> fail together...
>
> So you just check it every minute to see if it is alive?
>
> It seems to me to be a pretty fundamental design flaw in the software given
> its role. I would expect it to return sending a packet or something, not
> just exit.

I run it on five routers in production, balancing a couple of Internet
links and a connection to a peering point. ospfd and ospf6d handle the
internal routing. I don't have a cron job to restart it because I
wasn't aware this was necessary - it's been running for a year now with
no issues. There are however a few redundant paths, so if we did lose
a router it wouldn't cause too many problems.

Installations are a mix of 5.0 and 4.7, IIRC. Hardware is Dell R610s
and R415s, plus an embedded Soekris board (at the peering point).

Cheers,

Patrick



Re: More bgpd problems

2012-05-29 Thread Otto Moerbeek
On Tue, May 29, 2012 at 10:06:37AM +, Matt Hamilton wrote:

> Otto Moerbeek  drijf.net> writes:
> 
> > 
> > On Tue, May 29, 2012 at 08:57:54AM +, Matt Hamilton wrote:
> > 
> > > Hi all,
> > > 
> > > More bgpd problems last night :( This happened last night on two of our
> > > routers. One running an old version of OpenBSD (4.3) and one running
> > > 5.1. Is there anyone out there actually using bgpd in production? How
> > > do you deal with it quitting every time something unexpected happens on
> > > the network?
> > 
> > Yes, lots of people run it in production. 
> 
> > That is what I'd expect. I just don't understand how, with it dropping
> > out whenever it hits some transient problem.
> 
> > > 
> > > The first message below seems to indicate unable to allocate
> > > memory. I'm running these boxes pretty much stock having not tuned any
> > > parameters at all. Both are just running routing daemons (bgpd, ospf)
> > > and the 4.3 box is running OpenVPN. There are no applications running
> > > and both boxes have plenty of RAM (4GB) and not using any swap or
> > > anything.
> > > 
> > > Is there something I should look at tuning in terms
> > > of memory allocation in order to stop this happening?
> > > 
> > > OpenBSD 4.3/amd64:
> > > 
> > > May 29 05:53:43 firewall1 bgpd[5090]: imsg_create: buf_open: Cannot
> > > allocate memory
> > > May 29 05:53:43 firewall1 bgpd[5090]: fatal in RDE: imsg_compose
> > > error: Cannot allocate memory
> > > May 29 05:53:44 firewall1 bgpd[27053]: Lost child: route decision
> > > engine exited
> > > May 29 05:53:44 firewall1 bgpd[15204]: fatal in SE: pipe write error:
> > > Broken pipe
> > 
> > > Only solution: upgrading. You are running unsupported software, a
> > foolish thing to do.
> 
> Alas we don't all live in Utopia ;) This box is due to be upgraded soon, 
> but that upgrade is predicated on getting a stable routing environment
> so that I can do so. At the moment we are mid-way through migrating
> away from Cisco kit to OpenBSD routers. Until I can be confident that it
> won't all just fall over I can't continue with the migration.
> 
> So any insight on why I would be getting the same symptoms on the 5.1
> box? And was getting bgpd dying before under 5.0? I'm finding it hard

According to your previous message, you are getting a different
behaviour on the 5.1 box. A segfault is not the same as running out of mem.

As for the quitting problem: if a fatal error occurs, you don't have
any other choice than to quit. A fatal error means the process cannot
be trusted any more. This is unsatisfactory, but the only way.


> to believe that this behaviour would have been tolerated by people 
> running bgpd in production all the way from the time of 4.3 to now.
> Which leads to the only conclusion... I'm doing something stupid.
> The question is what. I have ospfd and bgpd running. On the 5.1 box
> there is also a CARP interface too (not an interface we are using ospfd on).
> 
> -Matt

There have been earlier reports of bgpd running out of mem or getting
segfaults. In some cases that led to fixing bugs. There might remain
unsolved cases. 

Working with the developers is one way of getting problems resolved.
Ranting about "I cannot believe this is happening" is not a
constructive way to get closer to the solution. 

-Otto



Re: More bgpd problems

2012-05-29 Thread Garry Dolley
On Tue, May 29, 2012 at 08:57:54AM +, Matt Hamilton wrote:
> Hi all,
> 
> More bgpd problems last night :( This happened last night on two of our
> routers. One running an old version of OpenBSD (4.3) and one running
> 5.1. Is there anyone out there actually using bgpd in production? How

Yes.  For the record I run it on OpenBSD 4.4; IPv6 traffic only.
While there have been some quirks over the years, I've never seen it
quit.

-- 
Garry Dolley
ARP Networks, Inc. | http://www.arpnetworks.com | (818) 206-0181
Data center, VPS, and IP Transit solutions
Member Los Angeles County REACT, Unit 336 | WQGK336
Blog http://scie.nti.st



Re: More bgpd problems

2012-05-29 Thread Henning Brauer
* Matt Hamilton  [2012-05-29 12:02]:
> Stuart Henderson  spacehopper.org> writes:
> > cron job to restart it, with a random delay to avoid two machines
> > coming back up at the same time when all the routers at a site
> > fail together...
> So you just check it every minute to see if it is alive?
> 
> It seems to me to be a pretty fundamental design flaw in the software given 
> its role. I would expect it to return sending a packet or something, not 
> just exit.

it doesn't exit under normal circumstances.

bgpd is used in a lot of places, some extremely large ones too. you'd
be surprised. and no, they don't deal with "bgpd exiting constantly" or
however you called it, not at all.

> > > The first message below seems to indicate unable to allocate
> > > memory. I'm running these boxes pretty much stock having not tuned any
> > > parameters at all. Both are just running routing daemons (bgpd, ospf)
> > > and the 4.3 box is running OpenVPN. There are no applications running
> > > and both boxes have plenty of RAM (4GB) and not using any swap or
> > > anything.
> > >
> > > Is there something I should look at tuning in terms
> > > of memory allocation in order to stop this happening?
> > 
> > Make sure login.conf memory limits for the daemon class (or the
> > _bgpd class on a newer OS version using /etc/rc.d) are high enough.
> > If your limits are insufficient for the size of routing table then
> > obviously you will have a problem. But also there is a bug
> > somewhere, possibly to do with nexthop changes, which can result
> > in very rapidly increasing memory use.

this bug is hard to trigger and we have not been able to identify a
pattern here, except that it involves iBGP.

-- 
Henning Brauer, h...@bsws.de, henn...@openbsd.org
BS Web Services, http://bsws.de, Full-Service ISP
Secure Hosting, Mail and DNS Services. Dedicated Servers, Root to Fully Managed
Henning Brauer Consulting, http://henningbrauer.com/



Re: More bgpd problems

2012-05-29 Thread Henning Brauer
* Matt Hamilton  [2012-05-29 10:59]:
> OpenBSD 4.3/amd64:
> 
> May 29 05:53:43 firewall1 bgpd[5090]: imsg_create: buf_open: Cannot
> allocate memory

out of memory.

others have said enuff about running 4.3.

> OpenBSD 5.1/amd64:
> May 29 05:55:09 fw1 bgpd[21316]: Lost child: route decision engine
> terminated; signal 11

now that is bad. sig11 = segfault, Must Not Happen (tm).
can you get us a backtrace? stuart, can we document the steps to do so
somewhere we can point people to?

-- 
Henning Brauer, h...@bsws.de, henn...@openbsd.org
BS Web Services, http://bsws.de, Full-Service ISP
Secure Hosting, Mail and DNS Services. Dedicated Servers, Root to Fully Managed
Henning Brauer Consulting, http://henningbrauer.com/



Re: More bgpd problems

2012-05-29 Thread Otto Moerbeek
On Tue, May 29, 2012 at 10:00:53AM +, Matt Hamilton wrote:

> Stuart Henderson  spacehopper.org> writes:
> 
> > cron job to restart it, with a random delay to avoid two machines
> > coming back up at the same time when all the routers at a site
> > fail together...
> 
> So you just check it every minute to see if it is alive?
> 
> It seems to me to be a pretty fundamental design flaw in the software given 
> its role. I would expect it to return sending a packet or something, not 
> just exit.
>  
> > > The first message below seems to indicate unable to allocate
> > > memory. I'm running these boxes pretty much stock having not tuned any
> > > parameters at all. Both are just running routing daemons (bgpd, ospf)
> > > and the 4.3 box is running OpenVPN. There are no applications running
> > > and both boxes have plenty of RAM (4GB) and not using any swap or
> > > anything.
> > >
> > > Is there something I should look at tuning in terms
> > > of memory allocation in order to stop this happening?
> > 
> > Make sure login.conf memory limits for the daemon class (or the
> > _bgpd class on a newer OS version using /etc/rc.d) are high enough.
> > If your limits are insufficient for the size of routing table then
> > obviously you will have a problem. But also there is a bug
> > somewhere, possibly to do with nexthop changes, which can result
> > in very rapidly increasing memory use.
> 
> Currently my routing table is pretty small. Only something like 150 
> routes. This will increase once we start taking full feeds. At the moment 
> we only have a few partial feeds from networks we peer with and everything 
> else goes out a default route.
> 
> I don't think it is a memory issue with the process itself, but the error 
> message seems to be more related to memory available to send the packet. 
> This is why I'm wondering if there is some sysctl or similar somewhere 
> I should be tweaking.
> 
> -Matt

the 4.x error and the 5.1 error are unrelated. Your first task should
be to upgrade the 4.x machine.

-Otto



Re: More bgpd problems

2012-05-29 Thread Matt Hamilton
Otto Moerbeek  drijf.net> writes:

> 
> On Tue, May 29, 2012 at 08:57:54AM +, Matt Hamilton wrote:
> 
> > Hi all,
> > 
> > More bgpd problems last night :( This happened last night on two of our
> > routers. One running an old version of OpenBSD (4.3) and one running
> > 5.1. Is there anyone out there actually using bgpd in production? How
> > do you deal with it quitting every time something unexpected happens on
> > the network?
> 
> Yes, lots of people run it in production. 

That is what I'd expect. I just don't understand how, with it dropping
out whenever it hits some transient problem.

> > 
> > The first message below seems to indicate unable to allocate
> > memory. I'm running these boxes pretty much stock having not tuned any
> > parameters at all. Both are just running routing daemons (bgpd, ospf)
> > and the 4.3 box is running OpenVPN. There are no applications running
> > and both boxes have plenty of RAM (4GB) and not using any swap or
> > anything.
> > 
> > Is there something I should look at tuning in terms
> > of memory allocation in order to stop this happening?
> > 
> > OpenBSD 4.3/amd64:
> > 
> > May 29 05:53:43 firewall1 bgpd[5090]: imsg_create: buf_open: Cannot
> > allocate memory
> > May 29 05:53:43 firewall1 bgpd[5090]: fatal in RDE: imsg_compose
> > error: Cannot allocate memory
> > May 29 05:53:44 firewall1 bgpd[27053]: Lost child: route decision
> > engine exited
> > May 29 05:53:44 firewall1 bgpd[15204]: fatal in SE: pipe write error:
> > Broken pipe
> 
> Only solution: upgrading. You are running unsupported software, a
> foolish thing to do.

Alas we don't all live in Utopia ;) This box is due to be upgraded soon, 
but that upgrade is predicated on getting a stable routing environment
so that I can do so. At the moment we are mid-way through migrating
away from Cisco kit to OpenBSD routers. Until I can be confident that it
won't all just fall over I can't continue with the migration.

So any insight on why I would be getting the same symptoms on the 5.1
box? And was getting bgpd dying before under 5.0? I'm finding it hard
to believe that this behaviour would have been tolerated by people 
running bgpd in production all the way from the time of 4.3 to now.
Which leads to the only conclusion... I'm doing something stupid.
The question is what. I have ospfd and bgpd running. On the 5.1 box
there is also a CARP interface too (not an interface we are using ospfd on).

-Matt



Re: More bgpd problems

2012-05-29 Thread Matt Hamilton
Stuart Henderson  spacehopper.org> writes:

> cron job to restart it, with a random delay to avoid two machines
> coming back up at the same time when all the routers at a site
> fail together...

So you just check it every minute to see if it is alive?

It seems to me to be a pretty fundamental design flaw in the software given 
its role. I would expect it to return sending a packet or something, not 
just exit.
 
> > The first message below seems to indicate unable to allocate
> > memory. I'm running these boxes pretty much stock having not tuned any
> > parameters at all. Both are just running routing daemons (bgpd, ospf)
> > and the 4.3 box is running OpenVPN. There are no applications running
> > and both boxes have plenty of RAM (4GB) and not using any swap or
> > anything.
> >
> > Is there something I should look at tuning in terms
> > of memory allocation in order to stop this happening?
> 
> Make sure login.conf memory limits for the daemon class (or the
> _bgpd class on a newer OS version using /etc/rc.d) are high enough.
> If your limits are insufficient for the size of routing table then
> obviously you will have a problem. But also there is a bug
> somewhere, possibly to do with nexthop changes, which can result
> in very rapidly increasing memory use.

Currently my routing table is pretty small. Only something like 150 
routes. This will increase once we start taking full feeds. At the moment 
we only have a few partial feeds from networks we peer with and everything 
else goes out a default route.

I don't think it is a memory issue with the process itself, but the error 
message seems to be more related to memory available to send the packet. 
This is why I'm wondering if there is some sysctl or similar somewhere 
I should be tweaking.

-Matt



Re: More bgpd problems

2012-05-29 Thread Otto Moerbeek
On Tue, May 29, 2012 at 08:57:54AM +, Matt Hamilton wrote:

> Hi all,
> 
> More bgpd problems last night :( This happened last night on two of our
> routers. One running an old version of OpenBSD (4.3) and one running
> 5.1. Is there anyone out there actually using bgpd in production? How
> do you deal with it quitting every time something unexpected happens on
> the network?

Yes, lots of people run it in production. 

> 
> The first message below seems to indicate unable to allocate
> memory. I'm running these boxes pretty much stock having not tuned any
> parameters at all. Both are just running routing daemons (bgpd, ospf)
> and the 4.3 box is running OpenVPN. There are no applications running
> and both boxes have plenty of RAM (4GB) and not using any swap or
> anything.
> 
> Is there something I should look at tuning in terms
> of memory allocation in order to stop this happening?
> 
> OpenBSD 4.3/amd64:
> 
> May 29 05:53:43 firewall1 bgpd[5090]: imsg_create: buf_open: Cannot
> allocate memory
> May 29 05:53:43 firewall1 bgpd[5090]: fatal in RDE: imsg_compose
> error: Cannot allocate memory
> May 29 05:53:44 firewall1 bgpd[27053]: Lost child: route decision
> engine exited
> May 29 05:53:44 firewall1 bgpd[15204]: fatal in SE: pipe write error:
> Broken pipe

Only solution: upgrading. You are running unsupported software, a
foolish thing to do.

> 
> OpenBSD 5.1/amd64:
> 
> May 29 05:55:09 fw1 bgpd[21316]: Lost child: route decision engine
> terminated; signal 11
> May 29 05:55:09 fw1 bgpd[21459]: fatal in SE: pipe write error: Broken
> pipe

This is a real issue. I'll leave this one to people more experienced
running bgpd. 

-Otto



Re: More bgpd problems

2012-05-29 Thread Stuart Henderson
On 2012-05-29, Matt Hamilton  wrote:
> More bgpd problems last night :( This happened last night on two of our
> routers. One running an old version of OpenBSD (4.3) and one running
> 5.1. Is there anyone out there actually using bgpd in production?

Yes.

> How
> do you deal with it quitting every time something unexpected happens on
> the network?

cron job to restart it, with a random delay to avoid two machines
coming back up at the same time when all the routers at a site
fail together...
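
A rough sketch of that kind of cron job (illustrative only, not the script
actually in use). It assumes pgrep is available, that /dev/urandom exists,
and that running /usr/sbin/bgpd with no arguments is the right way to start
the daemon on the box in question; the function names are hypothetical.

```shell
#!/bin/sh
# random_delay MAX
# Prints a pseudo-random integer in the range 0..MAX-1, taken from
# /dev/urandom so that two routers checking in the same second are
# unlikely to pick the same value.
random_delay() {
    n=$(od -An -N2 -tu2 /dev/urandom | tr -d ' \n')
    echo $(( n % $1 ))
}

# restart_bgpd_if_dead MAX_DELAY
# If no bgpd process is running, wait a random number of seconds
# (below MAX_DELAY) and start it again.
# Assumption: /usr/sbin/bgpd with no arguments starts the daemon.
restart_bgpd_if_dead() {
    pgrep -x bgpd >/dev/null && return 0
    sleep "$(random_delay "$1")"
    # Re-check after the delay in case something else restarted it.
    pgrep -x bgpd >/dev/null || /usr/sbin/bgpd
}
```

Saved as a script that ends with a call such as "restart_bgpd_if_dead 45",
it could be run from root's crontab once a minute; keeping the maximum
delay below the cron interval avoids overlapping runs.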

> The first message below seems to indicate unable to allocate
> memory. I'm running these boxes pretty much stock having not tuned any
> parameters at all. Both are just running routing daemons (bgpd, ospf)
> and the 4.3 box is running OpenVPN. There are no applications running
> and both boxes have plenty of RAM (4GB) and not using any swap or
> anything.
>
> Is there something I should look at tuning in terms
> of memory allocation in order to stop this happening?

Make sure login.conf memory limits for the daemon class (or the
_bgpd class on a newer OS version using /etc/rc.d) are high enough.
If your limits are insufficient for the size of routing table then
obviously you will have a problem. But also there is a bug
somewhere, possibly to do with nexthop changes, which can result
in very rapidly increasing memory use.