Hi Pietro, On Thu, Aug 23, 2018 at 10:05:30AM +0200, Pietro Stäheli wrote: > Hi, > > openBGPd is running at an internet exchange, two openBSD route servers > (rs3 on openBSD 6.3 and rs4 on openBSD 6.2, both virtual machines on > different hypervisors in different locations) connect with peering > customers. > > We've experienced crashes in openBGPd twice in the past two weeks. Both > times with the same error message: "fatal in RDE: Uh, oh a politician in > the decision process". These error messages are logged on both route > servers right before they crash within seconds of each other. > > The route servers had been running quite reliably for a long time before > the recent incidents. The daemon can then be restarted without an issue. > CPU usage prior to the crash is minimal (<5%). > > In the minutes before the crash we're seeing error messages like the > following in daemon.log: > > bgpd[23099]: no such peer: id=4294967037 > > > Sample of logs just before the crash: > > Aug 22 15:38:58 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 91.206.52.170 > AS6939: update 81.163.124.0/24 via 91.206.52.170 > Aug 22 15:38:58 rs3 bgpd[23099]: no such peer: id=4294967037 > Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::11 > AS31424: withdraw 2a01:6a8::/32 > Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::bf > AS33891: withdraw 2a01:6a8::/32 > Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::aa > AS6939: withdraw 2804:364c::/33 > Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::aa > AS6939: withdraw 2804:364c:8000::/33 > Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::11 > AS31424: update 2a01:6a8::/32 via 2001:7f8:24::11 > Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::aa > AS6939: update 2804:364c::/33 via 2001:7f8:24::aa > Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::aa > AS6939: update 2804:364c:8000::/33 via 2001:7f8:24::aa > Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::bf > AS33891: update 2a01:6a8::/32 via 2001:7f8:24::bf > Aug 22 15:39:00 rs3 bgpd[23099]: Connection attempt from neighbor > 91.206.52.139 while session is in state Idle > Aug 22 15:39:01 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 91.206.52.96 > AS31042: update 185.64.172.0/24 via 91.206.52.96 > Aug 22 15:39:01 rs3 bgpd[43763]: fatal in RDE: Uh, oh a politician in > the decision process > Aug 22 15:39:01 rs3 bgpd[99961]: peer closed imsg connection > Aug 22 15:39:01 rs3 bgpd[99961]: main: Lost connection to RDE > Aug 22 15:39:01 rs3 bgpd[23099]: peer closed imsg connection > Aug 22 15:39:01 rs3 bgpd[23099]: SE: Lost connection to parent > > > Logs just before the "no such peer" messages appear: > > Aug 22 15:36:43 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 91.206.52.54 > AS34554: update 80.75.112.0/20 via 91.206.52.54 > Aug 22 15:36:43 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::36 > AS34554: update 2a01:6a8::/32 via 2001:7f8:24::36 > Aug 22 15:36:44 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::bf > AS33891: update 2a0d:8d80::/32 via 2001:7f8:24::bf > Aug 22 15:36:47 rs3 bgpd[23099]: neighbor 91.206.52.96: graceful restart > of IPv4 unicast, keeping routes > Aug 22 15:36:47 rs3 bgpd[23099]: neighbor 91.206.52.96: state change > Established -> Idle, reason: Connection closed > Aug 22 15:36:47 rs3 bgpd[23099]: neighbor 91.206.52.96: removed > Aug 22 15:36:49 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::11 > AS31424: withdraw 2a01:6a8::/32 > Aug 22 15:36:49 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::bf > AS33891: withdraw 2a01:6a8::/32 > Aug 22 15:36:49 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::11 > AS31424: update 2a01:6a8::/32 via 2001:7f8:24::11 > Aug 22 15:36:49 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::bf > AS33891: update 2a01:6a8::/32 via 2001:7f8:24::bf > Aug 22 15:36:54 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::bf > AS33891: update 2a0d:8d80::/32 via 2001:7f8:24::bf > Aug 22 15:36:55 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 91.206.52.170 > AS6939: update 197.249.160.0/19 via 91.206.52.170 > Aug 22 15:36:55 rs3 bgpd[23099]: no such peer: id=4294967037 > > > > I haven't found much about the error message apart from this mailing > list thread: https://www.mail-archive.com/misc@openbsd.org/msg04565.html > > The thread suggests that invoking bgpctl may cause the failure. 'bgpctl > show' is invoked every few minutes through our monitoring system to > check on the status of peer connections. > > Can anybody point me to a possible cause or troubleshooting regarding > this issue? Could a misconfigured/broken peer be the cause? Has anybody > dealt with a similar problem? > > I can provide bgpd.conf and full logs of both incidents if necessary.
I move this over to bugs@ which feels more appropriate to me. Chances that a bgpd developer sees it is higher. Somehow there where two identical paths and bgpd could not figure out which one to prefer. The 13 steps bgpd goes through to find the best are listed in the bgpd man page. Do you have a pcap file with the bgp traffic or an mrt dump? (see dump in man bgpd.conf). This could answer the question why you had two identical path. Showing your config and your regular bgpctl commands could also help with analyzing the problem. Remi