Hi, I recently had an interesting problem with socket option buffers and their use by BIRD on Linux 3.6, which I hope someone can shed some light on.
Quite frequently I see the following in our logs when enabling/starting BGP sessions configured to use MD5 auth:

<snip>
Sep 24 23:12:46 rtr2 bird: sk_set_md5_auth_int: setsockopt: No such file or directory
</snip>

These never seem to cause any functional problems, but they seemed strange and possibly related to my new ongoing issue. ;)

## My functionality-impacting problem ##

Yesterday, after some upstream BGP peers had connectivity issues (Hold timer expired), all of my previously working BGP sessions (using MD5 auth) attempted to reconnect and gave me the following in the logs:

<snip>
Oct 9 18:02:08 rtr2 bird: plxhq: Error: Hold timer expired
Oct 9 18:02:08 rtr2 bird: plxhq: BGP session closed
Oct 9 18:02:08 rtr2 bird: plxhq: State changed to flush
Oct 9 18:02:08 rtr2 bird: plxhq: State changed to stop
Oct 9 18:02:08 rtr2 bird: sk_set_md5_auth_int: setsockopt: No such file or directory
Oct 9 18:02:08 rtr2 bird: plxhq: Down
Oct 9 18:02:08 rtr2 bird: plxhq: Starting
Oct 9 18:02:08 rtr2 bird: sk_set_md5_auth_int: setsockopt: Cannot allocate memory
</snip>

At that point the BGP session fails to establish/start, and every subsequently started BGP session (with MD5 auth) fails with the same message.

Looking through the BIRD code, it seems BIRD issues a setsockopt() call to install the MD5 parameters on the TCP socket, and after digging around in the Linux system it seems I was running out of socket option memory buffers (duh!). Thus I was able to "fix" this by issuing:

<snip>
echo 40960 > /proc/sys/net/core/optmem_max   # Defaults to 20480
</snip>

Is this expected? Any insight into how to properly size the socket option memory buffers used by BIRD? Is this some sort of socket buffer leak?
<snip>
bird> show memory
BIRD memory usage
 Routing tables:   307 MB
 Route attributes: 106 MB
 ROA tables:       192 B
 Protocols:        388 kB
 Total:            413 MB

$ uptime
 16:45:03 up 343 days, 1:23, 1 user, load average: 0.00, 0.03, 0.05
</snip>

I have multiple identical machines running the same OS/software/configuration, and so far only one of them has shown this behavior.

Thanks!

-Mike

--
Michael Vallaly <mvall...@nolatency.com>