Bug#779238: lldpd: LLDPd daemon freezes
Hi again, comments inline.

On 02/27/2015 12:16 AM, Vincent Bernat wrote:
> ❦ 26 February 2015 21:30 +0200, Kiousis Alexandros:
>
> > This happens when there is a tap interface that is down from the VM
> > side. When lldpd can't send the packets to the interface, it writes
> > them to the buffer; that's where the log is from. After a while the
> > buffer space fills up, and when lldpd can't write anymore, it
> > blocks. According to what you said, this is expected in this version
> > and is the reason why the whole daemon becomes unresponsive.
> > When I rebooted the VM (and later when I manually brought up the
> > second interface), I saw with tcpdump on the host the packets going
> > through the tap interface.
> >
> > What may be a bug is that when the interface comes back, the daemon
> > dies. There is no log about it but I can see this:
>
> Don't you get a message in the kernel log about a segfault?

Nope, no log. But when I tested it again the daemon didn't crash.

> > So 20:07 is when the daemon resumed operation and (I guess
> > immediately) crashed. Puppet tries to bring it back up at 20:08 but
> > the daemon exits because lldpd.socket already exists. This sounds
> > like a bug but I don't know if it's worth looking into.
>
> That's already fixed in later versions. Now, the socket is tested to
> check if someone is listening and, if not, deleted.

I installed the new version from backports (0.7.11-2~bpo70+1) and it works fine now. I see the following log:

Mar 2 10:49:25 gnt5-03 lldpd[20916]: unable to send packet on real device for tap14: Resource temporarily unavailable

which means we are OK. I guess this ticket can be closed.

Thanks for all your help.

-- 
---
Αλέξανδρος Κιούσης - al...@noc.grnet.gr
GRNET NOC System Administrator - http://noc.grnet.gr
---

-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
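The stale-socket check described in this message (test whether someone is listening on the control socket and delete it if not) can be sketched as follows. This is only an illustration of the idea in Python, not lldpd's actual C code; the function name `remove_if_stale` is hypothetical.

```python
import os
import socket
import stat


def remove_if_stale(path):
    """Delete `path` if it is a Unix socket with no listener; return True if deleted."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return False  # nothing there, so nothing to clean up
    if not stat.S_ISSOCK(mode):
        return False  # never delete something that is not a socket file

    probe = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        probe.connect(path)
    except ConnectionRefusedError:
        # No process is accepting connections: the file was left behind
        # by a daemon that died without unlinking it.
        os.unlink(path)
        return True
    finally:
        probe.close()
    # connect() succeeded, so a live daemon owns the socket; leave it alone.
    return False
```

A daemon would call something like this on its control socket path just before `bind()`, so that a leftover file no longer produces "Address already in use" while a running instance is still protected.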
Bug#779238: lldpd: LLDPd daemon freezes
❦ 26 February 2015 21:30 +0200, Kiousis Alexandros:

> This happens when there is a tap interface that is down from the VM
> side. When lldpd can't send the packets to the interface, it writes
> them to the buffer; that's where the log is from. After a while the
> buffer space fills up, and when lldpd can't write anymore, it
> blocks. According to what you said, this is expected in this version
> and is the reason why the whole daemon becomes unresponsive.
> When I rebooted the VM (and later when I manually brought up the
> second interface), I saw with tcpdump on the host the packets going
> through the tap interface.
>
> What may be a bug is that when the interface comes back, the daemon
> dies. There is no log about it but I can see this:

Don't you get a message in the kernel log about a segfault?

> So 20:07 is when the daemon resumed operation and (I guess
> immediately) crashed. Puppet tries to bring it back up at 20:08 but
> the daemon exits because lldpd.socket already exists. This sounds
> like a bug but I don't know if it's worth looking into.

That's already fixed in later versions. Now, the socket is tested to check if someone is listening and, if not, deleted.

-- 
My only love sprung from my only hate!
Too early seen unknown, and known too late!
    -- William Shakespeare, "Romeo and Juliet"
Bug#779238: lldpd: LLDPd daemon freezes
So I guess we figured it out.

This happens when there is a tap interface that is down from the VM side. When lldpd can't send the packets to the interface, it writes them to the buffer; that's where the log is from. After a while the buffer space fills up, and when lldpd can't write anymore, it blocks. According to what you said, this is expected in this version and is the reason why the whole daemon becomes unresponsive.

When I rebooted the VM (and later when I manually brought up the second interface), I saw with tcpdump on the host the packets going through the tap interface.

What may be a bug is that when the interface comes back, the daemon dies. There is no log about it but I can see this:

Feb 26 20:07:22 gnt5-03 lldpd[17754]: lldp_decode: unknown org tlv received on eth0
Feb 26 20:08:02 gnt5-03 puppet-agent[3614]: (/Stage[main]/Lldp/Service[lldpd]/ensure) ensure changed 'stopped' to 'running'
Feb 26 20:08:02 gnt5-03 lldpd[15562]: asroot_ctl_create: [priv]: unable to create control socket: Address already in use
Feb 26 20:08:02 gnt5-03 lldpd[15563]: fatal: unable to create control socket /var/run/lldpd.socket

So 20:07 is when the daemon resumed operation and (I guess immediately) crashed. Puppet tries to bring it back up at 20:08 but the daemon exits because lldpd.socket already exists. This sounds like a bug but I don't know if it's worth looking into.

Thanks for all your help. I guess I need to look into the backported version since I have no way to ensure that the NICs are active on all the VMs.

On 02/26/2015 04:58 PM, Vincent Bernat wrote:
> ❦ 26 February 2015 14:32 +0200, Kiousis Alexandros:
>
> > lldpd 17754 _lldpd 32u pack 434712919 0t0 ALL type=SOCK_RAW
>
> So, it is blocking on the tap device. Maybe the VM on the other side
> has been rebooted. Could you check if there is anything odd with
> "ss --info --extended --memory -anp | grep 434712919"? It should
> display which interface is associated with file descriptor 32.
>
> Starting from lldpd 0.6, the whole event model has been rewritten and
> is using libevent. Moreover, the socket is made non-blocking.
> Therefore, inability to write won't block lldpd anymore if you use
> the version from backports.

-- 
---
Αλέξανδρος Κιούσης - al...@noc.grnet.gr
GRNET NOC System Administrator - http://noc.grnet.gr
---
Bug#779238: lldpd: LLDPd daemon freezes
❦ 26 February 2015 14:32 +0200, Kiousis Alexandros:

> lldpd 17754 _lldpd 32u pack 434712919 0t0 ALL type=SOCK_RAW

So, it is blocking on the tap device. Maybe the VM on the other side has been rebooted. Could you check if there is anything odd with "ss --info --extended --memory -anp | grep 434712919"? It should display which interface is associated with file descriptor 32.

Starting from lldpd 0.6, the whole event model has been rewritten and is using libevent. Moreover, the socket is made non-blocking. Therefore, inability to write won't block lldpd anymore if you use the version from backports.

-- 
No group of professionals meets except to conspire against the public at large.
    -- Mark Twain
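The effect of making the socket non-blocking can be illustrated with a small sketch. It is written in Python and uses a Unix datagram socket pair as a stand-in for the raw packet socket on the tap device, so it is an assumption-laden model rather than lldpd's libevent-based code; `send_or_drop` and `fill_until_full` are hypothetical names. The 195-byte frame size is taken from the `write(32, ..., 195` call in the strace output in this thread.

```python
import errno
import os
import socket

# errno values treated as "queue full, drop this frame and carry on".
# On Linux, EAGAIN reads "Resource temporarily unavailable" and
# ENOBUFS reads "No buffer space available".
TRANSIENT = {errno.EAGAIN, errno.EWOULDBLOCK, errno.ENOBUFS}


def send_or_drop(sock, frame):
    """Try to send one frame; return False instead of blocking when the queue is full."""
    try:
        sock.send(frame)
        return True
    except OSError as exc:
        if exc.errno not in TRANSIENT:
            raise  # a real error, not just a full queue
        print(f"unable to send packet: {os.strerror(exc.errno)}")
        return False


def fill_until_full(limit=100_000):
    """Send 195-byte frames to a peer that never reads; return how many were queued."""
    sender, idle_peer = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    sender.setblocking(False)  # without this, send() would hang once the queue fills
    queued = 0
    try:
        while queued < limit and send_or_drop(sender, bytes(195)):
            queued += 1
    finally:
        sender.close()
        idle_peer.close()
    return queued
```

With a blocking socket the loop in `fill_until_full` would stall on the first `send()` that finds the queue full, which matches the stuck `write(32, ...` seen with version 0.5.7. With the non-blocking socket the failed send is reported and the caller keeps running.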
Bug#779238: lldpd: LLDPd daemon freezes
This is the output:

# lsof -n -p 17752
COMMAND   PID USER   FD   TYPE         DEVICE SIZE/OFF      NODE NAME
lldpd   17752 root   cwd  DIR           254,0     4096         2 /
lldpd   17752 root   rtd  DIR           254,0     4096         2 /
lldpd   17752 root   txt  REG           254,0   142520   1910045 /usr/sbin/lldpd
lldpd   17752 root   mem  REG           254,0    47616   1049635 /lib/x86_64-linux-gnu/libnss_files-2.13.so
lldpd   17752 root   mem  REG           254,0    43560   1049647 /lib/x86_64-linux-gnu/libnss_nis-2.13.so
lldpd   17752 root   mem  REG           254,0    31584   1049648 /lib/x86_64-linux-gnu/libnss_compat-2.13.so
lldpd   17752 root   mem  REG           254,0    92752   1049671 /lib/x86_64-linux-gnu/libz.so.1.2.7
lldpd   17752 root   mem  REG           254,0    89056   1049639 /lib/x86_64-linux-gnu/libnsl-2.13.so
lldpd   17752 root   mem  REG           254,0  2048512   1982666 /usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
lldpd   17752 root   mem  REG           254,0    59712   1983102 /usr/lib/x86_64-linux-gnu/libsensors.so.4.3.2
lldpd   17752 root   mem  REG           254,0    35104   1049644 /lib/x86_64-linux-gnu/libcrypt-2.13.so
lldpd   17752 root   mem  REG           254,0   131107   1049651 /lib/x86_64-linux-gnu/libpthread-2.13.so
lldpd   17752 root   mem  REG           254,0   530736   1049638 /lib/x86_64-linux-gnu/libm-2.13.so
lldpd   17752 root   mem  REG           254,0    14768   1049642 /lib/x86_64-linux-gnu/libdl-2.13.so
lldpd   17752 root   mem  REG           254,0  1574680   1908792 /usr/lib/libperl.so.5.14.2
lldpd   17752 root   mem  REG           254,0    40656   1049689 /lib/x86_64-linux-gnu/libwrap.so.0.7.6
lldpd   17752 root   mem  REG           254,0  1603600   1049640 /lib/x86_64-linux-gnu/libc-2.13.so
lldpd   17752 root   mem  REG           254,0   649008   1909575 /usr/lib/libnetsnmp.so.15.1.2
lldpd   17752 root   mem  REG           254,0  1195544   1909592 /usr/lib/libnetsnmpmibs.so.15.1.2
lldpd   17752 root   mem  REG           254,0   151376   1909578 /usr/lib/libnetsnmphelpers.so.15.1.2
lldpd   17752 root   mem  REG           254,0   297904   1909560 /usr/lib/libnetsnmpagent.so.15.1.2
lldpd   17752 root   mem  REG           254,0   136936   1049645 /lib/x86_64-linux-gnu/ld-2.13.so
lldpd   17752 root   0u   CHR             1,3      0t0      1028 /dev/null
lldpd   17752 root   1u   CHR             1,3      0t0      1028 /dev/null
lldpd   17752 root   2u   CHR             1,3      0t0      1028 /dev/null
lldpd   17752 root   3u   unix 0x8802f8bed580      0t0 434727198 socket
lldpd   17752 root   4u   sock            0,7      0t0 434726723 can't identify protocol
lldpd   17752 root   5u   unix 0x88147df7f740      0t0 434726718 socket

# lsof -n -p 17754
COMMAND   PID USER   FD   TYPE         DEVICE SIZE/OFF      NODE NAME
lldpd   17754 _lldpd cwd  DIR            0,14       60      5332 /run/lldpd
lldpd   17754 _lldpd rtd  DIR            0,14       60      5332 /run/lldpd
lldpd   17754 _lldpd txt  REG           254,0   142520   1910045 /usr/sbin/lldpd
lldpd   17754 _lldpd mem  REG           254,0    47616   1049635 /lib/x86_64-linux-gnu/libnss_files-2.13.so
lldpd   17754 _lldpd mem  REG           254,0    43560   1049647 /lib/x86_64-linux-gnu/libnss_nis-2.13.so
lldpd   17754 _lldpd mem  REG           254,0    31584   1049648 /lib/x86_64-linux-gnu/libnss_compat-2.13.so
lldpd   17754 _lldpd mem  REG           254,0    92752   1049671 /lib/x86_64-linux-gnu/libz.so.1.2.7
lldpd   17754 _lldpd mem  REG           254,0    89056   1049639 /lib/x86_64-linux-gnu/libnsl-2.13.so
lldpd   17754 _lldpd mem  REG           254,0  2048512   1982666 /usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
lldpd   17754 _lldpd mem  REG           254,0    59712   1983102 /usr/lib/x86_64-linux-gnu/libsensors.so.4.3.2
lldpd   17754 _lldpd mem  REG           254,0    35104   1049644 /lib/x86_64-linux-gnu/libcrypt-2.13.so
lldpd   17754 _lldpd mem  REG           254,0   131107   1049651 /lib/x86_64-linux-gnu/libpthread-2.13.so
lldpd   17754 _lldpd mem  REG           254,0   530736   1049638 /lib/x86_64-linux-gnu/libm-2.13.so
lldpd   17754 _lldpd mem  REG           254,0    14768   1049642 /lib/x86_64-linux-gnu/libdl-2.13.so
lldpd   17754 _lldpd mem  REG           254,0  1574680   1908792 /usr/lib/libperl.so.5.14.2
lldpd   17754 _lldpd mem  REG           254,0    40656   1049689 /lib/x86_64-linux-gnu/libwrap.so.0.7.6
lldpd   17754 _lldpd mem  REG           254,0  1603600   1049640 /lib/x86_64-linux-gnu/libc-2.13.so
lldpd   17754 _lldpd mem  REG           254,0   649008   1909575 /usr/lib/libnetsnmp.so.15.1.2
lldpd   17754 _lldpd mem  REG           254,0  1195544   1909592 /usr/lib/libnetsnmpmibs.so.15.1.2
lldpd 1775
Bug#779238: lldpd: LLDPd daemon freezes
❦ 26 February 2015 12:36 +0200, Kiousis Alexandros:

> This is what I get now from a stuck daemon:
>
> # ps aux | grep lldp | grep -v grep
> root     17752  0.0  0.0  46468  1432 ?  SNs  Feb25  0:11 /usr/sbin/lldpd -I eth*,tap*
> _lldpd   17754  0.0  0.0  46468  1080 ?  S    Feb25  0:11 /usr/sbin/lldpd -I eth*,tap*
> # strace -p 17752
> Process 17752 attached - interrupt to quit
> read(5, ^C
> Process 17752 detached
> # strace -p 17754
> Process 17754 attached - interrupt to quit
> write(32, "\1\200\302\0\0\16\222\207K\365\214\37\210\314\2\7\4<\331+\1\31<\4\7\3\222\207K\365\214\37"..., 195^C
> Process 17754 detached

In each case, also give the output of "lsof -n -p 17754" and "lsof -n -p 17752", so we know what 5 and 32 are. 5 is likely to be the read pipe of the monitor process, so it's normal to be stuck here. 32 is likely to be the tap device and it's not normal to be stuck here.

I'll investigate a bit to check if the interface sockets are set in blocking or non-blocking mode in this version.

-- 
Test input for validity and plausibility.
    - The Elements of Programming Style (Kernighan & Plauger)
Bug#779238: lldpd: LLDPd daemon freezes
On 02/25/2015 09:20 PM, Vincent Bernat wrote:
> What kind of VM is that? It's odd to run out of space on those tap
> interfaces. This can be the cause of lldpd hanging since writing to
> the devices is done synchronously while the rest of the daemon is
> asynchronous.

Just to be clear, this doesn't happen on a particular host/VM combination, but there is a correlation between lldpd getting stuck and this log (i.e. no buffer space on a tap interface). This is a public VPS infrastructure so I can't really tell if the VMs on these interfaces do something weird. In this instance the VM is ours but it seems to be working fine.

> Maybe you could get the output of strace the next time you run into
> the problem (strace -p pidoflldpd). There are two processes but the
> one likely to be blocked is the one running as the _lldpd user.

This is what I get now from a stuck daemon:

# ps aux | grep lldp | grep -v grep
root     17752  0.0  0.0  46468  1432 ?  SNs  Feb25  0:11 /usr/sbin/lldpd -I eth*,tap*
_lldpd   17754  0.0  0.0  46468  1080 ?  S    Feb25  0:11 /usr/sbin/lldpd -I eth*,tap*
# strace -p 17752
Process 17752 attached - interrupt to quit
read(5, ^C
Process 17752 detached
# strace -p 17754
Process 17754 attached - interrupt to quit
write(32, "\1\200\302\0\0\16\222\207K\365\214\37\210\314\2\7\4<\331+\1\31<\4\7\3\222\207K\365\214\37"..., 195^C
Process 17754 detached

-- 
---
Αλέξανδρος Κιούσης - al...@noc.grnet.gr
GRNET NOC System Administrator - http://noc.grnet.gr
---
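The buffer in the blocked `write(32, ...)` call can be decoded to confirm that it is an LLDP frame on its way out. The sketch below is an illustration in Python under a few assumptions: the bytes are copied from the strace output above (strace truncates the string, so only the first 32 bytes are available), the helper names `mac`, `parse_ethernet` and `parse_tlvs` are hypothetical, and the TLV layout (7-bit type, 9-bit length) follows the LLDP standard.

```python
# First 32 bytes of the frame, copied from the strace output. strace prints
# non-printable bytes as octal escapes, which Python bytes literals also accept.
CAPTURED = (
    b"\1\200\302\0\0\16\222\207K\365\214\37\210\314"
    b"\2\7\4<\331+\1\31<\4\7\3\222\207K\365\214\37"
)


def mac(raw):
    """Format raw bytes as a colon-separated MAC address."""
    return ":".join(f"{b:02x}" for b in raw)


def parse_ethernet(frame):
    """Return (destination MAC, source MAC, EtherType) from an Ethernet header."""
    return mac(frame[0:6]), mac(frame[6:12]), int.from_bytes(frame[12:14], "big")


def parse_tlvs(payload):
    """Split an LLDP payload into (type, value) pairs, stopping at truncated data."""
    tlvs = []
    offset = 0
    while offset + 2 <= len(payload):
        header = int.from_bytes(payload[offset:offset + 2], "big")
        tlv_type, length = header >> 9, header & 0x1FF
        value = payload[offset + 2:offset + 2 + length]
        if len(value) < length:
            break  # strace cut the buffer off here
        tlvs.append((tlv_type, value))
        offset += 2 + length
    return tlvs


dst, src, ethertype = parse_ethernet(CAPTURED)
print(dst, src, hex(ethertype))
for tlv_type, value in parse_tlvs(CAPTURED[14:]):
    # The first value byte is the ID subtype; the rest is the ID itself.
    print(tlv_type, value[0], mac(value[1:]))
```

The destination address decodes to 01:80:c2:00:00:0e, the LLDP multicast address, and the EtherType to 0x88cc, which is LLDP. The two TLVs that survive the truncation are a Chassis ID (type 1) and a Port ID (type 2), both carrying MAC addresses.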
Bug#779238: lldpd: LLDPd daemon freezes
❦ 25 February 2015 20:34 +0200, Alex Kiousis:

> We use lldp to get information about the parents of our hosts (about
> 100 nodes). Some time ago we switched to also broadcasting lldp
> information to the tap interfaces so the VMs running on the machine
> can get that information. Since then lldpd has been randomly and on
> rare occasions (i.e. one host per week) freezing completely. The
> process is still alive but lldpctl gets stuck and the daemon is not
> sending data (lldpctl from a VM is empty). Restarting the daemon fixes
> it.
>
> We also see these logs which must be relevant:
> lldp_send: unable to send packet on real device for tap14: No buffer space available
>
> I haven't tested with the newer package available on backports but
> since the bug is difficult to reproduce

What kind of VM is that? It's odd to run out of space on those tap interfaces. This can be the cause of lldpd hanging since writing to the devices is done synchronously while the rest of the daemon is asynchronous.

Maybe you could get the output of strace the next time you run into the problem (strace -p pidoflldpd). There are two processes but the one likely to be blocked is the one running as the _lldpd user.

-- 
Use debugging compilers.
    - The Elements of Programming Style (Kernighan & Plauger)
Bug#779238: lldpd: LLDPd daemon freezes
Package: lldpd
Version: 0.5.7-2
Severity: normal

Dear Maintainer,

We use lldp to get information about the parents of our hosts (about 100 nodes). Some time ago we switched to also broadcasting lldp information to the tap interfaces so the VMs running on the machine can get that information. Since then lldpd has been randomly and on rare occasions (i.e. one host per week) freezing completely. The process is still alive but lldpctl gets stuck and the daemon is not sending data (lldpctl from a VM is empty). Restarting the daemon fixes it.

We also see these logs which must be relevant:

lldp_send: unable to send packet on real device for tap14: No buffer space available

I haven't tested with the newer package available on backports but since the bug is difficult to reproduce

-- System Information:
Debian Release: 7.8
  APT prefers stable
  APT policy: (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 3.2.0-4-amd64 (SMP w/24 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages lldpd depends on:
ii  adduser    3.113+nmu3
ii  libc6      2.13-38+deb7u7
ii  libsnmp15  5.4.3~dfsg-2.8+deb7u1
ii  libxml2    2.8.0+dfsg1-7+wheezy2

lldpd recommends no packages.

Versions of packages lldpd suggests:
pn  snmpd

-- Configuration Files:
/etc/default/lldpd changed:
DAEMON_ARGS="-I eth*,tap*"

-- no debconf information