Bug#779238: lldpd: LLDPd daemon freezes

2015-03-02 Thread Kiousis Alexandros

Hi again,
comments inline.

On 02/27/2015 12:16 AM, Vincent Bernat wrote:

>  ❦ 26 February 2015 21:30 +0200, Kiousis Alexandros:
>
>> This happens when there is a tap interface that is down from the VM
>> side. So when LLDPd can't send the packets to the interface it writes
>> them to the buffer, that's where the log is from. After a while the
>> buffer space fills up and when lldpd can't write anymore, it
>> blocks. According to what you said, this is expected in this version
>> and is the reason why the whole daemon becomes unresponsive.
>> When I rebooted the vm (and later when I manually up'ed the second
>> interface) I saw by tcpdumping on the host the packets going through
>> the tap interface.
>>
>> What may be a bug is that when the interface comes back, the daemon
>> dies. There is no log about it but I can see this:
>
> Don't you get a message in the kernel log about a segfault?

Nope, no log. But when I tested it again the daemon didn't crash.

>> So at 20:07 is when the daemon resumed operation, and (I guess
>> immediately) crashed. Puppet tries to re-up it at 20:08 but the daemon
>> exits because the lldpd.socket already exists. This sounds like a bug
>> but I don't know if it's worth looking into.
>
> That's already fixed in later versions. Now, the socket is tested to
> check if someone is listening and if not, deleted.

I installed the new version from backports (0.7.11-2~bpo70+1) and it
works fine now. I see the following log:

Mar  2 10:49:25 gnt5-03 lldpd[20916]: unable to send packet on real device for tap14: Resource temporarily unavailable

which means we are OK.

I guess this ticket can be closed. Thanks for all your help.

--
---
Αλέξανδρος Κιούσης - al...@noc.grnet.gr
GRNET NOC System Administrator - http://noc.grnet.gr
---


--
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org



Bug#779238: lldpd: LLDPd daemon freezes

2015-02-26 Thread Vincent Bernat
 ❦ 26 February 2015 21:30 +0200, Kiousis Alexandros:

> This happens when there is a tap interface that is down from the VM
> side. So when LLDPd can't send the packets to the interface it writes
> them to the buffer, that's where the log is from. After a while the
> buffer space fills up and when lldpd can't write anymore, it
> blocks. According to what you said, this is expected in this version
> and is the reason why the whole daemon becomes unresponsive.
> When I rebooted the vm (and later when I manually up'ed the second
> interface) I saw by tcpdumping on the host the packets going through
> the tap interface.
>
> What may be a bug is that when the interface comes back, the daemon
> dies. There is no log about it but I can see this:

Don't you get a message in the kernel log about a segfault?

> So at 20:07 is when the daemon resumed operation, and (I guess
> immediately) crashed. Puppet tries to re-up it at 20:08 but the daemon
> exits because the lldpd.socket already exists. This sounds like a bug
> but I don't know if it's worth looking into.

That's already fixed in later versions. Now, the socket is tested to
check if someone is listening and if not, deleted.
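
The check described above can be sketched as follows. This is only an
illustration in Python, not lldpd's actual code (lldpd is written in C);
the function name remove_if_stale is hypothetical, and the sketch assumes
the control socket is a stream-type Unix socket. The idea is to try to
connect to the existing socket file: if the connection is refused, nobody
is listening and the file can be deleted before binding again.

```python
import os
import socket
import stat


def remove_if_stale(path):
    """Delete the Unix socket file at `path` if nothing is listening on it.

    Returns True if a stale socket file was removed, False if there was no
    file or a live listener accepted the probe connection.
    """
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return False  # nothing to clean up
    if not stat.S_ISSOCK(mode):
        # Refuse to delete something that is not a socket at all.
        raise ValueError(f"{path} exists but is not a socket")

    probe = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        probe.connect(path)
    except ConnectionRefusedError:
        # The file exists but no process accepts connections: stale.
        os.unlink(path)
        return True
    finally:
        probe.close()
    return False  # someone is listening; leave the socket alone
```

A daemon would call something like remove_if_stale("/var/run/lldpd.socket")
just before binding its control socket, so that a leftover file from a
crashed instance no longer causes "Address already in use".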
-- 
My only love sprung from my only hate!
Too early seen unknown, and known too late!
-- William Shakespeare, "Romeo and Juliet"




Bug#779238: lldpd: LLDPd daemon freezes

2015-02-26 Thread Kiousis Alexandros

So I guess we figured it out.

This happens when there is a tap interface that is down from the VM 
side. So when LLDPd can't send the packets to the interface it writes 
them to the buffer, that's where the log is from. After a while the 
buffer space fills up and when lldpd can't write anymore, it blocks. 
According to what you said, this is expected in this version and is the 
reason why the whole daemon becomes unresponsive.
When I rebooted the vm (and later when I manually up'ed the second
interface) I saw by tcpdumping on the host the packets going through the
tap interface.


What may be a bug is that when the interface comes back, the daemon
dies. There is no log about it but I can see this:


Feb 26 20:07:22 gnt5-03 lldpd[17754]: lldp_decode: unknown org tlv received on eth0
Feb 26 20:08:02 gnt5-03 puppet-agent[3614]: (/Stage[main]/Lldp/Service[lldpd]/ensure) ensure changed 'stopped' to 'running'
Feb 26 20:08:02 gnt5-03 lldpd[15562]: asroot_ctl_create: [priv]: unable to create control socket: Address already in use
Feb 26 20:08:02 gnt5-03 lldpd[15563]: fatal: unable to create control socket /var/run/lldpd.socket


So at 20:07 is when the daemon resumed operation, and (I guess
immediately) crashed. Puppet tries to re-up it at 20:08 but the daemon
exits because the lldpd.socket already exists. This sounds like a bug but
I don't know if it's worth looking into.


Thanks for all your help. I guess I need to look into the backported
version since I have no way to ensure that the nics are active on all
the vms.




On 02/26/2015 04:58 PM, Vincent Bernat wrote:

>  ❦ 26 February 2015 14:32 +0200, Kiousis Alexandros:
>
>> lldpd   17754 _lldpd   32u  pack  434712919  0t0   ALL type=SOCK_RAW
>
> So, it is blocking on the tap device. Maybe the VM on the other side has
> been rebooted. Could you check if there is anything odd with
> "ss --info --extended --memory -anp | grep 434712919"?
>
> It should display which interface is associated to file descriptor 32.
>
> Starting from lldpd 0.6, the whole event model has been rewritten and is
> using libevent. Moreover, the socket is made non blocking. Therefore,
> inability to write won't block lldpd anymore if you use the version from
> backports.

--
---
Αλέξανδρος Κιούσης - al...@noc.grnet.gr
GRNET NOC System Administrator - http://noc.grnet.gr
---





Bug#779238: lldpd: LLDPd daemon freezes

2015-02-26 Thread Vincent Bernat
 ❦ 26 February 2015 14:32 +0200, Kiousis Alexandros:

> lldpd   17754 _lldpd   32u  pack  434712919  0t0   ALL
> type=SOCK_RAW

So, it is blocking on the tap device. Maybe the VM on the other side has
been rebooted. Could you check if there is anything odd with
"ss --info --extended --memory -anp | grep 434712919"?

It should display which interface is associated to file descriptor 32.

Starting from lldpd 0.6, the whole event model has been rewritten and is
using libevent. Moreover, the socket is made non blocking. Therefore,
inability to write won't block lldpd anymore if you use the version from
backports.
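
To illustrate the difference a non-blocking socket makes, here is a
minimal sketch in Python. It is not lldpd code and it uses a local
socketpair rather than a packet socket on a tap device; the helper name
fill_until_full is hypothetical. The peer never reads, which stands in
for a tap interface whose VM side is down. Once the kernel buffer is
full, send() raises BlockingIOError (EAGAIN, which Linux reports as
"Resource temporarily unavailable") instead of hanging the process.

```python
import socket


def fill_until_full(sock, chunk=b"x" * 4096):
    """Write to a non-blocking socket until the kernel refuses more data.

    Returns the number of bytes the kernel accepted before send() failed
    with EAGAIN/EWOULDBLOCK.
    """
    sock.setblocking(False)
    sent = 0
    while True:
        try:
            sent += sock.send(chunk)
        except BlockingIOError:
            # Buffer full: the call returns immediately and the caller can
            # log the failure and carry on with its other work.
            return sent
```

With a blocking socket the same loop would simply stop inside send() once
the buffer filled up, which matches the write(32, ...) call seen in the
strace output of the stuck 0.5.7 daemon.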
-- 
No group of professionals meets except to conspire against the public at large.
-- Mark Twain




Bug#779238: lldpd: LLDPd daemon freezes

2015-02-26 Thread Kiousis Alexandros

This is the output:

# lsof -n -p 17752
COMMAND   PID USER   FD TYPE         DEVICE SIZE/OFF      NODE NAME
lldpd   17752 root  cwd  DIR          254,0     4096         2 /
lldpd   17752 root  rtd  DIR          254,0     4096         2 /
lldpd   17752 root  txt  REG          254,0   142520   1910045 /usr/sbin/lldpd
lldpd   17752 root  mem  REG          254,0    47616   1049635 /lib/x86_64-linux-gnu/libnss_files-2.13.so
lldpd   17752 root  mem  REG          254,0    43560   1049647 /lib/x86_64-linux-gnu/libnss_nis-2.13.so
lldpd   17752 root  mem  REG          254,0    31584   1049648 /lib/x86_64-linux-gnu/libnss_compat-2.13.so
lldpd   17752 root  mem  REG          254,0    92752   1049671 /lib/x86_64-linux-gnu/libz.so.1.2.7
lldpd   17752 root  mem  REG          254,0    89056   1049639 /lib/x86_64-linux-gnu/libnsl-2.13.so
lldpd   17752 root  mem  REG          254,0  2048512   1982666 /usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
lldpd   17752 root  mem  REG          254,0    59712   1983102 /usr/lib/x86_64-linux-gnu/libsensors.so.4.3.2
lldpd   17752 root  mem  REG          254,0    35104   1049644 /lib/x86_64-linux-gnu/libcrypt-2.13.so
lldpd   17752 root  mem  REG          254,0   131107   1049651 /lib/x86_64-linux-gnu/libpthread-2.13.so
lldpd   17752 root  mem  REG          254,0   530736   1049638 /lib/x86_64-linux-gnu/libm-2.13.so
lldpd   17752 root  mem  REG          254,0    14768   1049642 /lib/x86_64-linux-gnu/libdl-2.13.so
lldpd   17752 root  mem  REG          254,0  1574680   1908792 /usr/lib/libperl.so.5.14.2
lldpd   17752 root  mem  REG          254,0    40656   1049689 /lib/x86_64-linux-gnu/libwrap.so.0.7.6
lldpd   17752 root  mem  REG          254,0  1603600   1049640 /lib/x86_64-linux-gnu/libc-2.13.so
lldpd   17752 root  mem  REG          254,0   649008   1909575 /usr/lib/libnetsnmp.so.15.1.2
lldpd   17752 root  mem  REG          254,0  1195544   1909592 /usr/lib/libnetsnmpmibs.so.15.1.2
lldpd   17752 root  mem  REG          254,0   151376   1909578 /usr/lib/libnetsnmphelpers.so.15.1.2
lldpd   17752 root  mem  REG          254,0   297904   1909560 /usr/lib/libnetsnmpagent.so.15.1.2
lldpd   17752 root  mem  REG          254,0   136936   1049645 /lib/x86_64-linux-gnu/ld-2.13.so
lldpd   17752 root   0u  CHR            1,3      0t0      1028 /dev/null
lldpd   17752 root   1u  CHR            1,3      0t0      1028 /dev/null
lldpd   17752 root   2u  CHR            1,3      0t0      1028 /dev/null
lldpd   17752 root   3u unix 0x8802f8bed580      0t0 434727198 socket
lldpd   17752 root   4u sock            0,7      0t0 434726723 can't identify protocol
lldpd   17752 root   5u unix 0x88147df7f740      0t0 434726718 socket

# lsof -n -p 17754
COMMAND   PID USER     FD TYPE         DEVICE SIZE/OFF      NODE NAME
lldpd   17754 _lldpd  cwd  DIR           0,14       60      5332 /run/lldpd
lldpd   17754 _lldpd  rtd  DIR           0,14       60      5332 /run/lldpd
lldpd   17754 _lldpd  txt  REG          254,0   142520   1910045 /usr/sbin/lldpd
lldpd   17754 _lldpd  mem  REG          254,0    47616   1049635 /lib/x86_64-linux-gnu/libnss_files-2.13.so
lldpd   17754 _lldpd  mem  REG          254,0    43560   1049647 /lib/x86_64-linux-gnu/libnss_nis-2.13.so
lldpd   17754 _lldpd  mem  REG          254,0    31584   1049648 /lib/x86_64-linux-gnu/libnss_compat-2.13.so
lldpd   17754 _lldpd  mem  REG          254,0    92752   1049671 /lib/x86_64-linux-gnu/libz.so.1.2.7
lldpd   17754 _lldpd  mem  REG          254,0    89056   1049639 /lib/x86_64-linux-gnu/libnsl-2.13.so
lldpd   17754 _lldpd  mem  REG          254,0  2048512   1982666 /usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
lldpd   17754 _lldpd  mem  REG          254,0    59712   1983102 /usr/lib/x86_64-linux-gnu/libsensors.so.4.3.2
lldpd   17754 _lldpd  mem  REG          254,0    35104   1049644 /lib/x86_64-linux-gnu/libcrypt-2.13.so
lldpd   17754 _lldpd  mem  REG          254,0   131107   1049651 /lib/x86_64-linux-gnu/libpthread-2.13.so
lldpd   17754 _lldpd  mem  REG          254,0   530736   1049638 /lib/x86_64-linux-gnu/libm-2.13.so
lldpd   17754 _lldpd  mem  REG          254,0    14768   1049642 /lib/x86_64-linux-gnu/libdl-2.13.so
lldpd   17754 _lldpd  mem  REG          254,0  1574680   1908792 /usr/lib/libperl.so.5.14.2
lldpd   17754 _lldpd  mem  REG          254,0    40656   1049689 /lib/x86_64-linux-gnu/libwrap.so.0.7.6
lldpd   17754 _lldpd  mem  REG          254,0  1603600   1049640 /lib/x86_64-linux-gnu/libc-2.13.so
lldpd   17754 _lldpd  mem  REG          254,0   649008   1909575 /usr/lib/libnetsnmp.so.15.1.2
lldpd   17754 _lldpd  mem  REG          254,0  1195544   1909592 /usr/lib/libnetsnmpmibs.so.15.1.2
lldpd   1775

Bug#779238: lldpd: LLDPd daemon freezes

2015-02-26 Thread Vincent Bernat
 ❦ 26 February 2015 12:36 +0200, Kiousis Alexandros:

> This is what I get now from a stuck daemon:
>
> # ps aux | grep lldp | grep -v grep
> root     17752  0.0  0.0  46468  1432 ?        SNs  Feb25   0:11 /usr/sbin/lldpd -I eth*,tap*
> _lldpd   17754  0.0  0.0  46468  1080 ?        S    Feb25   0:11 /usr/sbin/lldpd -I eth*,tap*
> # strace -p 17752
> Process 17752 attached - interrupt to quit
> read(5, ^C
> Process 17752 detached
> # strace -p 17754
> Process 17754 attached - interrupt to quit
> write(32, "\1\200\302\0\0\16\222\207K\365\214\37\210\314\2\7\4<\331+\1\31<\4\7\3\222\207K\365\214\37"..., 195^C
> Process 17754 detached

In each case, also give the output of "lsof -n -p 17754" and "lsof -n -p
17752", so we know what 5 and 32 are. 5 is likely to be the read pipe of
the monitor process, so, it's normal to be stuck here. 32 is likely to
be the tap device and it's not normal to be stuck here.
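
If lsof is not at hand, roughly the same mapping from descriptor numbers
to files can be read from /proc on Linux. The following is a small Python
sketch under that assumption; the helper name describe_fds is
hypothetical. Each entry in /proc/<pid>/fd is a symlink whose target
names the open file. Sockets show up as "socket:[inode]", and that inode
number can then be looked up with a tool such as ss.

```python
import os


def describe_fds(pid):
    """Map each open file descriptor of `pid` to its /proc symlink target.

    Linux only. Reading another user's process usually requires root.
    """
    fd_dir = f"/proc/{pid}/fd"
    result = {}
    for name in os.listdir(fd_dir):
        try:
            result[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except FileNotFoundError:
            continue  # the fd was closed between listdir() and readlink()
    return result
```

For the case in this report, describe_fds(17754).get(32) and
describe_fds(17752).get(5) would show what the blocked write() and read()
calls are operating on.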

I'll investigate a bit to check if the interface sockets are set in
blocking or non-blocking mode in this version.
-- 
Test input for validity and plausibility.
- The Elements of Programming Style (Kernighan & Plauger)




Bug#779238: lldpd: LLDPd daemon freezes

2015-02-26 Thread Kiousis Alexandros

On 02/25/2015 09:20 PM, Vincent Bernat wrote:

> What kind of VM is that? It's odd to run out of space on those tap
> interfaces. This can be the cause of lldpd hanging since writing to the
> devices is done synchronously while the rest of the daemon is
> asynchronous.

Just to be clear, this doesn't happen on a particular host/vm combo but
there is a correlation between lldpd getting stuck and this log (i.e. no
buffer space on a tap interface). This is a public VPS infrastructure so
I can't really tell if the vms on these interfaces do something weird. In
this instance the vm is ours but it seems to be working fine.

> Maybe you could get the output of strace the next time you run into the
> problem (strace -p pidoflldpd). There are two processes but the one
> likely to be blocked is the one running as _lldpd user.

This is what I get now from a stuck daemon:

# ps aux | grep lldp | grep -v grep
root     17752  0.0  0.0  46468  1432 ?        SNs  Feb25   0:11 /usr/sbin/lldpd -I eth*,tap*
_lldpd   17754  0.0  0.0  46468  1080 ?        S    Feb25   0:11 /usr/sbin/lldpd -I eth*,tap*

# strace -p 17752
Process 17752 attached - interrupt to quit
read(5, ^C
Process 17752 detached
# strace -p 17754
Process 17754 attached - interrupt to quit
write(32, "\1\200\302\0\0\16\222\207K\365\214\37\210\314\2\7\4<\331+\1\31<\4\7\3\222\207K\365\214\37"..., 195^C
Process 17754 detached

--
---
Αλέξανδρος Κιούσης - al...@noc.grnet.gr
GRNET NOC System Administrator - http://noc.grnet.gr
---





Bug#779238: lldpd: LLDPd daemon freezes

2015-02-25 Thread Vincent Bernat
 ❦ 25 February 2015 20:34 +0200, Alex Kiousis:

> We use lldp to get information about the parents of our hosts (about
> 100 nodes). Some time ago we switched to also broadcasting lldp
> information to the tap interfaces so the vms running on the machine
> can get that information. Since then lldpd has been randomly and on
> rare occasions (i.e. one host per week) freezing completely. The
> process is still alive but lldpctl gets stuck and the daemon is not
> sending data (lldpctl from a vm is empty). Restarting the daemon fixes
> it.
>
> We also see these logs which must be relevant:
> lldp_send: unable to send packet on real device for tap14: No buffer space 
> available
>
> I haven't tested with the newer package available on backports but
> since the bug is difficult to reproduce

What kind of VM is that? It's odd to run out of space on those tap
interfaces. This can be the cause of lldpd hanging since writing to the
devices is done synchronously while the rest of the daemon is
asynchronous.

Maybe you could get the output of strace the next time you run into the
problem (strace -p pidoflldpd). There are two processes but the one
likely to be blocked is the one running as _lldpd user.
-- 
Use debugging compilers.
- The Elements of Programming Style (Kernighan & Plauger)




Bug#779238: lldpd: LLDPd daemon freezes

2015-02-25 Thread Alex Kiousis
Package: lldpd
Version: 0.5.7-2
Severity: normal

Dear Maintainer,
We use lldp to get information about the parents of our hosts (about 100
nodes). Some time ago we switched to also broadcasting lldp information to the
tap interfaces so the vms running on the machine can get that information.
Since then lldpd has been randomly and on rare occasions (i.e. one host per
week) freezing completely. The process is still alive but lldpctl gets stuck
and the daemon is not sending data (lldpctl from a vm is empty). Restarting the
daemon fixes it.

We also see these logs which must be relevant:
lldp_send: unable to send packet on real device for tap14: No buffer space available

I haven't tested with the newer package available on backports but since the 
bug is difficult to reproduce




-- System Information:
Debian Release: 7.8
  APT prefers stable
  APT policy: (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 3.2.0-4-amd64 (SMP w/24 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages lldpd depends on:
ii  adduser    3.113+nmu3
ii  libc6      2.13-38+deb7u7
ii  libsnmp15  5.4.3~dfsg-2.8+deb7u1
ii  libxml2    2.8.0+dfsg1-7+wheezy2

lldpd recommends no packages.

Versions of packages lldpd suggests:
pn  snmpd  

-- Configuration Files:
/etc/default/lldpd changed:
DAEMON_ARGS="-I eth*,tap*"


-- no debconf information

