Re: After a while of system running no incoming UDP any more?

2017-08-11 Thread Marc Haber
On Fri, Aug 11, 2017 at 04:34:53PM +0200, Marc Haber wrote:
> On Fri, Jul 28, 2017 at 02:14:34PM +0200, Marc Haber wrote:
> > I can confirm that these two changes make a system in bad state work
> > again immediately. Will try the patch on 4.12.4 later today.
> 
> After upgrading my test systems to 4.12.5, the issue reappeared. This
> shows me that the patch indeed helped (my patched 4.12.4 kernels didn't
> show the bad behavior), and that the patch didn't make its way into
> 4.12.5. The patch applied to 4.12.5, kernels are building.

It seems to be in the freshly released 4.12.6.

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mail address in header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421


Re: After a while of system running no incoming UDP any more?

2017-08-11 Thread Marc Haber
On Fri, Jul 28, 2017 at 02:14:34PM +0200, Marc Haber wrote:
> I can confirm that these two changes make a system in bad state work
> again immediately. Will try the patch on 4.12.4 later today.

After upgrading my test systems to 4.12.5, the issue reappeared. This
shows me that the patch indeed helped (my patched 4.12.4 kernels didn't
show the bad behavior), and that the patch didn't make its way into
4.12.5. The patch applied to 4.12.5, kernels are building.

The run-time fix works as well.

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mail address in header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421


Re: After a while of system running no incoming UDP any more?

2017-07-28 Thread Marc Haber
On Fri, Jul 28, 2017 at 10:07:57AM +0200, Paolo Abeni wrote:
> As a workaround you can disable UDP early demux:
> 
> echo 0 > /proc/sys/net/ipv4/udp_early_demux
> 
> (will affect both IPv4 and IPv6)
> 
> and (if the system is already in the bad state) increase the UDP
> accounted memory limit by writing values greater than the current ones
> to /proc/sys/net/ipv4/udp_mem (the actual values depend on the system's
> total memory).

I can confirm that these two changes make a system in bad state work
again immediately. Will try the patch on 4.12.4 later today.

Thanks for helping!

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mail address in header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421


Re: After a while of system running no incoming UDP any more?

2017-07-28 Thread Eric Dumazet
On Fri, 2017-07-28 at 10:15 +0200, Paolo Abeni wrote:
> Hi,
> 
> On Fri, 2017-07-28 at 01:05 -0700, Eric Dumazet wrote:

> I think that the leaked sockets are still unhashed on close() (via
> udp_lib_close() -> sk_common_release() -> sk->sk_prot->unhash()), so they
> should not be listed there ?!? (Rush answer, subject to lack of coffee
> issues)

If the daemon is still running, sockets are not closed and we can watch
the refcount increasing.

We can see this even on a UDP_RR workload (no UDP flood at all):

lpk51:~# while :
> do
> ./super_netperf 10 -H 2002:af6:b34:: -t UDP_RR -l 1 -- -N
> done


lpk52:~# cat /proc/net/udp6 | grep -v " 2 "
  sl  local_address remote_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode ref pointer drops
  293: F60A0220340B:F0A9 F60A0220330B:AB11 01 : 00:  00 583763 387 880ddb862c00 0
 3251: F60A0220340B:FC37 F60A0220330B:EA7F 01 : 00:  00 586126 356 880dd25a7940 0
 4043: F60A0220340B:FF4F F60A0220330B:9A35 01 0500: 00:  00 583760 409 880ddb863180 0
 7832: F60A0220340B:8E1C F60A0220330B:DEC1 01 : 00:  00 583766 387 880ddb863700 0
 8568: F60A0220340B:90FC F60A0220330B:8A60 01 : 00:  00 580369 379 880d9e7b2700 0
13085: F60A0220340B:A2A1 F60A0220330B:CDB3 01 : 00:  00 586915 397 880e31994140 0
14975: F60A0220340B:AA03 F60A0220330B:9C71 01 : 00:  00 586123 355 880d9d1baec0 0
15155: F60A0220340B:AAB7 F60A0220330B:D324 01 : 00:  00 583769 367 880dd88141c0 0
25421: F60A0220340B:D2D1 F60A0220330B:CA33 01 : 00:  00 586120 424 880dfda4b0c0 0
28774: F60A0220340B:DFEA F60A0220330B:DD05 01 0500: 00:  00 583757 411 880ddb862680 0
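
To watch whether those refcounts keep climbing, the same dump can be
repeated with changes highlighted; a minimal sketch, using the same
filter as above:

watch -d 'grep -v " 2 " /proc/net/udp6'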


Re: After a while of system running no incoming UDP any more?

2017-07-28 Thread Paolo Abeni
Hi,

On Fri, 2017-07-28 at 01:05 -0700, Eric Dumazet wrote:
> On Fri, 2017-07-28 at 08:26 +0200, Marc Haber wrote:
> > On Mon, Jul 24, 2017 at 04:19:10PM +0200, Paolo Abeni wrote:
> > > Once that a system enter the buggy status, do the packets reach the
> > > relevant socket's queue?
> > > 
> > > ss -u
> > > nstat |grep -e Udp -e Ip
> > > 
> > > will help checking that.
> > 
> > I now have the issue on one machine, a Xen guest acting as authoritative
> > nameserver for my domains. Here are the outputs during normal use, with
> > artificial queries coming in on eth0:
> > 
> > [9/1075]mh@impetus:~ $ ss -u
> > Recv-Q Send-Q  Local Address:Port      Peer Address:Port
> > 0      0           127.0.0.1:56547        127.0.0.1:domain
> > 0      0      216.231.132.60:27667       198.41.0.4:domain
> > 0      0      216.231.132.60:44121          8.8.8.8:domain
> > 0      0      216.231.132.60:29814       198.41.0.4:domain
> > [10/1076]mh@impetus:~ $ ss -u
> > Recv-Q Send-Q  Local Address:Port      Peer Address:Port
> > [11/1076]mh@impetus:~ $ ss -u
> > Recv-Q Send-Q  Local Address:Port      Peer Address:Port
> > [12/1076]mh@impetus:~ $ ss -u
> > Recv-Q Send-Q  Local Address:Port      Peer Address:Port
> > [13/1076]mh@impetus:~ $ ss -u
> > Recv-Q Send-Q  Local Address:Port      Peer Address:Port
> > [14/1076]mh@impetus:~ $ nstat | grep -e Udp -e Ip
> > IpInReceives                    400688     0.0
> > IpInAddrErrors                  18567      0.0
> > IpInUnknownProtos               3          0.0
> > IpInDelivers                    330634     0.0
> > IpOutRequests                   283637     0.0
> > UdpInDatagrams                  145860     0.0
> > UdpNoPorts                      1313       0.0
> > UdpInErrors                     9356       0.0
> > UdpOutDatagrams                 153093     0.0
> > UdpIgnoredMulti                 34148      0.0
> > Ip6InReceives                   161178     0.0
> > Ip6InNoRoutes                   8          0.0
> > Ip6InDelivers                   73841      0.0
> > Ip6OutRequests                  77575      0.0
> > Ip6InMcastPkts                  87332      0.0
> > Ip6OutMcastPkts                 109        0.0
> > Ip6InOctets                     21880674   0.0
> > Ip6OutOctets                    9633059    0.0
> > Ip6InMcastOctets                9371483    0.0
> > Ip6OutMcastOctets               6636       0.0
> > Ip6InNoECTPkts                  161202     0.0
> > Ip6InECT1Pkts                   15         0.0
> > Ip6InECT0Pkts                   11         0.0
> > Ip6InCEPkts                     4          0.0
> > Udp6InDatagrams                 11725      0.0
> > Udp6NoPorts                     2          0.0
> > Udp6InErrors                    1989       0.0
> > Udp6OutDatagrams                14483      0.0
> > IpExtInBcastPkts                34148      0.0
> > IpExtInOctets                   47462716   0.0
> > IpExtOutOctets                  31262696   0.0
> > IpExtInBcastOctets              7476059    0.0
> > IpExtInNoECTPkts                400178     0.0
> > IpExtInECT1Pkts                 22         0.0
> > IpExtInECT0Pkts                 481        0.0
> > IpExtInCEPkts                   14         0.0
> > [15/1077]mh@impetus:~ $ nstat | grep -e Udp -e Ip
> > IpInReceives                    25         0.0
> > IpInDelivers                    25         0.0
> > IpOutRequests                   16         0.0
> > UdpInDatagrams                  1          0.0
> > UdpInErrors                     24         0.0
> > UdpOutDatagrams                 16         0.0
> > Ip6InReceives                   15         0.0
> > Ip6InDelivers                   14         0.0
> > 

Re: After a while of system running no incoming UDP any more?

2017-07-28 Thread Paolo Abeni
Hi,

On Fri, 2017-07-28 at 08:26 +0200, Marc Haber wrote:
> On Mon, Jul 24, 2017 at 04:19:10PM +0200, Paolo Abeni wrote:
> > Once a system enters the buggy state, do the packets reach the
> > relevant socket's queue?
> > 
> > ss -u
> > nstat |grep -e Udp -e Ip
> > 
> > will help checking that.
> 
> I now have the issue on one machine, a Xen guest acting as authoritative
> nameserver for my domains. Here are the outputs during normal use, with
> artificial queries coming in on eth0:
> 
> [9/1075]mh@impetus:~ $ ss -u
> Recv-Q Send-Q  Local Address:Port      Peer Address:Port
> 0      0           127.0.0.1:56547        127.0.0.1:domain
> 0      0      216.231.132.60:27667       198.41.0.4:domain
> 0      0      216.231.132.60:44121          8.8.8.8:domain
> 0      0      216.231.132.60:29814       198.41.0.4:domain
> [10/1076]mh@impetus:~ $ ss -u
> Recv-Q Send-Q  Local Address:Port      Peer Address:Port
> [11/1076]mh@impetus:~ $ ss -u
> Recv-Q Send-Q  Local Address:Port      Peer Address:Port
> [12/1076]mh@impetus:~ $ ss -u
> Recv-Q Send-Q  Local Address:Port      Peer Address:Port
> [13/1076]mh@impetus:~ $ ss -u
> Recv-Q Send-Q  Local Address:Port      Peer Address:Port
> [14/1076]mh@impetus:~ $ nstat | grep -e Udp -e Ip
> IpInReceives                    400688     0.0
> IpInAddrErrors                  18567      0.0
> IpInUnknownProtos               3          0.0
> IpInDelivers                    330634     0.0
> IpOutRequests                   283637     0.0
> UdpInDatagrams                  145860     0.0
> UdpNoPorts                      1313       0.0
> UdpInErrors                     9356       0.0

Thanks for the info. This is compatible with what is reported in:

https://bugzilla.kernel.org/show_bug.cgi?id=196469

and should be fixed by this patch:

http://marc.info/?l=linux-netdev&m=150115960024825&w=2

(approval pending)

As a workaround you can disable UDP early demux:

echo 0 > /proc/sys/net/ipv4/udp_early_demux

(will affect both IPv4 and IPv6)

and (if the system is already in the bad state) increase the UDP
accounted memory limit by writing values greater than the current ones
to /proc/sys/net/ipv4/udp_mem (the actual values depend on the system's
total memory).
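
Putting both steps together, a minimal sketch (the placeholder values
must be substituted with numbers larger than those currently printed;
they are not tuned recommendations):

# disable UDP early demux (the sysctl covers both IPv4 and IPv6)
echo 0 > /proc/sys/net/ipv4/udp_early_demux

# show the current limits: "min pressure max", counted in pages
cat /proc/sys/net/ipv4/udp_mem

# write back larger values, substituting the numbers printed above
echo "MIN PRESSURE MAX" > /proc/sys/net/ipv4/udp_mem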

Feel free to test the above patch on your systems.

Cheers,

Paolo


Re: After a while of system running no incoming UDP any more?

2017-07-28 Thread Eric Dumazet
On Fri, 2017-07-28 at 08:26 +0200, Marc Haber wrote:
> On Mon, Jul 24, 2017 at 04:19:10PM +0200, Paolo Abeni wrote:
> > Once a system enters the buggy state, do the packets reach the
> > relevant socket's queue?
> > 
> > ss -u
> > nstat |grep -e Udp -e Ip
> > 
> > will help checking that.
> 
> I now have the issue on one machine, a Xen guest acting as authoritative
> nameserver for my domains. Here are the outputs during normal use, with
> artificial queries coming in on eth0:
> 
> [9/1075]mh@impetus:~ $ ss -u
> Recv-Q Send-Q  Local Address:Port      Peer Address:Port
> 0      0           127.0.0.1:56547        127.0.0.1:domain
> 0      0      216.231.132.60:27667       198.41.0.4:domain
> 0      0      216.231.132.60:44121          8.8.8.8:domain
> 0      0      216.231.132.60:29814       198.41.0.4:domain
> [10/1076]mh@impetus:~ $ ss -u
> Recv-Q Send-Q  Local Address:Port      Peer Address:Port
> [11/1076]mh@impetus:~ $ ss -u
> Recv-Q Send-Q  Local Address:Port      Peer Address:Port
> [12/1076]mh@impetus:~ $ ss -u
> Recv-Q Send-Q  Local Address:Port      Peer Address:Port
> [13/1076]mh@impetus:~ $ ss -u
> Recv-Q Send-Q  Local Address:Port      Peer Address:Port
> [14/1076]mh@impetus:~ $ nstat | grep -e Udp -e Ip
> IpInReceives                    400688     0.0
> IpInAddrErrors                  18567      0.0
> IpInUnknownProtos               3          0.0
> IpInDelivers                    330634     0.0
> IpOutRequests                   283637     0.0
> UdpInDatagrams                  145860     0.0
> UdpNoPorts                      1313       0.0
> UdpInErrors                     9356       0.0
> UdpOutDatagrams                 153093     0.0
> UdpIgnoredMulti                 34148      0.0
> Ip6InReceives                   161178     0.0
> Ip6InNoRoutes                   8          0.0
> Ip6InDelivers                   73841      0.0
> Ip6OutRequests                  77575      0.0
> Ip6InMcastPkts                  87332      0.0
> Ip6OutMcastPkts                 109        0.0
> Ip6InOctets                     21880674   0.0
> Ip6OutOctets                    9633059    0.0
> Ip6InMcastOctets                9371483    0.0
> Ip6OutMcastOctets               6636       0.0
> Ip6InNoECTPkts                  161202     0.0
> Ip6InECT1Pkts                   15         0.0
> Ip6InECT0Pkts                   11         0.0
> Ip6InCEPkts                     4          0.0
> Udp6InDatagrams                 11725      0.0
> Udp6NoPorts                     2          0.0
> Udp6InErrors                    1989       0.0
> Udp6OutDatagrams                14483      0.0
> IpExtInBcastPkts                34148      0.0
> IpExtInOctets                   47462716   0.0
> IpExtOutOctets                  31262696   0.0
> IpExtInBcastOctets              7476059    0.0
> IpExtInNoECTPkts                400178     0.0
> IpExtInECT1Pkts                 22         0.0
> IpExtInECT0Pkts                 481        0.0
> IpExtInCEPkts                   14         0.0
> [15/1077]mh@impetus:~ $ nstat | grep -e Udp -e Ip
> IpInReceives                    25         0.0
> IpInDelivers                    25         0.0
> IpOutRequests                   16         0.0
> UdpInDatagrams                  1          0.0
> UdpInErrors                     24         0.0
> UdpOutDatagrams                 16         0.0
> Ip6InReceives                   15         0.0
> Ip6InDelivers                   14         0.0
> Ip6OutRequests                  12         0.0
> Ip6InMcastPkts                  1          0.0
> Ip6InOctets                     1219       0.0
> Ip6OutOctets                    4384       0.0
> Ip6InMcastOctets                131        0.0

Re: After a while of system running no incoming UDP any more?

2017-07-28 Thread Marc Haber
On Mon, Jul 24, 2017 at 04:19:10PM +0200, Paolo Abeni wrote:
> Once a system enters the buggy state, do the packets reach the
> relevant socket's queue?
> 
> ss -u
> nstat |grep -e Udp -e Ip
> 
> will help checking that.

I now have the issue on one machine, a Xen guest acting as authoritative
nameserver for my domains. Here are the outputs during normal use, with
artificial queries coming in on eth0:

[9/1075]mh@impetus:~ $ ss -u
Recv-Q Send-Q  Local Address:Port      Peer Address:Port
0      0           127.0.0.1:56547        127.0.0.1:domain
0      0      216.231.132.60:27667       198.41.0.4:domain
0      0      216.231.132.60:44121          8.8.8.8:domain
0      0      216.231.132.60:29814       198.41.0.4:domain
[10/1076]mh@impetus:~ $ ss -u
Recv-Q Send-Q  Local Address:Port      Peer Address:Port
[11/1076]mh@impetus:~ $ ss -u
Recv-Q Send-Q  Local Address:Port      Peer Address:Port
[12/1076]mh@impetus:~ $ ss -u
Recv-Q Send-Q  Local Address:Port      Peer Address:Port
[13/1076]mh@impetus:~ $ ss -u
Recv-Q Send-Q  Local Address:Port      Peer Address:Port
[14/1076]mh@impetus:~ $ nstat | grep -e Udp -e Ip
IpInReceives                    400688     0.0
IpInAddrErrors                  18567      0.0
IpInUnknownProtos               3          0.0
IpInDelivers                    330634     0.0
IpOutRequests                   283637     0.0
UdpInDatagrams                  145860     0.0
UdpNoPorts                      1313       0.0
UdpInErrors                     9356       0.0
UdpOutDatagrams                 153093     0.0
UdpIgnoredMulti                 34148      0.0
Ip6InReceives                   161178     0.0
Ip6InNoRoutes                   8          0.0
Ip6InDelivers                   73841      0.0
Ip6OutRequests                  77575      0.0
Ip6InMcastPkts                  87332      0.0
Ip6OutMcastPkts                 109        0.0
Ip6InOctets                     21880674   0.0
Ip6OutOctets                    9633059    0.0
Ip6InMcastOctets                9371483    0.0
Ip6OutMcastOctets               6636       0.0
Ip6InNoECTPkts                  161202     0.0
Ip6InECT1Pkts                   15         0.0
Ip6InECT0Pkts                   11         0.0
Ip6InCEPkts                     4          0.0
Udp6InDatagrams                 11725      0.0
Udp6NoPorts                     2          0.0
Udp6InErrors                    1989       0.0
Udp6OutDatagrams                14483      0.0
IpExtInBcastPkts                34148      0.0
IpExtInOctets                   47462716   0.0
IpExtOutOctets                  31262696   0.0
IpExtInBcastOctets              7476059    0.0
IpExtInNoECTPkts                400178     0.0
IpExtInECT1Pkts                 22         0.0
IpExtInECT0Pkts                 481        0.0
IpExtInCEPkts                   14         0.0
[15/1077]mh@impetus:~ $ nstat | grep -e Udp -e Ip
IpInReceives                    25         0.0
IpInDelivers                    25         0.0
IpOutRequests                   16         0.0
UdpInDatagrams                  1          0.0
UdpInErrors                     24         0.0
UdpOutDatagrams                 16         0.0
Ip6InReceives                   15         0.0
Ip6InDelivers                   14         0.0
Ip6OutRequests                  12         0.0
Ip6InMcastPkts                  1          0.0
Ip6InOctets                     1219       0.0
Ip6OutOctets                    4384       0.0
Ip6InMcastOctets                131        0.0
Ip6InNoECTPkts                  15         0.0
IpExtInOctets                   11779      0.0
IpExtOutOctets                  1023       0.0
IpExtInNoECTPkts                25         0.0

Re: After a while of system running no incoming UDP any more?

2017-07-26 Thread Paolo Abeni
On Wed, 2017-07-26 at 10:10 +0200, Marc Haber wrote:
> On Tue, Jul 25, 2017 at 02:17:52PM +0200, Paolo Abeni wrote:
> > On Tue, 2017-07-25 at 13:57 +0200, Marc Haber wrote:
> > > On Mon, Jul 24, 2017 at 04:19:10PM +0200, Paolo Abeni wrote:
> > > > Once a system enters the buggy state, do the packets reach the
> > > > relevant socket's queue?
> > > > 
> > > > ss -u
> > > 
> > > That one only shows table headers on an unaffected system in normal
> > > operation, right?
> > 
> > This one shows the current length of the socket receive queue (Recv-Q,
> > the first column). If the packets land in the socket's receive buffer
> > (and the user space reader for some reason is not woken up), that value
> > will grow over time.
> 
> Only that there is no value:
> [4/4992]mh@swivel:~ $ ss -u
> Recv-Q Send-Q Local Address:Port Peer Address:Port
> [5/4992]mh@swivel:~ $
> 
> (is that the intended behavior on a system that is not affected by the
> issue?)

That means there are no open connected UDP sockets in the system at the
moment you run ss -u. I forgot to mention that you must also add the
'-a' command line option to the 'ss' tool to show all UDP sockets
regardless of their state:

ss -ua
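
With the sockets visible regardless of state, the receive queues can
then be watched over time; a sketch ('-n' merely skips name resolution):

watch -n 1 'ss -uan'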

Anyway, this issue looks quite similar to:

https://bugzilla.kernel.org/show_bug.cgi?id=196469

which contains some more information. I suggest following up on that
bugzilla entry.

Cheers,

Paolo


Re: After a while of system running no incoming UDP any more?

2017-07-26 Thread Marc Haber
On Tue, Jul 25, 2017 at 02:17:52PM +0200, Paolo Abeni wrote:
> On Tue, 2017-07-25 at 13:57 +0200, Marc Haber wrote:
> > On Mon, Jul 24, 2017 at 04:19:10PM +0200, Paolo Abeni wrote:
> > > Once a system enters the buggy state, do the packets reach the
> > > relevant socket's queue?
> > > 
> > > ss -u
> > 
> > That one only shows table headers on an unaffected system in normal
> > operation, right?
> 
> This one shows the current length of the socket receive queue (Recv-Q,
> the first column). If the packets land in the socket's receive buffer
> (and the user space reader for some reason is not woken up), that value
> will grow over time.

Only that there is no value:
[4/4992]mh@swivel:~ $ ss -u
Recv-Q Send-Q Local Address:Port Peer Address:Port  
[5/4992]mh@swivel:~ $

(is that the intended behavior on a system that is not affected by the
issue?)

> > > nstat |grep -e Udp -e Ip
> > > 
> > > will help checking that.
> > 
> > An unaffected system will show UdpInDatagrams, right?
> > 
> > But where is the connection to the relevant socket's queue?
> 
> If the socket queue length (as reported above) does not increase,
> IP/UDP stats could give a hint of where and why the packets stop
> traversing the network stack.

We'll see. Still waiting for the phenomenon to show up again.

> Beyond that, you can try using perf probes or kprobe/systemtap to [try
> to] track the relevant packets inside the kernel.

That's way beyond my kernel foo, I'm afraid.

Thanks for helping, I'll report back.

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mail address in header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421


Re: After a while of system running no incoming UDP any more?

2017-07-25 Thread Paolo Abeni
On Tue, 2017-07-25 at 13:57 +0200, Marc Haber wrote:
> On Mon, Jul 24, 2017 at 04:19:10PM +0200, Paolo Abeni wrote:
> > Once a system enters the buggy state, do the packets reach the
> > relevant socket's queue?
> > 
> > ss -u
> 
> That one only shows table headers on an unaffected system in normal
> operation, right?

This one shows the current length of the socket receive queue (Recv-Q,
the first column). If the packets land in the socket's receive buffer
(and the user space reader for some reason is not woken up), that value
will grow over time.

> > nstat |grep -e Udp -e Ip
> > 
> > will help checking that.
> 
> An unaffected system will show UdpInDatagrams, right?
> 
> But where is the connection to the relevant socket's queue?

If the socket queue length (as reported above) does not increase,
IP/UDP stats could give a hint of where and why the packets stop
traversing the network stack.

Beyond that, you can try using perf probes or kprobe/systemtap to [try
to] track the relevant packets inside the kernel.
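
For example, something along these lines (a sketch; udp_queue_rcv_skb
is one plausible probe point in net/ipv4/udp.c, and perf probe needs
kernel debug info to resolve it):

# add a dynamic probe on the UDP receive path
perf probe --add udp_queue_rcv_skb

# record hits system-wide while reproducing, then inspect
perf record -e probe:udp_queue_rcv_skb -a sleep 10
perf script

# packet drops can also be watched via an existing tracepoint
perf record -e skb:kfree_skb -a sleep 10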

/P


Re: After a while of system running no incoming UDP any more?

2017-07-25 Thread Marc Haber
Hi Paolo,

thanks for your answer. I appreciate that.

On Mon, Jul 24, 2017 at 04:19:10PM +0200, Paolo Abeni wrote:
> On Mon, 2017-07-24 at 14:09 +0200, Marc Haber wrote:
> > Before I begin running older kernels on production systems, I would
> > like to ask whether there have been recent changes in the 4.11 => 4.12
> > development cycle that might cause an issue like that.
> 
> While there has been some activity regarding the UDP protocol lately,
> almost nothing touched UDP in the 4.11 release cycle.

4.11 is good, 4.12 is bad.

> The issue you describe looks similar to the bug fixed by the commit
> 9bd780f5e066 ("udp: fix poll()"), but the bugged code is only in later
> kernels. 

That one is in v4.13-rc1 and v4.13-rc2, but it doesn't apply to my 4.12
trees.

> > Any idea what might be happening here and what else I could try?
> 
> Once a system enters the buggy state, do the packets reach the
> relevant socket's queue?
> 
> ss -u

That one only shows table headers on an unaffected system in normal
operation, right?

> nstat |grep -e Udp -e Ip
> 
> will help checking that.

An unaffected system will show UdpInDatagrams, right?

But where is the connection to the relevant socket's queue?

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mail address in header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421


Re: After a while of system running no incoming UDP any more?

2017-07-24 Thread Paolo Abeni
Hi,

On Mon, 2017-07-24 at 14:09 +0200, Marc Haber wrote:
> Before I begin running older kernels on production systems, I would
> like to ask whether there have been recent changes in the 4.11 => 4.12
> development cycle that might cause an issue like that.

While there has been some activity regarding the UDP protocol lately,
almost nothing touched UDP in the 4.11 release cycle.

The issue you describe looks similar to the bug fixed by commit
9bd780f5e066 ("udp: fix poll()"), but the buggy code is only in later
kernels.

> Any idea what might be happening here and what else I could try?

Once a system enters the buggy state, do the packets reach the
relevant socket's queue?

ss -u
nstat |grep -e Udp -e Ip

will help checking that.
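
Note that nstat by default prints deltas relative to its previous run;
absolute counter values can be dumped with -a (and -z to include
counters that are zero):

nstat -az | grep -e Udp -e Ip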

Thanks,

Paolo


After a while of system running no incoming UDP any more?

2017-07-24 Thread Marc Haber
Hi,

I am running ~50 servers, most of them as KVM guests, some of them as
Xen guests, and even fewer of them on hardware, and have recently
updated to Debian stretch. I usually use kernels locally built from
the latest vanilla stable release.

Roughly since the upgrade to Debian stretch and kernel 4.12, some of my
systems have begun to not forward UDP packets (such as incoming DNS
replies) to user space. When this happens, I see the packet coming in
on tcpdump -p, but the application never sees it and eventually times
out. An strace on the process shows it waiting in the select() syscall,
and nothing happens when the system receives the UDP packet. I also see
the same phenomenon with ntp. A reboot always fixes the issue.
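
For reference, the strace invocation looks roughly like this (a sketch;
ntpd is just an example of an affected daemon):

# attach to the waiting process and watch its socket-related syscalls
strace -f -p "$(pidof ntpd)" -e trace=select,poll,recvmsg,recvfrom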

Running wireshark on a pcap file obtained on an affected system shows
all checksums to be in order. Both IPv4 and IPv6 are affected, and in
the DNS case, switching dig/drill or even the system resolver to TCP
also works around the issue.

This happens only after the system has been running for a few days, and
I have seen it happen on both KVM and Xen guests, but not (yet) on real
hardware. In my zoo of servers, this happens - over the entire sample -
about twice a week: often enough to be annoying and seldom enough to
make debugging really difficult, since you never know in advance which
system will show the issue next.

I have therefore been reluctant to downgrade the kernel or the system,
since that would mean days of work. Bisecting is probably out of the
question, since you never know when "git bisect good" is a sufficiently
safe assumption.

Before I begin running older kernels on production systems, I would
like to ask whether there have been recent changes in the 4.11 => 4.12
development cycle that might cause an issue like that.

Since I have never seen the issue on stretch systems when they were
still running 4.11.8 (the latest 4.11 kernel that I had deployed before
switching over to 4.12), I do really suspect the kernel, and I do also
suspect that network interface offloading is probably not the culprit.
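
For completeness, the offload settings of an interface can be listed
like this (eth0 is illustrative):

# show which offload features are currently enabled
ethtool -k eth0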

On the KVM guests, I use virtio-net, and I had that one high on my list
until one of the two Xen guests that doesn't show any network modules
loaded has been showing the phenomenon as well.

That Xen guest outputs the following to lshw -C network:


  *-network
       description: Ethernet interface
       physical id: 1
       logical name: eth0
       serial: 0e:06:5f:74:48:97
       capabilities: ethernet physical
       configuration: broadcast=yes driver=vif ip= link=yes multicast=yes

So I assume that this one is not using virtio-net, which suggests that
virtio-net is in the clear as well.

Any idea what might be happening here and what else I could try?

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mail address in header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421