[lxc-users] Intermittent network issue with containers

Joshua Schaeffer Tue, 30 Jun 2020 23:06:18 -0700

I'm not sure this is actually an issue with LXD, but I've been scratching my 
head over this for a while and can't figure out what is going on, so I'm 
reaching out to several different sources. I'm intermittently losing 
connectivity to the second interface of all of my containers. If I ping out 
*from* the container *to* an external address, the connection is restored 
temporarily. Anywhere between 5 and 60 minutes later the problem reappears. On 
the surface it looks like a routing or reverse path filtering issue, but (I 
believe) I've set up those parameters properly.


For example I'm trying to ping from my local box (172.16.44.18) to the 
container's second interface called "veth-int-core" (10.2.80.129). Note that 
all general traffic is supposed to go out the first interface called 
"veth-mgmt" (10.2.28.65) and that the default gateway is set on this interface. 
I've set rp_filter on veth-int-core to 2 so the system should not drop the 
packet because of reverse path filtering.

    root@container1:~# cat /proc/sys/net/ipv4/conf/veth-int-core/rp_filter
    2
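
For source validation on an interface the kernel uses the maximum of 
conf/all/rp_filter and the per-interface value, so it can help to print both. 
A minimal sketch, assuming a bash-like shell inside the container; the 
function names are made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helpers: report the rp_filter value the kernel effectively applies.
# The kernel takes the maximum of conf/all/rp_filter and conf/<iface>/rp_filter.

max_rp_filter() {
    # $1 = value from conf/all, $2 = value from conf/<iface>
    if [ "$1" -gt "$2" ]; then echo "$1"; else echo "$2"; fi
}

show_rp_filter() {
    local iface="$1" base=/proc/sys/net/ipv4/conf
    if [ ! -r "$base/$iface/rp_filter" ]; then
        echo "$iface: no such interface" >&2
        return 1
    fi
    local all_val iface_val
    all_val=$(cat "$base/all/rp_filter")
    iface_val=$(cat "$base/$iface/rp_filter")
    echo "$iface: all=$all_val iface=$iface_val effective=$(max_rp_filter "$all_val" "$iface_val")"
}

# Interface names taken from the setup described above.
for iface in veth-mgmt veth-int-core; do
    show_rp_filter "$iface" || true
done
```

A per-interface value of 2 (loose mode) stays at 2 whatever conf/all is set 
to, since 2 is the highest value rp_filter accepts.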


From my local box I try to ping the veth-int-core interface on the container 
and receive no response:

    root@client:~$ date -u; ping -c 10 10.2.80.129; date -u
    Tue 30 Jun 2020 22:09:34 PM UTC
    PING 10.2.80.129 (10.2.80.129) 56(84) bytes of data.

    --- 10.2.80.129 ping statistics ---
    10 packets transmitted, 0 received, 100% packet loss, time 9200ms

    Tue 30 Jun 2020 22:09:53 PM UTC

If I capture traffic on the container at the same time, I can see the ICMP 
echo requests arrive. I can also see ICMP type 3 code 1 (destination 
unreachable, host unreachable) messages, sent from 10.2.80.129 to itself, each 
of which embeds the echo reply that could not be delivered.

    root@container1:~# tcpdump -nevi any icmp
    tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
    16:09:34.626373  In e4:aa:5d:99:88:4a ethertype IPv4 (0x0800), length 100: (tos 0x0, ttl 62, id 7638, offset 0, flags [DF], proto ICMP (1), length 84)
        172.16.44.18 > 10.2.80.129: ICMP echo request, id 18986, seq 1, length 64
    16:09:35.633862  In e4:aa:5d:99:88:4a ethertype IPv4 (0x0800), length 100: (tos 0x0, ttl 62, id 7689, offset 0, flags [DF], proto ICMP (1), length 84)
        172.16.44.18 > 10.2.80.129: ICMP echo request, id 18986, seq 2, length 64
    16:09:36.657897  In e4:aa:5d:99:88:4a ethertype IPv4 (0x0800), length 100: (tos 0x0, ttl 62, id 7882, offset 0, flags [DF], proto ICMP (1), length 84)
        172.16.44.18 > 10.2.80.129: ICMP echo request, id 18986, seq 3, length 64
    16:09:37.682063  In e4:aa:5d:99:88:4a ethertype IPv4 (0x0800), length 100: (tos 0x0, ttl 62, id 7901, offset 0, flags [DF], proto ICMP (1), length 84)
        172.16.44.18 > 10.2.80.129: ICMP echo request, id 18986, seq 4, length 64
    16:09:37.695263  In 00:00:00:00:00:00 ethertype IPv4 (0x0800), length 128: (tos 0xc0, ttl 64, id 59700, offset 0, flags [none], proto ICMP (1), length 112)
        10.2.80.129 > 10.2.80.129: ICMP host 172.16.44.18 unreachable, length 92
        (tos 0x0, ttl 64, id 9378, offset 0, flags [none], proto ICMP (1), length 84)
        10.2.80.129 > 172.16.44.18: ICMP echo reply, id 18986, seq 1, length 64
    16:09:37.695271  In 00:00:00:00:00:00 ethertype IPv4 (0x0800), length 128: (tos 0xc0, ttl 64, id 59701, offset 0, flags [none], proto ICMP (1), length 112)
        10.2.80.129 > 10.2.80.129: ICMP host 172.16.44.18 unreachable, length 92
        (tos 0x0, ttl 64, id 9430, offset 0, flags [none], proto ICMP (1), length 84)
        10.2.80.129 > 172.16.44.18: ICMP echo reply, id 18986, seq 2, length 64
    16:09:37.695276  In 00:00:00:00:00:00 ethertype IPv4 (0x0800), length 128: (tos 0xc0, ttl 64, id 59702, offset 0, flags [none], proto ICMP (1), length 112)
        10.2.80.129 > 10.2.80.129: ICMP host 172.16.44.18 unreachable, length 92
        (tos 0x0, ttl 64, id 9612, offset 0, flags [none], proto ICMP (1), length 84)
        10.2.80.129 > 172.16.44.18: ICMP echo reply, id 18986, seq 3, length 64
    16:09:38.705661  In e4:aa:5d:99:88:4a ethertype IPv4 (0x0800), length 100: (tos 0x0, ttl 62, id 8081, offset 0, flags [DF], proto ICMP (1), length 84)
        172.16.44.18 > 10.2.80.129: ICMP echo request, id 18986, seq 5, length 64
    16:09:39.729581  In e4:aa:5d:99:88:4a ethertype IPv4 (0x0800), length 100: (tos 0x0, ttl 62, id 8101, offset 0, flags [DF], proto ICMP (1), length 84)
        172.16.44.18 > 10.2.80.129: ICMP echo request, id 18986, seq 6, length 64
    16:09:40.753507  In e4:aa:5d:99:88:4a ethertype IPv4 (0x0800), length 100: (tos 0x0, ttl 62, id 8299, offset 0, flags [DF], proto ICMP (1), length 84)
        172.16.44.18 > 10.2.80.129: ICMP echo request, id 18986, seq 7, length 64
    16:09:41.759252  In 00:00:00:00:00:00 ethertype IPv4 (0x0800), length 128: (tos 0xc0, ttl 64, id 60134, offset 0, flags [none], proto ICMP (1), length 112)
        10.2.80.129 > 10.2.80.129: ICMP host 172.16.44.18 unreachable, length 92
        (tos 0x0, ttl 64, id 9813, offset 0, flags [none], proto ICMP (1), length 84)
        10.2.80.129 > 172.16.44.18: ICMP echo reply, id 18986, seq 5, length 64
    16:09:41.759259  In 00:00:00:00:00:00 ethertype IPv4 (0x0800), length 128: (tos 0xc0, ttl 64, id 60135, offset 0, flags [none], proto ICMP (1), length 112)
        10.2.80.129 > 10.2.80.129: ICMP host 172.16.44.18 unreachable, length 92
        (tos 0x0, ttl 64, id 10019, offset 0, flags [none], proto ICMP (1), length 84)
        10.2.80.129 > 172.16.44.18: ICMP echo reply, id 18986, seq 6, length 64
    16:09:41.759264  In 00:00:00:00:00:00 ethertype IPv4 (0x0800), length 128: (tos 0xc0, ttl 64, id 60136, offset 0, flags [none], proto ICMP (1), length 112)
        10.2.80.129 > 10.2.80.129: ICMP host 172.16.44.18 unreachable, length 92
        (tos 0x0, ttl 64, id 10271, offset 0, flags [none], proto ICMP (1), length 84)
        10.2.80.129 > 172.16.44.18: ICMP echo reply, id 18986, seq 7, length 64
    16:09:41.777449  In e4:aa:5d:99:88:4a ethertype IPv4 (0x0800), length 100: (tos 0x0, ttl 62, id 8474, offset 0, flags [DF], proto ICMP (1), length 84)
        172.16.44.18 > 10.2.80.129: ICMP echo request, id 18986, seq 8, length 64
    16:09:42.801428  In e4:aa:5d:99:88:4a ethertype IPv4 (0x0800), length 100: (tos 0x0, ttl 62, id 8491, offset 0, flags [DF], proto ICMP (1), length 84)
        172.16.44.18 > 10.2.80.129: ICMP echo request, id 18986, seq 9, length 64
    16:09:43.825371  In e4:aa:5d:99:88:4a ethertype IPv4 (0x0800), length 100: (tos 0x0, ttl 62, id 8683, offset 0, flags [DF], proto ICMP (1), length 84)
        172.16.44.18 > 10.2.80.129: ICMP echo request, id 18986, seq 10, length 64
    16:09:44.831260  In 00:00:00:00:00:00 ethertype IPv4 (0x0800), length 128: (tos 0xc0, ttl 64, id 60642, offset 0, flags [none], proto ICMP (1), length 112)
        10.2.80.129 > 10.2.80.129: ICMP host 172.16.44.18 unreachable, length 92
        (tos 0x0, ttl 64, id 10484, offset 0, flags [none], proto ICMP (1), length 84)
        10.2.80.129 > 172.16.44.18: ICMP echo reply, id 18986, seq 8, length 64
    16:09:44.831267  In 00:00:00:00:00:00 ethertype IPv4 (0x0800), length 128: (tos 0xc0, ttl 64, id 60643, offset 0, flags [none], proto ICMP (1), length 112)
        10.2.80.129 > 10.2.80.129: ICMP host 172.16.44.18 unreachable, length 92
        (tos 0x0, ttl 64, id 10689, offset 0, flags [none], proto ICMP (1), length 84)
        10.2.80.129 > 172.16.44.18: ICMP echo reply, id 18986, seq 9, length 64
    16:09:44.831272  In 00:00:00:00:00:00 ethertype IPv4 (0x0800), length 128: (tos 0xc0, ttl 64, id 60644, offset 0, flags [none], proto ICMP (1), length 112)
        10.2.80.129 > 10.2.80.129: ICMP host 172.16.44.18 unreachable, length 92
        (tos 0x0, ttl 64, id 10878, offset 0, flags [none], proto ICMP (1), length 84)
        10.2.80.129 > 172.16.44.18: ICMP echo reply, id 18986, seq 10, length 64
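
The timestamps in this capture allow a quick bit of arithmetic: how long after 
an echo request does the matching locally generated "host unreachable" appear? 
A small sketch in shell (the helper names to_seconds and delay are made up for 
illustration):

```shell
# Convert an HH:MM:SS.ffffff timestamp to seconds since midnight.
to_seconds() {
    echo "$1" | awk -F: '{ printf "%.6f\n", $1 * 3600 + $2 * 60 + $3 }'
}

# Delay between two tcpdump timestamps, in seconds.
delay() {
    awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
        'BEGIN { printf "%.3f\n", b - a }'
}

delay 16:09:34.626373 16:09:37.695263   # request seq 1 -> unreachable for seq 1: 3.069
delay 16:09:38.705661 16:09:41.759252   # request seq 5 -> unreachable for seq 5: 3.054
delay 16:09:41.777449 16:09:44.831260   # request seq 8 -> unreachable for seq 8: 3.054
```

Each gap comes out at roughly 3 seconds. That would be consistent with a 
neighbour (ARP) lookup for the next hop timing out, since Linux by default 
sends 3 solicitations 1 second apart before giving up and reporting host 
unreachable. That is only one possible reading and is not confirmed by 
anything else in this capture.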

To me this indicates a routing issue. It looks like the container doesn't know 
how to route the ICMP reply back to the client. However, if I query the routing 
table it knows it needs to use the gateway interface:

    root@container1:~# ip route get 172.16.44.18
    172.16.44.18 via 10.2.28.1 dev veth-mgmt src 10.2.28.65 uid 0
        cache
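
That lookup picks the default source address (10.2.28.65). The echo reply 
itself is sent from 10.2.80.129, so it may be worth repeating the lookup with 
an explicit source and checking the neighbour entry for the next hop. A hedged 
sketch, assuming iproute2 is installed; route_field is a hypothetical helper 
that only parses the output:

```shell
# Print the value that follows keyword $1 in `ip route get` output read from stdin.
route_field() {
    awk -v key="$1" '{ for (i = 1; i < NF; i++) if ($i == key) { print $(i + 1); exit } }'
}

# Sample line copied from the lookup in this post.
sample='172.16.44.18 via 10.2.28.1 dev veth-mgmt src 10.2.28.65 uid 0'
echo "$sample" | route_field dev   # veth-mgmt
echo "$sample" | route_field via   # 10.2.28.1

# On the container itself: repeat the lookup with the reply's real source
# address, then show the neighbour (ARP) entry for the gateway.
if command -v ip >/dev/null 2>&1; then
    ip route get 172.16.44.18 from 10.2.80.129 || true
    ip neigh show 10.2.28.1 || true
fi
```

If the second lookup selects a different device or gateway than the first, 
policy routing rules (ip rule show) would be the next thing to inspect.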

And the really odd part is that if I try to actually ping *from* the container 
*to* my local box it works AND afterwards my original ping *from* my local box 
*to* the container starts to work.

To demonstrate: if I start a ping *from* my box *to* the container and, partway 
through, run a second ping *from* the container *to* my box, every packet sent 
after the second ping starts gets a reply:

    root@client:~$ date -u; ping -c 30 10.2.80.129; date -u
    Tue 30 Jun 2020 22:30:29 PM UTC
    PING 10.2.80.129 (10.2.80.129) 56(84) bytes of data.
    64 bytes from 10.2.80.129: icmp_seq=4 ttl=62 time=1043 ms
    64 bytes from 10.2.80.129: icmp_seq=5 ttl=62 time=19.0 ms
    64 bytes from 10.2.80.129: icmp_seq=6 ttl=62 time=4.33 ms
    64 bytes from 10.2.80.129: icmp_seq=7 ttl=62 time=4.19 ms
    64 bytes from 10.2.80.129: icmp_seq=8 ttl=62 time=4.15 ms
    64 bytes from 10.2.80.129: icmp_seq=9 ttl=62 time=4.19 ms
    64 bytes from 10.2.80.129: icmp_seq=10 ttl=62 time=4.51 ms
    64 bytes from 10.2.80.129: icmp_seq=11 ttl=62 time=4.26 ms
    64 bytes from 10.2.80.129: icmp_seq=12 ttl=62 time=4.39 ms
    64 bytes from 10.2.80.129: icmp_seq=13 ttl=62 time=4.15 ms
    64 bytes from 10.2.80.129: icmp_seq=14 ttl=62 time=4.37 ms
    64 bytes from 10.2.80.129: icmp_seq=15 ttl=62 time=4.17 ms
    64 bytes from 10.2.80.129: icmp_seq=16 ttl=62 time=4.39 ms
    64 bytes from 10.2.80.129: icmp_seq=17 ttl=62 time=4.38 ms
    64 bytes from 10.2.80.129: icmp_seq=18 ttl=62 time=4.34 ms
    64 bytes from 10.2.80.129: icmp_seq=19 ttl=62 time=4.32 ms
    64 bytes from 10.2.80.129: icmp_seq=20 ttl=62 time=4.80 ms
    64 bytes from 10.2.80.129: icmp_seq=21 ttl=62 time=4.28 ms
    64 bytes from 10.2.80.129: icmp_seq=22 ttl=62 time=4.32 ms
    64 bytes from 10.2.80.129: icmp_seq=23 ttl=62 time=4.28 ms
    64 bytes from 10.2.80.129: icmp_seq=24 ttl=62 time=4.22 ms
    64 bytes from 10.2.80.129: icmp_seq=25 ttl=62 time=4.25 ms
    64 bytes from 10.2.80.129: icmp_seq=26 ttl=62 time=4.21 ms
    64 bytes from 10.2.80.129: icmp_seq=27 ttl=62 time=4.34 ms
    64 bytes from 10.2.80.129: icmp_seq=28 ttl=62 time=4.31 ms
    64 bytes from 10.2.80.129: icmp_seq=29 ttl=62 time=4.15 ms
    64 bytes from 10.2.80.129: icmp_seq=30 ttl=62 time=4.60 ms

    --- 10.2.80.129 ping statistics ---
    30 packets transmitted, 27 received, 10% packet loss, time 29137ms
    rtt min/avg/max/mdev = 4.145/43.328/1042.959/196.063 ms, pipe 2
    Tue 30 Jun 2020 22:31:01 PM UTC

    root@container1:~# date -u; ping -c 4 172.16.44.18; date -u
    Tue Jun 30 22:30:33 UTC 2020
    PING 172.16.44.18 (172.16.44.18) 56(84) bytes of data.
    64 bytes from 172.16.44.18: icmp_seq=1 ttl=63 time=444 ms
    64 bytes from 172.16.44.18: icmp_seq=2 ttl=63 time=4.30 ms
    64 bytes from 172.16.44.18: icmp_seq=3 ttl=63 time=4.27 ms
    64 bytes from 172.16.44.18: icmp_seq=4 ttl=63 time=4.23 ms

    --- 172.16.44.18 ping statistics ---
    4 packets transmitted, 4 received, 0% packet loss, time 3003ms
    rtt min/avg/max/mdev = 4.238/114.371/444.666/190.695 ms
    Tue Jun 30 22:30:36 UTC 2020

From this point on I can successfully communicate with the veth-int-core 
interface. If no traffic is pushed to that interface for anywhere between 5 and 
60 minutes, the problem comes back. I've tried:

- Seeing if any information shows up in the kernel logs on the host (nothing 
that I could see).
- Restarting the containers.
- Restarting the LXD host.
- Moving the containers to another host (the problem persisted).
- Changing the rp_filter setting on one or both interfaces.
- Looking at the lxd logs to see if anything related shows up.
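
Since the problem only returns after 5 to 60 idle minutes, one way to gather 
more information might be to sample the neighbour table on a timer and compare 
the entries before and after connectivity drops. A minimal sketch, assuming 
iproute2 and a bash-like shell; stamp, watch_neigh, and the log path are 
hypothetical:

```shell
# Hypothetical helper: prefix every line read from stdin with a UTC timestamp.
stamp() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$line"
    done
}

# Sample the neighbour table of interface $1, $2 times, $3 seconds apart.
watch_neigh() {
    local iface="$1" count="$2" interval="$3" i
    for i in $(seq 1 "$count"); do
        ip neigh show dev "$iface" 2>&1 | stamp
        sleep "$interval"
    done
}

# Example: one sample per minute for two hours, appended to a log file.
# watch_neigh veth-mgmt 120 60 >> /tmp/neigh-veth-mgmt.log
```

Running the same sampler on the LXD host against the bridge or parent 
interface would show whether both sides lose the entry at the same time.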

Any pointers on where I could look to get more info would be appreciated.

-- 
Thanks,
Joshua Schaeffer

