epair failure in production on 11.1-STABLE (r328930) ? weird!

Dr Josef Karthauser Mon, 02 Jul 2018 14:12:43 -0700

We’re experiencing a strange production failure with epair (which we’re using 
to provide VIMAGE networking to jails).


FreeBSD s5 11.1-STABLE FreeBSD 11.1-STABLE #2 r328930: Tue Feb  6 16:05:59 GMT 
2018     root@s5:/usr/obj/usr/src/sys/TRUESPEED  amd64

It looks like epair has stopped forwarding packets between the pair 
interfaces. Our server has been up for 82 days and had been working fine, but 
packets have suddenly stopped being forwarded between epairs across the entire 
system (we’ve got around 30 epairs on the host). The result is an ARP 
resolution failure which is affecting all services. :(

Here’s the test. On a working machine this works fine:

        # Create an epair and put an IP address on it, so we can generate ARP
        # traffic with ping.
        root@magnesium:/usr/home/systems # ifconfig epair create
        epair7a
        root@magnesium:/usr/home/systems # ifconfig epair7a up
        root@magnesium:/usr/home/systems # ifconfig epair7b up
        root@magnesium:/usr/home/systems # ifconfig epair7a inet 10.140.0.1/30

        # Generate ARP traffic over the epair; we should see ARP requests on
        # epair7b.
        root@magnesium:/usr/home/systems # ping 10.140.0.2
        PING 10.140.0.2 (10.140.0.2): 56 data bytes

        # Watch traffic coming in from the epair
        root@magnesium:/usr/home/systems # tcpdump -i epair7b
        10:22:27.446651 ARP, Request who-has 10.140.0.2 tell 10.140.0.1, length 28
        10:22:28.475086 ARP, Request who-has 10.140.0.2 tell 10.140.0.1, length 28
        ^C
        2 packets captured
        2 packets received by filter
        0 packets dropped by kernel

Works fine.
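
The manual test above could be wrapped in a small script so that it can be
rerun unchanged on both hosts. This is only a sketch under assumptions: the
function names (count_arp_requests, epair_arp_check) and the 5 second capture
window are made up for illustration, it reuses the 10.140.0.1/30 test address
from above, it assumes timeout(1) is installed, and it must be run as root.

```shell
#!/bin/sh
# Sketch only: function names and the 5 second window are illustrative.

# Count ARP request lines in tcpdump text output read from stdin.
count_arp_requests() {
    # grep -c prints 0 but exits non-zero when nothing matches.
    grep -c 'ARP, Request who-has' || true
}

# Create a scratch epair, send pings out of the "a" side and count the
# ARP requests that arrive on the "b" side. Needs root on FreeBSD.
epair_arp_check() {
    a=$(ifconfig epair create) || return 1   # prints e.g. "epair7a"
    b="${a%a}b"                              # matching "b" side
    ifconfig "$a" inet 10.140.0.1/30 up
    ifconfig "$b" up

    # Nothing answers on 10.140.0.2, so ping only serves to trigger ARP.
    ping -c 3 -t 5 10.140.0.2 >/dev/null 2>&1 &
    seen=$(timeout 5 tcpdump -lni "$b" arp 2>/dev/null | count_arp_requests)
    wait

    ifconfig "$a" destroy                    # removes both sides
    echo "$a -> $b: $seen ARP request(s) seen"
    [ "$seen" -gt 0 ]
}
```

Calling `epair_arp_check` on a healthy host should report a non-zero count and
return success; on a host showing the failure described below it would be
expected to report 0 and return failure.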

However, on the failing machine no packets are forwarded any more (remember, 
it had been working fine for a few months and then suddenly stopped):

        root@s5:/usr/home/systems # ifconfig epair create
        epair19a
        root@s5:/usr/home/systems # ifconfig epair19a up
        root@s5:/usr/home/systems # ifconfig epair19b up
        root@s5:/usr/home/systems # ifconfig epair19a inet 10.140.0.1/30

        root@s5:/usr/home/systems # ping 10.140.0.2
        PING 10.140.0.2 (10.140.0.2): 56 data bytes

        root@s5:/usr/home/systems # tcpdump -ni epair19a
        09:24:20.396384 ARP, Request who-has 10.130.0.2 tell 10.130.0.1, length 28
        09:24:21.404737 ARP, Request who-has 10.130.0.2 tell 10.130.0.1, length 28
        ^C

        root@s5:/usr/home/systems # tcpdump -ni epair19b
        [Tumbleweed: no traffic seen]
        ^C

Has anyone seen this before? We’re going to reboot and see if that fixes the 
problem.

The failing kernel in question is:

FreeBSD s5 11.1-STABLE FreeBSD 11.1-STABLE #2 r328930: Tue Feb  6 16:05:59 GMT 
2018     root@s5:/usr/obj/usr/src/sys/TRUESPEED  amd64


Update: we’ve just seen Bugzilla report 22710, which reports that epair 
fails when the queue limit (net.link.epair.netisr_maxqlen) is hit. We’ve just 
introduced a high-bandwidth service on this machine, so that is probably 
what caused the issue.

We’ve currently got a value of:

        net.link.epair.netisr_maxqlen: 2100

root@s5:/usr/home/systems # netstat -Q
Configuration:
Setting                        Current        Limit
Thread count                         1            1
Default queue limit                256        10240
Dispatch policy                 direct          n/a
Threads bound to CPUs         disabled          n/a

Protocols:
Name   Proto QLimit Policy Dispatch Flags
ip         1    256   flow  default   ---
igmp       2    256 source  default   ---
rtsock     3    256 source  default   ---
arp        4    256 source  default   ---
ether      5    256 source   direct   ---
ip6        6    256   flow  default   ---
epair      8   2100    cpu  default   CD-

Workstreams:
WSID CPU   Name     Len WMark   Disp'd  HDisp'd   QDrops   Queued  Handled
   0   0   ip         0   253 385468689        0        0 49360754 434829441
   0   0   igmp       0     0        0        0        0        0        0
   0   0   rtsock     0     5        0        0        0     1144     1144
   0   0   arp        0     0  5573045        0        0        0  5573045
   0   0   ether      0     0 1125223166        0        0        0 1125223166
   0   0   ip6        0     4       90        0        0  1220274  1220364
   0   0   epair      0  2100        0        0      214 4994675481 4994675481

But we can’t see how much of the queue is currently being used, or what size we 
need to set it to.
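
The Workstreams table above may already give a partial answer. Assuming the
usual netstat(1) meaning of the columns (Len is the current queue length,
WMark the high-water mark, QDrops the packets dropped at the limit), the epair
row shows WMark equal to the 2100 limit and 214 drops, so the queue appears to
have filled at least once. A minimal sketch for pulling those fields out; the
function name epair_queue_stats is made up, and it assumes the columns stay
whitespace-separated as in the output above.

```shell
#!/bin/sh
# Sketch only: epair_queue_stats is an illustrative name.
# Reads `netstat -Q` output on stdin; prints one line per epair workstream.
epair_queue_stats() {
    # Workstream rows have 10 columns; Name is column 3, and Len, WMark
    # and QDrops are columns 4, 5 and 8. %s keeps large counters intact.
    awk '$3 == "epair" && NF == 10 {
        printf "cpu=%s len=%s wmark=%s qdrops=%s\n", $2, $4, $5, $8
    }'
}
```

Usage would be `netstat -Q | epair_queue_stats`, run periodically to watch
whether len climbs towards the limit or qdrops keeps increasing.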

But why has hitting the queue limit broken it entirely?

Help!

Cheers,
Joe
— 
Dr Josef Karthauser
Chief Technical Officer
(01225) 300371 / (07703) 596893
www.truespeed.com <http://www.truespeed.com/>
  / theTRUESPEED <http://www.facebook.com/theTRUESPEED> 
  @theTRUESPEED <https://twitter.com/thetruespeed>
 
