On 12/20/19 10:09 AM, Nick Wolff wrote:
Marko,

Are you aware of any write ups for using ng_eiface and ng_bridge instead of
if_bridge?
Look in /usr/share/examples/netgraph; there are a couple of examples of exactly what you ask for.
Thanks,

Nick Wolff

On Fri, Dec 20, 2019 at 6:22 AM Marko Zec <z...@fer.hr> wrote:

Perhaps you could ditch if_bridge(4) and epair(4), and try ng_eiface(4)
with ng_bridge(4) instead?  Works rock-solid 24/7 here on 11.2 / 11.3.
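
A minimal sketch of what that could look like, modelled on the scripts in
/usr/share/examples/netgraph. The node name `bnet0`, the link numbers, the
`run` helper and the `DRY_RUN` switch are assumptions made for illustration,
not part of Marko's setup. By default the commands are only printed.

```sh
#!/bin/sh
# Sketch only: bridge a physical NIC with ng_bridge(4) and attach one
# ng_eiface(4) for a VNET jail. Names are illustrative.

# Print the command unless DRY_RUN=0 is set, in which case execute it.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else echo "$*"; fi
}

# ng_bridge_setup <nic> <bridge-node-name>
ng_bridge_setup() {
    nic=$1
    br=$2
    run kldload -n ng_ether                          # makes "<nic>:" nodes available
    run ngctl mkpeer "${nic}:" bridge lower link0    # wire side of the NIC
    run ngctl name "${nic}:lower" "$br"
    run ngctl connect "${nic}:" "${br}:" upper link1 # host stack side of the NIC
    run ngctl msg "${nic}:" setpromisc 1             # accept frames for jail MACs
    run ngctl msg "${nic}:" setautosrc 0             # do not rewrite source MACs
}

# ng_eiface_add <bridge-node-name> <link-number> <new-ifname> <jail>
# <new-ifname> is the interface the kernel creates (ngeth0 for the first one).
ng_eiface_add() {
    run ngctl mkpeer "${1}:" eiface "link${2}" ether
    run ifconfig "$3" vnet "$4"                      # move it into the jail
}
```

For example, `ng_bridge_setup igb0 bnet0` followed by
`ng_eiface_add bnet0 2 ngeth0 testjail` prints the command sequence; with
`DRY_RUN=0` in the environment the same calls execute it.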

Marko

On Fri, 20 Dec 2019 11:19:24 +0100
"Patrick M. Hausen" <hau...@punkt.de> wrote:

Hi all,

we still experience occasional network outages in production,
yet have not been able to find the root cause.

We run around 50 servers with VNET jails, some of them with
a handful, the busiest ones with 50 or more jails each.

Every now and then the jails are no longer reachable over the
network. The server itself is up and running, all jails are
up and running, and one can ssh to the server, but none of the
jails can communicate over the network.

There seems to be no pattern to the time of occurrence except
that more jails on one system make it "more likely".
Also, having more than one bridge, e.g. for private networks
between jails, seems to increase the probability.
When a server shows the problem it tends to get into that state
rather frequently, with a couple of hours in between. Then again,
most servers run for weeks without exhibiting the problem.
That's what makes it so hard to reproduce. Over the last couple of
days one system was failing regularly until we reduced the number
of jails from around 80 to around 50. Now it seems stable again.

I have a test system with lots of jails that I put under load with
gatling; it has not shown a single failure so far :-(


Setup:

All jails are iocage jails with VNET interfaces and are managed
by iocage. They are connected to at least one bridge that starts
with the physical external interface as a member and gets the
jails' epair interfaces added as they start up.

ifconfig_igb0="-rxcsum -rxcsum6 -txcsum -txcsum6 -vlanhwtag -vlanhwtso up"
cloned_interfaces="bridge0"
ifconfig_bridge0_name="inet0"
ifconfig_inet0="addm igb0 up"
ifconfig_inet0_ipv6="inet6 <host-address>/64 auto_linklocal"

$ iocage get interfaces vpro0087
vnet0:inet0

$ ifconfig inet0
inet0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
       ether 90:1b:0e:63:ef:51
       inet6 fe80::921b:eff:fe63:ef51%inet0 prefixlen 64 scopeid 0x4
       inet6 <host-address> prefixlen 64
       nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
       groups: bridge
       id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
       maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
       root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
       member: vnet0.4 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
               ifmaxaddr 0 port 7 priority 128 path cost 2000
       member: vnet0.1 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
               ifmaxaddr 0 port 6 priority 128 path cost 2000
       member: igb0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
               ifmaxaddr 0 port 1 priority 128 path cost 2000000


What we tried:

At first we suspected that the bridge became "wedged" somehow. This was
corroborated by talking to various people at devsummits and EuroBSDCon,
with Kristof Provost specifically suggesting that if_bridge was
still under giant lock and that there might be a problem where the
lock is not released under some race condition, so that the entire
bridge subsystem would be stalled. That sounds plausible given the
random occurrence.

But I think we can rule out that one, because:

- ifconfig up/down does not help
- the host is still communicating fine over the same bridge interface
- tearing down the bridge, kldunload (!) of if_bridge.ko followed by
   a new kldload and reconstructing the members with `ifconfig addm`
   does not help, either
- only a host reboot restores function
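
For reference, the teardown and rebuild from the third item corresponds
roughly to the sequence below. This is a reconstruction from the description
above, not the exact commands we typed; the `run` helper and the `DRY_RUN`
switch are added for illustration so that the commands are only printed by
default.

```sh
#!/bin/sh
# Sketch only: destroy the bridge, reload if_bridge.ko, recreate the
# bridge under its old name and re-add the members.

# Print the command unless DRY_RUN=0 is set, in which case execute it.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else echo "$*"; fi
}

# rebuild_bridge <bridge-name> <member> [<member> ...]
rebuild_bridge() {
    br=$1
    shift
    run ifconfig "$br" destroy
    run kldunload if_bridge
    run kldload if_bridge
    run ifconfig bridge0 create        # comes up as bridge0 ...
    run ifconfig bridge0 name "$br"    # ... then gets renamed as in rc.conf
    for member in "$@"; do
        run ifconfig "$br" addm "$member"
    done
    run ifconfig "$br" up
}
```

Called as `rebuild_bridge inet0 igb0 vnet0.1 vnet0.4` it prints the
sequence for the bridge shown above. As noted, running it did not restore
connectivity.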

Finally I created a jail on the problem host that is not managed by
iocage. Please ignore the `iocage` in the path; I used it to populate
the root directory. The jail is not started by iocage at boot time, and
the manual config is this:

testjail {
         host.hostname = "testjail";   # hostname
         path = "/iocage/jails/testjail/root";     # root directory
         exec.clean;
         exec.system_user = "root";
         exec.jail_user = "root";
         vnet;
         vnet.interface = "epair999b";
         exec.prestart += "ifconfig epair999 create; ifconfig epair999a inet6 2A00:B580:8000:8000::1/64 auto_linklocal";
         exec.poststop += "sleep 2; ifconfig epair999a destroy; sleep 2";
         # Standard stuff
         exec.start += "/bin/sh /etc/rc";
         exec.stop = "/bin/sh /etc/rc.shutdown";
         exec.consolelog = "/var/log/jail_testjail_console.log";
         mount.devfs;          #mount devfs
         allow.raw_sockets;    #allow ping-pong
         devfs_ruleset="4";    #devfs ruleset for this jail
}

$ cat /iocage/jails/testjail/root/etc/rc.conf
hostname="testjail"

ifconfig_epair999b_ipv6="inet6 2A00:B580:8000:8000::2/64 auto_linklocal"

When I do `service jail onestart testjail` I can then ping6 the jail
from the host and the host from the jail. As you can see the
if_bridge is not involved in this traffic.

When the host is in the wedged state and I start this testjail the
same way, no communication across the epair interface is possible.

To me this seems to indicate that it is not the bridge but all epair
interfaces that stop working at the very same time.
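
The ping6 check against the test jail can be wrapped in a small probe, for
example to run from cron and record when the epair path stops working. This
is only a sketch: it assumes the test jail above is left running, and the
function name and the `PING6` override are made up for illustration.

```sh
#!/bin/sh
# Sketch only: report whether an address behind an epair answers ping6.
# PING6 can be overridden, e.g. for testing; it defaults to ping6.
PING6=${PING6:-ping6}

# probe_epair <address>: prints "ok" or "unreachable"
probe_epair() {
    if $PING6 -c 3 "$1" >/dev/null 2>&1; then
        echo "ok"
    else
        echo "unreachable"
    fi
}
```

Something like `echo "$(date) $(probe_epair 2A00:B580:8000:8000::2)"`
appended to a log file would then give a timestamp for the moment the host
enters the wedged state. A failed ping can of course have other causes, so
"unreachable" is a hint, not proof.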


OS is RELENG_11_3, hardware and specifically network adapters vary,
we have igb, ix, ixl, bnxt ...


Does anyone have a suggestion as to what diagnostic measures could help
to pinpoint the culprit? The random occurrence and the fact that the
problem seems to prefer the production environment make this a
real pain ...


Thanks and kind regards,
Patrick
_______________________________________________
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
