Thanks for the response.  I've mounted a ramdisk at /mnt and have run
"doas route -n monitor > /mnt/route.monitor" in a tmux session for a
few days.  Here are some details:

erl3-01$ grep carp1 route.monitor  | sort | uniq -c
  91 RTM_ADD: Add Route: len 192, priority 146, table 0, if# 6, name
carp1, pid: 0, seq 0, errno 0
 428 RTM_ADD: Add Route: len 192, priority 18, table 0, if# 6, name
carp1, pid: 0, seq 0, errno 0
  43 RTM_DELETE: Delete Route: len 192, priority 146, table 0, if# 6,
name carp1, pid: 0, seq 0, errno 0
 478 RTM_DELETE: Delete Route: len 192, priority 18, table 0, if# 6,
name carp1, pid: 0, seq 0, errno 0
  31 RTM_IFINFO: iface status change: len 168, if# 6, name carp1,
link: backup, mtu: 1500,
flags:<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST>
  31 RTM_IFINFO: iface status change: len 168, if# 6, name carp1,
link: invalid, mtu: 1500, flags:<UP,BROADCAST,SIMPLEX,MULTICAST>
  31 RTM_IFINFO: iface status change: len 168, if# 6, name carp1,
link: master, mtu: 1500,
flags:<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST>
   1 RTM_RESOLVE: Route created by cloning: len 192, priority 146,
table 0, if# 6, name carp1, pid: 0, seq 0, errno 0
 385 RTM_RESOLVE: Route created by cloning: len 192, priority 18,
table 0, if# 6, name carp1, pid: 0, seq 0, errno 0

erl3-01$ grep vlan100 route.monitor  | sort | uniq -c
  31 RTM_IFINFO: iface status change: len 168, if# 8, name vlan100,
link: active, mtu: 1500,
flags:<UP,BROADCAST,RUNNING,PPROMISC,SIMPLEX,MULTICAST>
  31 RTM_IFINFO: iface status change: len 168, if# 8, name vlan100,
link: no carrier, mtu: 1500,
flags:<UP,BROADCAST,RUNNING,PPROMISC,SIMPLEX,MULTICAST>

erl3-01$ grep cnmac2 route.monitor  | sort | uniq -c
  57 RTM_ADD: Add Route: len 192, priority 3, table 0, if# 3, name
cnmac2, pid: 0, seq 0, errno 0
  57 RTM_DELETE: Delete Route: len 192, priority 3, table 0, if# 3,
name cnmac2, pid: 0, seq 0, errno 0
  31 RTM_IFINFO: iface status change: len 168, if# 3, name cnmac2,
link: active, mtu: 1500,
flags:<UP,BROADCAST,RUNNING,PPROMISC,ALLMULTI,SIMPLEX,MULTICAST>
  31 RTM_IFINFO: iface status change: len 168, if# 3, name cnmac2,
link: no carrier, mtu: 1500,
flags:<UP,BROADCAST,RUNNING,PPROMISC,ALLMULTI,SIMPLEX,MULTICAST>

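It looks like the underlying cnmac2 interface is flapping: cnmac2 and
vlan100 each logged 31 "no carrier" and 31 "active" events, and carp1
logged 31 each of invalid/backup/master, so the carp transitions appear
to follow the parent's link state.  So, that's a bummer.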
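
The counts above don't show ordering, so here is a rough way to walk the capture in file order and see which interface changes link state first.  This is only a minimal sketch: the names link_events and transitions are made up for illustration, and it assumes each RTM_IFINFO header sits on one line of route.monitor (as in the grep output above) and that each message is preceded by a "got message of size N on <date>" line.  If that line is absent, the timestamp is simply None.

```python
import re
from collections import Counter

# One-line RTM_IFINFO header, as seen in the grep output above.
IFINFO = re.compile(r"^RTM_IFINFO: .*?name (?P<name>\S+), link: (?P<link>[^,]+),")
# Assumption: a "got message of size N on <date>" line precedes each message.
STAMP = re.compile(r"^got message of size \d+ on (?P<when>.+)$")


def link_events(lines, names):
    """Return (when, name, link) tuples, in file order, for the given interfaces."""
    events = []
    when = None  # most recent timestamp line seen, if any
    for line in lines:
        line = line.rstrip("\n")
        m = STAMP.match(line)
        if m:
            when = m.group("when")
            continue
        m = IFINFO.match(line)
        if m and m.group("name") in names:
            events.append((when, m.group("name"), m.group("link")))
    return events


def transitions(events):
    """Count consecutive pairs of (name, link) events, ignoring timestamps."""
    pairs = Counter()
    for prev, cur in zip(events, events[1:]):
        pairs[(prev[1:], cur[1:])] += 1
    return pairs
```

Something like link_events(open("/mnt/route.monitor"), {"cnmac2", "vlan100", "carp1"}) would give the ordered list.  If cnmac2 going "no carrier" is consistently followed by vlan100 and then carp1 changing state, that would point at the physical link rather than at carp itself.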

As underpowered as this machine generally is, might the kernel be
overwhelmed with other tasks, so that some watchdog timeout expires
and marks the cnmac2 interface as down?

Just grasping for something here...my next steps are to swap this unit
out with the other one (to try and eliminate hardware failure of THIS
unit).  Any other suggestions?

On Mon, Feb 1, 2021 at 3:04 AM David Gwynne <da...@gwynne.id.au> wrote:
>
>
>
> > On 1 Feb 2021, at 6:02 pm, Bryan Stenson <bryan.sten...@gmail.com> wrote:
> >
> > Hi all -
> >
> > I'm trying to set up a pair of ERL3 octeon routers in master/standby
> > mode via carp/pfsync to route traffic from my internal lan to the
> > internet.  I've seen strange behavior wrt carp on these machines, so
> > in an attempt to reduce the problem, I've removed one completely.
> >
> > Even with only a single box (ERL3-01) on the network configured as a
> > carp member, the carp interface state periodically changes (as seen
> > from ifstated(8)).
> >
> > I'm wondering if disconnecting the other ERL3 device is a valid isolated 
> > test.
> > 1.  Will/might this cause issues with the carp device, as it cannot
> > determine state from any other host?
>
> If carp state flaps around while it is the only device on the network, that 
> would imply the parent device is flapping around.
>
> > 2.  Will/might this cause issues as it cannot send/receive pfsync
> > updates (the other node is disconnected).
>
> pfsync doesn't really care about carp state.
>
> > 3.  Is there something else in my setup causing carp to fail here?
>
> I'd be running "route monitor" and looking for link state changes on the carp 
> parent interface.
>
> > 4.  Could this be hardware/temperature related to this ERL3?  Wouldn't
> > I see an additional error in dmesg if the physical device (cnmac2)
> > failed periodically?
> >
> > I'd appreciate any pointers here...I feel like I'm missing something dumb.
>
> My first ideas are above. If it turns out the carp parent is stable we can 
> try to come up with something else.
>
> dlg
>
> >
> > Thanks in advance.
> >
> > Bryan
> >
> > Here are some of my configs.  If I've missed including something
> > critical to help describe my setup, please let me know and I'll add
> > it.
> >
> > ## Help me OBSD-Misc Kenobi.  You're my only hope. ##
> >
> > erl3-01# uname -a
> > OpenBSD erl3-01.siliconvortex.com 6.8 GENERIC#522 octeon
> >
> > erl3-01# dmesg
> > ...
> > carp1: state transition: BACKUP -> MASTER
> > carp1: state transition: BACKUP -> MASTER
> > carp1: state transition: BACKUP -> MASTER
> > carp1: state transition: BACKUP -> MASTER
> > carp1: state transition: BACKUP -> MASTER
> > carp1: state transition: BACKUP -> MASTER
> >
> > erl3-01# tail mbox
> > Mon, 1 Feb 2021 06:49:26 +0000 (UTC)
> > From: Charlie Root <r...@erl3-01.siliconvortex.com>
> > Date: Mon, 1 Feb 2021 06:49:25 +0000 (UTC)
> > To: root@localhost
> > Subject: carp master changed
> > Message-ID: <515eb74cff427...@erl3-01.siliconvortex.com>
> > Status: RO
> >
> > master is now erl3-01.siliconvortex.com
> >
> >
> > erl3-01# sysctl -a | grep carp
> > net.inet.carp.allow=1
> > net.inet.carp.preempt=1
> > net.inet.carp.log=2
> >
> > erl3-01# cat /etc/hostname.carp1
> > #carp for lan side
> > 192.168.122.1/23 carpdev vlan100 vhid 1 pass somethinglongandsecret
> >
> > erl3-01# cat /etc/hostname.vlan100
> > vnetid 100 parent cnmac2
> > up
> >
> > erl3-01# cat /etc/hostname.cnmac2
> > inet 192.168.1.253 255.255.254.0
> >
> > erl3-01# cat /etc/hostname.pfsync0
> > up syncdev cnmac1
> >
> > erl3-01# cat /etc/hostname.cnmac1
> > inet 10.10.200.1 255.255.255.252
> >
> > erl3-01# cat /etc/ifstated.conf
> > # Initial State
> > init-state auto
> >
> > # Macros
> > if_carp_up="carp1.link.up"
> > if_carp_down="!carp1.link.up"
> >
> > state auto {
> >  if $if_carp_up {
> >    set-state master
> >  }
> >
> >  if $if_carp_down {
> >    set-state backup
> >  }
> > }
> >
> > state master {
> >  init {
> >    run "echo master is now `hostname` | mail -s 'carp master changed' root@localhost"
> >  }
> >
> >  if $if_carp_down {
> >    set-state backup
> >  }
> > }
> >
> > state backup {
> >  init {
> >    run "echo backup is now `hostname` | mail -s 'carp master changed' root@localhost"
> >  }
> >
> >  if $if_carp_up {
> >    set-state master
> >  }
> > }
> >
> > erl3-01# cat /etc/pf.conf
> > # adapted from https://www.openbsd.org/faq/pf/example1.html
> > wan_dev = cnmac0
> > lan_dev = cnmac2
> > carp_dev = vlan100
> > pfsync_dev = cnmac1
> > table <martians> { 0.0.0.0/8 10.0.0.0/8 127.0.0.0/8 169.254.0.0/16     \
> >    172.16.0.0/12 192.0.0.0/24 192.0.2.0/24 224.0.0.0/3 \
> >    192.168.0.0/16 198.18.0.0/15 198.51.100.0/24        \
> >    203.0.113.0/24 }
> >
> > # carp
> > pass quick on $lan_dev proto carp keep state (no-sync)
> >
> > # pfsync
> > pass quick on $pfsync_dev proto pfsync keep state (no-sync)
> >
> > set block-policy drop
> > set loginterface $wan_dev
> > set skip on lo0
> >
> > match in all scrub (no-df random-id max-mss 1440)
> >
> > # redirect DNS queries to localhost
> > pass in quick on { $carp_dev $lan_dev } proto { udp tcp } from any to any port domain rdr-to 192.168.1.253 port domain
> >
> > # NAT to the world
> > match out on $wan_dev inet from !($wan_dev:network) to any nat-to ($wan_dev:0)
> >
> > antispoof quick for { $wan_dev }
> >
> > # martians
> > block in quick on $wan_dev from <martians> to any
> > block return out quick on $wan_dev from any to <martians>
> >
> > block all
> >
> > # manage buffer bloat
> > queue outq on $wan_dev flows 1024 bandwidth 3M max 3M qlimit 1024 default
> > queue inq on $lan_dev flows 1024 bandwidth 45M max 45M qlimit 1024 default
> >
> > pass out quick inet
> >
> > pass in on { $carp_dev $lan_dev } inet
> >
>
