Re: carp flapping

2023-05-28 Thread Nick Holland

Followup...

On 5/12/23 08:17, Stuart Henderson wrote:
> On 2023-05-12, Nick Holland  wrote:
>> ...
>> I had several other people suggest network problems.  I'm not going to
>> say "impossible" or even "unlikely", but my understanding is that the
>> two machines are both plugged into the same switch, in the same rack.

I've since had someone more familiar with the physical environment say
my blind trust in their switch hw may be slightly misplaced. :)

> You can also look at
>
> netstat -ni -I ixl0
> netstat -ni -I ixl0 -e
> kstat ixl0:::

These looked REALLY clean.  No drops, fails or collisions.

> which may give some other clues
>
> even pfctl -si might have something relevant
>
>> Several people pointed out I was using the default advskew of 1 second,
>> which means a small network glitch (or system load?  maybe I'm all wrong
>> about this system never breaking a sweat, at least when it comes to
>> network traffic) would flip it, so I've increased it to 10 on both
>> machines (and apparently just induced a flip of my own. oops).  By the
>> nature of this system, some people will be annoyed by any flip, so it
>> really doesn't matter if it was a 1 second outage or a 30 second outage,
>> I just want the system available again after an unhappy event (or
>> routine maintenance).
>
> the coarse adjustment in seconds is advbase; advskew is a much smaller
> delay meant for a config with primary/backup where the backup advertises
> just slightly less frequently.

Um. yeah.  I set advbase, and typed advskew in the e-mail. my bad.
After setting it to 10, I have gone over two weeks without any flips, so
that looks like a pretty good fix.
 
Thanks for the guidance!


Nick.
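
A sketch of what the advbase change might look like in hostname.carp0,
based on the configuration posted at the start of the thread. The exact
line that ended up in use was not posted, so appending "advbase 10" to the
first line is an assumption:

```
inet a.b.c.240 255.255.255.0 128.100.17.255 vhid 1 carpdev ixl0 pass censored advbase 10
```

The same value can be applied to a running interface with
"ifconfig carp0 advbase 10"; as mentioned in the thread, changing it on a
live pair can itself cause a flip.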



Re: carp flapping

2023-05-16 Thread Kapetanakis Giannis

On 16/05/2023 00:11, Lyndon Nerenberg (VE7TFX/VE6BBM) wrote:

> Nick, spare yourself the pain and just designate one machine as the
> master.  This is how we run all our proxy server pairs (nginx,
> squid, other stuff).  For a pair fooa/foob, 'a' is the master, and
> gets advskew 100. The 'b' host gets 150. Make sure preemption is
> enabled.
>
> When it's upgrade time, upgrade the 'b' machine and reboot. If it
> looks stable, set its advskew to 50 and wait for it to pick up
> traffic.  Now upgrade and reboot the 'a' host. When it looks happy,
> set 'b's advskew back to 150.
>
> This keeps everything in a known state.  You are going to break
> connections no matter what -- even when you let the master float
> -- so you might as well do it under your own control.  We schedule
> our updates for off-peak hours, and accept that the flip is going
> to interrupt traffic.  You just have to live with it.
>
> We moved to this scheme on all our proxies and firewalls seven
> years ago and have never looked back.
>
> --lyndon

Totally agree on this. On top of that, add load balancers/routers to
the mix, running carp/relayd/pfsync/forwarding.


With sticky sessions, all requests will redirect to the same backend 
server and you can avoid breaking service connections.

These don't have to be big machines.

G



Re: carp flapping

2023-05-15 Thread Lyndon Nerenberg (VE7TFX/VE6BBM)
Nick, spare yourself the pain and just designate one machine as the
master.  This is how we run all our proxy server pairs (nginx,
squid, other stuff).  For a pair fooa/foob, 'a' is the master, and
gets advskew 100. The 'b' host gets 150. Make sure preemption is
enabled.

When it's upgrade time, upgrade the 'b' machine and reboot. If it
looks stable, set its advskew to 50 and wait for it to pick up
traffic.  Now upgrade and reboot the 'a' host. When it looks happy,
set 'b's advskew back to 150.

This keeps everything in a known state.  You are going to break
connections no matter what -- even when you let the master float
-- so you might as well do it under your own control.  We schedule
our updates for off-peak hours, and accept that the flip is going
to interrupt traffic.  You just have to live with it.

We moved to this scheme on all our proxies and firewalls seven
years ago and have never looked back.

--lyndon
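
A minimal sketch of the scheme described above, assuming the carp interface
is carp0 and the settings live in the usual sysctl.conf and hostname.carp0
files. The addresses, vhid, carpdev and password are placeholders borrowed
from the original post in this thread, not Lyndon's actual configuration:

```
# /etc/sysctl.conf on both hosts: let the preferred host take over
net.inet.carp.preempt=1

# /etc/hostname.carp0 on fooa (designated master)
inet a.b.c.240 255.255.255.0 128.100.17.255 vhid 1 carpdev ixl0 pass censored advskew 100

# /etc/hostname.carp0 on foob (backup)
inet a.b.c.240 255.255.255.0 128.100.17.255 vhid 1 carpdev ixl0 pass censored advskew 150
```

During an upgrade, the temporary change on 'b' would be something like
"ifconfig carp0 advskew 50", and "ifconfig carp0 advskew 150" to put it
back afterwards.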



Re: carp flapping

2023-05-12 Thread Kapetanakis Giannis
On 12/05/2023 14:43, Nick Holland wrote:
> I had several other people suggest network problems.  I'm not going to
> say "impossible" or even "unlikely", but my understanding is that the
> two machines are both plugged into the same switch, in the same rack.
>
> Several people pointed out I was using the default advskew of 1 second,
> which means a small network glitch (or system load?  maybe I'm all wrong
> about this system never breaking a sweat, at least when it comes to
> network traffic) would flip it, so I've increased it to 10 on both
> machines (and apparently just induced a flip of my own. oops).  By the
> nature of this system, some people will be annoyed by any flip, so it
> really doesn't matter if it was a 1 second outage or a 30 second outage,
> I just want the system available again after an unhappy event (or
> routine maintenance).
>
> Nick.

Usually it's a network problem. The big delay of 3 days you had also suggests 
that.

But on the other hand, I also had a similar problem in one of my load balancers 
(routing/fw/relayd), where the MASTER was becoming BACKUP for no obvious 
reason. I believed it was a network glitch, but couldn't trace it.

The problem after all was that they were pushing the limit of max pf states 
and relayd checks were failing. Not obvious to spot at all. I believe the default 
is 20K.

pfctl -sm
pfctl -si

After increasing that limit with "set limit states" in pf.conf, I've never had
a glitch again.

G
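
A sketch of the pf.conf change described above. The value 200000 is purely
illustrative, not the limit Giannis used; a sensible number depends on how
the "current entries" figure from pfctl -si compares with the states hard
limit reported by pfctl -sm:

```
# /etc/pf.conf: raise the hard limit on state table entries
# (200000 is a hypothetical value, chosen only for illustration)
set limit states 200000
```

After reloading the ruleset with "pfctl -f /etc/pf.conf", pfctl -sm should
show the new limit.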



Re: carp flapping

2023-05-12 Thread Stuart Henderson
On 2023-05-12, Nick Holland  wrote:
> On 5/12/23 03:28, Stuart Henderson wrote:
>> On 2023-05-12, Nick Holland  wrote:
>>> Here's the problem I've seen:  I have my two machines flipping state
>>> randomly(?).  This bothers me because that means it is breaking  people's
>>> downloads.  Longest period between flips was less than two weeks.
>>>
>>> So ... I cranked up the carp logging to 5 and then 7 to see what it had
>>> to say about why...and it had almost nothing to say.
>> 
>> Does netstat -s -p carp give any enlightenment?
>
>
> ok, I just skewed the stats by taking the opportunity to bring the now
> backup up to -current, so node1 does not have the most recent flap:
>
> node1 $ uptime
>   7:18AM  up  8:22, 1 user, load averages: 0.00, 0.05, 0.08
>
> node1 $ doas netstat -s -p carp
> carp:
>  29981 packets received (IPv4)
>  0 packets received (IPv6)
>  0 packets discarded for bad interface
>  0 packets discarded for wrong TTL
>  0 packets shorter than header
>  0 discarded for bad checksums
>  0 discarded packets with a bad version
>  0 discarded because packet too short
>  0 discarded for bad authentication
>  0 discarded for unknown vhid
>  0 discarded because of a bad address list
>  0 packets sent (IPv4)
>  0 packets sent (IPv6)
>  0 send failed due to mbuf memory error
>  0 transitions to master
>
>   node2 $ uptime
>   7:19AM  up 4 days, 20:58, 2 users, load averages: 0.83, 0.78, 0.73
>
> node2 $ netstat -s -p carp
> carp:
>  367836 packets received (IPv4)
>  0 packets received (IPv6)
>  0 packets discarded for bad interface
>  0 packets discarded for wrong TTL
>  0 packets shorter than header
>  0 discarded for bad checksums
>  0 discarded packets with a bad version
>  0 discarded because packet too short
>  0 discarded for bad authentication
>  0 discarded for unknown vhid
>  0 discarded because of a bad address list
>  52806 packets sent (IPv4)
>  0 packets sent (IPv6)
>  0 send failed due to mbuf memory error
>  2 transitions to master
>
>
> Will monitor going forward, though.
>
>
> I had several other people suggest network problems.  I'm not going to
> say "impossible" or even "unlikely", but my understanding is that the
> two machines are both plugged into the same switch, in the same rack.

You can also look at

netstat -ni -I ixl0
netstat -ni -I ixl0 -e
kstat ixl0:::

which may give some other clues

even pfctl -si might have something relevant

> Several people pointed out I was using the default advskew of 1 second,
> which means a small network glitch (or system load?  maybe I'm all wrong
> about this system never breaking a sweat, at least when it comes to
> network traffic) would flip it, so I've increased it to 10 on both
> machines (and apparently just induced a flip of my own. oops).  By the
> nature of this system, some people will be annoyed by any flip, so it
> really doesn't matter if it was a 1 second outage or a 30 second outage,
> I just want the system available again after an unhappy event (or
> routine maintenance).

the coarse adjustment in seconds is advbase; advskew is a much smaller
delay meant for a config with primary/backup where the backup advertises
just slightly less frequently.
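
The arithmetic behind that can be sketched in sh, assuming the interval
formula usually given for CARP: a host advertises about every
advbase + advskew/256 seconds, so advskew can add at most just under one
second while advbase moves the interval in whole seconds. carp_interval is
a hypothetical helper written for this example, not an existing tool:

```sh
# Approximate advertisement interval in seconds for a given advbase and
# advskew, assuming interval = advbase + advskew/256.
carp_interval() {
    awk -v base="$1" -v skew="$2" 'BEGIN { printf "%.3f\n", base + skew / 256 }'
}

carp_interval 1 0      # advbase 1, advskew 0
carp_interval 1 100    # advbase 1, advskew 100
carp_interval 10 0     # advbase 10, advskew 0
```

Comparing the second and third calls shows why raising advskew barely
changes how tolerant the pair is of a short glitch, while raising advbase
does.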





Re: carp flapping

2023-05-12 Thread Nick Holland

On 5/12/23 03:28, Stuart Henderson wrote:
> On 2023-05-12, Nick Holland  wrote:
>> Here's the problem I've seen:  I have my two machines flipping state
>> randomly(?).  This bothers me because that means it is breaking people's
>> downloads.  Longest period between flips was less than two weeks.
>>
>> So ... I cranked up the carp logging to 5 and then 7 to see what it had
>> to say about why...and it had almost nothing to say.
>
> Does netstat -s -p carp give any enlightenment?



ok, I just skewed the stats by taking the opportunity to bring the now
backup up to -current, so node1 does not have the most recent flap:

node1 $ uptime
 7:18AM  up  8:22, 1 user, load averages: 0.00, 0.05, 0.08

node1 $ doas netstat -s -p carp
carp:
29981 packets received (IPv4)
0 packets received (IPv6)
0 packets discarded for bad interface
0 packets discarded for wrong TTL
0 packets shorter than header
0 discarded for bad checksums
0 discarded packets with a bad version
0 discarded because packet too short
0 discarded for bad authentication
0 discarded for unknown vhid
0 discarded because of a bad address list
0 packets sent (IPv4)
0 packets sent (IPv6)
0 send failed due to mbuf memory error
0 transitions to master

 node2 $ uptime
 7:19AM  up 4 days, 20:58, 2 users, load averages: 0.83, 0.78, 0.73

node2 $ netstat -s -p carp
carp:
367836 packets received (IPv4)
0 packets received (IPv6)
0 packets discarded for bad interface
0 packets discarded for wrong TTL
0 packets shorter than header
0 discarded for bad checksums
0 discarded packets with a bad version
0 discarded because packet too short
0 discarded for bad authentication
0 discarded for unknown vhid
0 discarded because of a bad address list
52806 packets sent (IPv4)
0 packets sent (IPv6)
0 send failed due to mbuf memory error
2 transitions to master


Will monitor going forward, though.


I had several other people suggest network problems.  I'm not going to
say "impossible" or even "unlikely", but my understanding is that the
two machines are both plugged into the same switch, in the same rack.

Several people pointed out I was using the default advskew of 1 second,
which means a small network glitch (or system load?  maybe I'm all wrong
about this system never breaking a sweat, at least when it comes to
network traffic) would flip it, so I've increased it to 10 on both
machines (and apparently just induced a flip of my own. oops).  By the
nature of this system, some people will be annoyed by any flip, so it
really doesn't matter if it was a 1 second outage or a 30 second outage,
I just want the system available again after an unhappy event (or
routine maintenance).

Nick.
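
One way the monitoring mentioned above could be done is to pull the
"transitions to master" counter out of the netstat output at intervals and
watch for it changing. A minimal sketch in sh; carp_transitions is a
hypothetical helper, and the sample input is copied from the node2 output
in this message rather than read from a live system:

```sh
# Print the "transitions to master" counter from `netstat -s -p carp`
# output supplied on standard input. The pattern also accepts the
# singular "transition" in case the counter is 1.
carp_transitions() {
    awk '/transitions? to master/ { print $1 }'
}

# Sample lines from the node2 output above; on a live host this would be
#   netstat -s -p carp | carp_transitions
printf '52806 packets sent (IPv4)\n2 transitions to master\n' | carp_transitions
```

Run from cron with a timestamp in front, this would give a rough record of
when each flip happened, independent of what ends up in messages.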



Re: carp flapping

2023-05-12 Thread Stuart Henderson
On 2023-05-12, Nick Holland  wrote:
> Here's the problem I've seen:  I have my two machines flipping state
> randomly(?).  This bothers me because that means it is breaking  people's
> downloads.  Longest period between flips was less than two weeks.
>
> So ... I cranked up the carp logging to 5 and then 7 to see what it had
> to say about why...and it had almost nothing to say.

Does netstat -s -p carp give any enlightenment?




carp flapping

2023-05-11 Thread Nick Holland

Hi,

I have a couple identical servers that provide a few services (not FW or
gateway -- http, ftp, etc.).  Figured they would make a great CARP pair,
so if the primary broke, the secondary would take over immediately.
It would also make maintenance windows shorter...make changes on secondary
machine, test, reboot primary to force the secondary to become master.

The two machines should be equals.  I have no preference on running on
one machine or the other.  IF nothing breaks, I'd prefer that the one
that is serving keep serving until I tell it otherwise.  Both machines
should have no issue with performance with the tasks they have, lots of
proc, lots of RAM, nvme disk, etc.

Here's the problem I've seen:  I have my two machines flipping state
randomly(?).  This bothers me because that means it is breaking  people's
downloads.  Longest period between flips was less than two weeks.

So ... I cranked up the carp logging to 5 and then 7 to see what it had
to say about why...and it had almost nothing to say.

Here is the info from messages from both machines for the most recent
flip.  Past ones look basically the same.

Node 2:
/var/log $ zgrep carp0 messages
May  9 21:51:23 node2 /bsd: carp0: state transition: BACKUP -> MASTER
May  9 21:51:25 node2 /bsd: carp0: state transition: MASTER -> BACKUP
May 11 16:36:04 node2 /bsd: carp0: state transition: BACKUP -> MASTER


Node 1:
/var/log $ zgrep carp messages
May  9 21:51:25 node1 /bsd: carp0: state transition: MASTER -> BACKUP
May  9 21:51:28 node1 /bsd: carp0: state transition: BACKUP -> MASTER
May 11 16:36:07 node1 /bsd: carp0: state transition: MASTER -> BACKUP


hostname.carp0 from both machines:
inet a.b.c.240 255.255.255.0 128.100.17.255 vhid 1 carpdev ixl0 pass censored
inet alias a.b.c.241 255.255.255.255 128.100.17.255
inet alias a.b.c.243 255.255.255.255 128.100.17.255
inet alias a.b.c.246 255.255.255.255 128.100.17.255

verified identical (before slight anonymizing) on both systems.

hostname.ixl0 on node1:
inet a.b.c.248/24

hostname.ixl0 on node2:
inet a.b.c.247 0xff00

pf.conf includes this before any other "quick" statements:
pass quick inet proto carp all


Is there something I'm missing?  Incorrect expectations on my part?


Nick.

dmesg:
OpenBSD 7.3-current (GENERIC.MP) #1175: Wed May  3 08:19:33 MDT 2023
dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
real mem = 50078154752 (47758MB)
avail mem = 48540807168 (46292MB)
random: good seed from bootblocks
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 3.2 @ 0x6f3c3000 (84 entries)
bios0: vendor American Megatrends Inc. version "3.4" date 10/30/2020
bios0: Supermicro X11SPW-TF
efi0 at bios0: UEFI 2.7
efi0: American Megatrends rev 0x5000e
acpi0 at bios0: ACPI 6.2
acpi0: sleep states S0 S4 S5
acpi0: tables DSDT FACP FPDT FIDT SPMI UEFI SSDT MCFG HPET APIC MIGT MSCT PCAT 
PCCT RASF SLIT SRAT SVOS WDDT OEM4 OEM1 SSDT OEM3 SSDT SSDT DMAR HEST BERT ERST 
EINJ WSMT
acpi0: wakeup devices XHCI(S4) RP17(S4) PXSX(S4) RP18(S4) PXSX(S4) RP19(S4) 
PXSX(S4) RP20(S4) PXSX(S4) RP01(S4) PXSX(S4) RP02(S4) PXSX(S4) RP03(S4) 
PXSX(S4) RP04(S4) [...]
acpitimer0 at acpi0: 3579545 Hz, 24 bits
acpimcfg0 at acpi0
acpimcfg0: addr 0x8000, bus 0-255
acpihpet0 at acpi0: 2399 Hz
acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: Intel(R) Xeon(R) Bronze 3204 CPU @ 1.90GHz, 1900.06 MHz, 06-55-07
cpu0: 
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,TSC_ADJUST,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,PQM,MPX,AVX512F,AVX512DQ,RDSEED,ADX,SMAP,CLFLUSHOPT,CLWB,PT,AVX512CD,AVX512BW,AVX512VL,PKU,WAITPKG,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,XSAVEC,XGETBV1,XSAVES
cpu0: 32KB 64b/line 8-way D-cache, 32KB 64b/line 8-way I-cache, 1MB 64b/line 
16-way L2 cache, 8MB 64b/line 11-way L3 cache
cpu0: smt 0, core 0, package 0
mtrr: Pentium Pro MTRR support, 10 var ranges, 88 fixed ranges
cpu0: apic clock running at 25MHz
cpu0: mwait min=64, max=64, C-substates=0.2.0.2, IBE
cpu1 at mainbus0: apid 2 (application processor)
cpu1: Intel(R) Xeon(R) Bronze 3204 CPU @ 1.90GHz, 1900.09 MHz, 06-55-07
cpu1: 
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,TSC_ADJUST,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,PQM,MPX,AVX512F,AVX512DQ,RDSEED,ADX,SMAP,CLFLUSHOPT,CLWB,PT,AVX512CD,AVX512BW,AVX512VL,PKU,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,XSAVEC,XGETBV1