Re: carp flapping
Followup...

On 5/12/23 08:17, Stuart Henderson wrote:
> On 2023-05-12, Nick Holland wrote:
...
>> I had several other people suggest network problems. I'm not going to
>> say "impossible" or even "unlikely", but my understanding is that the
>> two machines are both plugged into the same switch, in the same rack.

I've since had someone more familiar with the physical environment say
my blind trust in their switch hw may be slightly misplaced. :)

> You can also look at
>
> netstat -ni -I ixl0
> netstat -ni -I ixl0 -e
> kstat ixl0:::

These looked REALLY clean. No drops, fails or collisions.

> which may give some other clues
>
> even pfctl -si might have something relevant
>
>> Several people pointed out I was using the default advskew of 1 second,
>> which means a small network glitch (or system load? maybe I'm all wrong
>> about this system never breaking a sweat, at least when it comes to
>> network traffic) would flip it, so I've increased it to 10 on both
>> machines (and apparently just induced a flip of my own. oops). By the
>> nature of this system, some people will be annoyed by any flip, so it
>> really doesn't matter if it was a 1 second outage or a 30 second outage,
>> I just want the system available again after an unhappy event (or
>> routine maintenance).
>
> the coarse adjustment in seconds is advbase, advskew is a much smaller
> delay meant for a config with primary/backup where the backup advertises
> just slightly less frequently.

Um. yeah. I set advbase, and typed advskew in the e-mail. my bad.

After setting it to 10, I have gone over two weeks without any flips, so
that looks like a pretty good fix.

Thanks for the guidance!

Nick.
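For readers who want to see why advbase 10 is so much more forgiving than advbase 1, here is a rough sketch of the timing arithmetic. It rests on an assumption that is not stated anywhere in this thread: that CARP advertises every advbase + advskew/256 seconds, and that a backup promotes itself after roughly three missed advertisements (3*advbase + advskew/256 seconds). The function names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers. The formulas are an assumption about CARP timing,
# computed in milliseconds so plain integer arithmetic is enough.

# carp_adv_interval_ms ADVBASE ADVSKEW -> time between advertisements
carp_adv_interval_ms() {
    echo $(( $1 * 1000 + $2 * 1000 / 256 ))
}

# carp_master_down_ms ADVBASE ADVSKEW -> silence a backup tolerates
# before taking over (about three missed advertisements)
carp_master_down_ms() {
    echo $(( 3 * $1 * 1000 + $2 * 1000 / 256 ))
}

# Compare the default advbase with the value Nick settled on.
for advbase in 1 10; do
    echo "advbase $advbase: advertise every $(carp_adv_interval_ms "$advbase" 0) ms," \
         "take over after $(carp_master_down_ms "$advbase" 0) ms of silence"
done
```

If that assumption holds, advbase 10 means a backup waits on the order of 30 seconds before taking over, which is the trade Nick describes: slower failover in exchange for riding out short glitches. The value itself would be set with `ifconfig carp0 advbase 10`, or with `advbase 10` on the carp line in hostname.carp0.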
Re: carp flapping
On 16/05/2023 00:11, Lyndon Nerenberg (VE7TFX/VE6BBM) wrote:
> Nick, spare yourself the pain and just designate one machine as
> the master. This is how we run all our proxy server pairs (nginx,
> squid, other stuff).
>
> For a pair fooa/foob, 'a' is the master, and gets advskew 100.
> The 'b' host gets 150. Make sure preemption is enabled.
>
> When it's upgrade time, upgrade the 'b' machine and reboot. If it
> looks stable, set its advskew to 50 and wait for it to pick up
> traffic. Now upgrade and reboot the 'a' host. When it looks happy,
> set 'b's advskew back to 150.
>
> This keeps everything in a known state. You are going to break
> connections no matter what -- even when you let the master float --
> so you might as well do it under your own control. We schedule our
> updates for off-peak hours, and accept that the flip is going to
> interrupt traffic. You just have to live with it.
>
> We moved to this scheme on all our proxies and firewalls seven
> years ago and have never looked back.
>
> --lyndon

Totally agree on this and on top of that add load balancers/routers in
the mix which will run carp/relayd/pfsync/forwarding.

With sticky sessions, all requests will redirect to the same backend
server and you can avoid breaking service connections.

These don't have to be big machines.

G
Re: carp flapping
Nick, spare yourself the pain and just designate one machine as
the master. This is how we run all our proxy server pairs (nginx,
squid, other stuff).

For a pair fooa/foob, 'a' is the master, and gets advskew 100.
The 'b' host gets 150. Make sure preemption is enabled.

When it's upgrade time, upgrade the 'b' machine and reboot. If it
looks stable, set its advskew to 50 and wait for it to pick up
traffic. Now upgrade and reboot the 'a' host. When it looks happy,
set 'b's advskew back to 150.

This keeps everything in a known state. You are going to break
connections no matter what -- even when you let the master float --
so you might as well do it under your own control. We schedule our
updates for off-peak hours, and accept that the flip is going to
interrupt traffic. You just have to live with it.

We moved to this scheme on all our proxies and firewalls seven
years ago and have never looked back.

--lyndon
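The scheme above can be walked through as a small dry run. This is only a sketch: it executes nothing on the hosts, `expected_master` is a made-up helper, and it assumes preemption is on (`sysctl net.inet.carp.preempt=1`) and that both hosts share the same advbase, so the lower advskew wins.

```shell
#!/bin/sh
# expected_master A_SKEW B_SKEW
# Assumption: with preemption enabled and equal advbase, the host with the
# lower advskew advertises more often and should end up MASTER.
expected_master() {
    if [ "$1" -lt "$2" ]; then
        echo fooa
    elif [ "$2" -lt "$1" ]; then
        echo foob
    else
        echo tie
    fi
}

# The three states of the upgrade procedure, using the advskew values
# from the message above. On foob the change would be applied with
# something like: ifconfig carp0 advskew 50
echo "normal operation (a=100, b=150):  master=$(expected_master 100 150)"
echo "foob upgraded, lowered to b=50:   master=$(expected_master 100 50)"
echo "fooa upgraded, foob back to 150:  master=$(expected_master 100 150)"
```

The point of the fixed numbers is that the master is always predictable: at every step exactly one host has the lower skew, so there is never a question of who should be serving.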
Re: carp flapping
On 12/05/2023 14:43, Nick Holland wrote:
> I had several other people suggest network problems. I'm not going to
> say "impossible" or even "unlikely", but my understanding is that the
> two machines are both plugged into the same switch, in the same rack.
>
> Several people pointed out I was using the default advskew of 1 second,
> which means a small network glitch (or system load? maybe I'm all wrong
> about this system never breaking a sweat, at least when it comes to
> network traffic) would flip it, so I've increased it to 10 on both
> machines (and apparently just induced a flip of my own. oops). By the
> nature of this system, some people will be annoyed by any flip, so it
> really doesn't matter if it was a 1 second outage or a 30 second outage,
> I just want the system available again after an unhappy event (or
> routine maintenance).
>
> Nick.

Usually it's a network problem. The big delay of 3 days you had also
suggests that.

But on the other hand, I also had a similar problem in one of my load
balancers (routing/fw/relayd), where the MASTER was becoming BACKUP for
no obvious reason. I believed it was a network glitch, but couldn't
trace it.

The problem, after all, was that they were pushing the limit of max pf
states and relayd checks were failing. Not obvious to spot at all. I
believe default is 20K.

pfctl -sm
pfctl -si

After increasing that limit with

set limit states

I've never had a glitch any more.

G
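To make that check less of a guessing game, the two pfctl outputs can be combined into a single percentage. This is a sketch only: `pf_state_pct` is a made-up helper, the sample text is invented rather than taken from a real machine, and the awk patterns assume the usual "current entries N" line from `pfctl -si` and "states hard limit N" line from `pfctl -sm`.

```shell
#!/bin/sh
# pf_state_pct SI_TEXT SM_TEXT
# SI_TEXT: output of `pfctl -si`; SM_TEXT: output of `pfctl -sm`.
# Prints how full the state table is, as a whole percentage.
pf_state_pct() {
    cur=$(printf '%s\n' "$1" | awk '/current entries/ { print $3; exit }')
    max=$(printf '%s\n' "$2" | awk '$1 == "states" { print $NF; exit }')
    echo $(( cur * 100 / max ))
}

# Invented sample text standing in for real pfctl output.
si='State Table                          Total             Rate
  current entries                    19500'
sm='states        hard limit    20000'

echo "pf state table is $(pf_state_pct "$si" "$sm")% full"
```

On a live system the call would look like `pf_state_pct "$(doas pfctl -si)" "$(doas pfctl -sm)"`. A value that keeps creeping toward 100 would point at the state limit rather than the network.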
Re: carp flapping
On 2023-05-12, Nick Holland wrote:
> On 5/12/23 03:28, Stuart Henderson wrote:
>> On 2023-05-12, Nick Holland wrote:
>>> Here's the problem I've seen: I have my two machines flipping state
>>> randomly(?). This bothers me because that means it is breaking people's
>>> downloads. Longest period between flips was less than two weeks.
>>>
>>> So ... I cranked up the carp logging to 5 and then 7 to see what it had
>>> to say about why...and it had almost nothing to say.
>>
>> Does netstat -s -p carp give any enlightenment?
>
>
> ok, I just skewed the stats by taking the opportunity to bring the now
> backup up to -current, so node1 does not have the most recent flap:
>
> node1 $ uptime
>  7:18AM  up  8:22, 1 user, load averages: 0.00, 0.05, 0.08
>
> node1 $ doas netstat -s -p carp
> carp:
>         29981 packets received (IPv4)
>         0 packets received (IPv6)
>                 0 packets discarded for bad interface
>                 0 packets discarded for wrong TTL
>                 0 packets shorter than header
>                 0 discarded for bad checksums
>                 0 discarded packets with a bad version
>                 0 discarded because packet too short
>                 0 discarded for bad authentication
>                 0 discarded for unknown vhid
>                 0 discarded because of a bad address list
>         0 packets sent (IPv4)
>         0 packets sent (IPv6)
>                 0 send failed due to mbuf memory error
>         0 transitions to master
>
> node2 $ uptime
>  7:19AM  up 4 days, 20:58, 2 users, load averages: 0.83, 0.78, 0.73
>
> $ ] netstat -s -p carp
> carp:
>         367836 packets received (IPv4)
>         0 packets received (IPv6)
>                 0 packets discarded for bad interface
>                 0 packets discarded for wrong TTL
>                 0 packets shorter than header
>                 0 discarded for bad checksums
>                 0 discarded packets with a bad version
>                 0 discarded because packet too short
>                 0 discarded for bad authentication
>                 0 discarded for unknown vhid
>                 0 discarded because of a bad address list
>         52806 packets sent (IPv4)
>         0 packets sent (IPv6)
>                 0 send failed due to mbuf memory error
>         2 transitions to master
>
>
> Will monitor going forward, though.
>
>
> I had several other people suggest network problems. I'm not going to
> say "impossible" or even "unlikely", but my understanding is that the
> two machines are both plugged into the same switch, in the same rack.

You can also look at

netstat -ni -I ixl0
netstat -ni -I ixl0 -e
kstat ixl0:::

which may give some other clues

even pfctl -si might have something relevant

> Several people pointed out I was using the default advskew of 1 second,
> which means a small network glitch (or system load? maybe I'm all wrong
> about this system never breaking a sweat, at least when it comes to
> network traffic) would flip it, so I've increased it to 10 on both
> machines (and apparently just induced a flip of my own. oops). By the
> nature of this system, some people will be annoyed by any flip, so it
> really doesn't matter if it was a 1 second outage or a 30 second outage,
> I just want the system available again after an unhappy event (or
> routine maintenance).

the coarse adjustment in seconds is advbase, advskew is a much smaller
delay meant for a config with primary/backup where the backup advertises
just slightly less frequently.
Re: carp flapping
On 5/12/23 03:28, Stuart Henderson wrote:
> On 2023-05-12, Nick Holland wrote:
>> Here's the problem I've seen: I have my two machines flipping state
>> randomly(?). This bothers me because that means it is breaking people's
>> downloads. Longest period between flips was less than two weeks.
>>
>> So ... I cranked up the carp logging to 5 and then 7 to see what it had
>> to say about why...and it had almost nothing to say.
>
> Does netstat -s -p carp give any enlightenment?


ok, I just skewed the stats by taking the opportunity to bring the now
backup up to -current, so node1 does not have the most recent flap:

node1 $ uptime
 7:18AM  up  8:22, 1 user, load averages: 0.00, 0.05, 0.08

node1 $ doas netstat -s -p carp
carp:
        29981 packets received (IPv4)
        0 packets received (IPv6)
                0 packets discarded for bad interface
                0 packets discarded for wrong TTL
                0 packets shorter than header
                0 discarded for bad checksums
                0 discarded packets with a bad version
                0 discarded because packet too short
                0 discarded for bad authentication
                0 discarded for unknown vhid
                0 discarded because of a bad address list
        0 packets sent (IPv4)
        0 packets sent (IPv6)
                0 send failed due to mbuf memory error
        0 transitions to master

node2 $ uptime
 7:19AM  up 4 days, 20:58, 2 users, load averages: 0.83, 0.78, 0.73

$ ] netstat -s -p carp
carp:
        367836 packets received (IPv4)
        0 packets received (IPv6)
                0 packets discarded for bad interface
                0 packets discarded for wrong TTL
                0 packets shorter than header
                0 discarded for bad checksums
                0 discarded packets with a bad version
                0 discarded because packet too short
                0 discarded for bad authentication
                0 discarded for unknown vhid
                0 discarded because of a bad address list
        52806 packets sent (IPv4)
        0 packets sent (IPv6)
                0 send failed due to mbuf memory error
        2 transitions to master


Will monitor going forward, though.


I had several other people suggest network problems. I'm not going to
say "impossible" or even "unlikely", but my understanding is that the
two machines are both plugged into the same switch, in the same rack.

Several people pointed out I was using the default advskew of 1 second,
which means a small network glitch (or system load? maybe I'm all wrong
about this system never breaking a sweat, at least when it comes to
network traffic) would flip it, so I've increased it to 10 on both
machines (and apparently just induced a flip of my own. oops). By the
nature of this system, some people will be annoyed by any flip, so it
really doesn't matter if it was a 1 second outage or a 30 second outage,
I just want the system available again after an unhappy event (or
routine maintenance).

Nick.
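For the "monitor going forward" part, the one counter that matters can be pulled out of the netstat output so that successive samples are easy to compare. A minimal sketch; `carp_transitions` is a made-up name, and the sample lines are copied from the node2 output in this message.

```shell
#!/bin/sh
# carp_transitions: read `netstat -s -p carp` output on stdin and print
# the "transitions to master" counter.
carp_transitions() {
    awk '/transitions to master/ { print $1; exit }'
}

# A few lines from the node2 output quoted in this thread.
sample='        52806 packets sent (IPv4)
        0 packets sent (IPv6)
                0 send failed due to mbuf memory error
        2 transitions to master'

printf '%s\n' "$sample" | carp_transitions
```

On a live machine this would be `netstat -s -p carp | carp_transitions`, run from cron or by hand. The counter restarts from zero at reboot, which is why node1 showed 0 after the upgrade.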
Re: carp flapping
On 2023-05-12, Nick Holland wrote:
> Here's the problem I've seen: I have my two machines flipping state
> randomly(?). This bothers me because that means it is breaking people's
> downloads. Longest period between flips was less than two weeks.
>
> So ... I cranked up the carp logging to 5 and then 7 to see what it had
> to say about why...and it had almost nothing to say.

Does netstat -s -p carp give any enlightenment?
carp flapping
Hi,

I have a couple identical servers that provide a few services (not FW or
gateway -- http, ftp, etc.). Figured they would make a great CARP pair,
so if the primary broke, the secondary would take over immediately. It
would also make maintenance windows shorter...make changes on secondary
machine, test, reboot primary to force the secondary to become master.

The two machines should be equals. I have no preference on running on
one machine or the other. IF nothing breaks, I'd prefer that the one
that is serving keep serving until I tell it otherwise. Both machines
should have no issue with performance with the tasks they have, lots of
proc, lots of RAM, nvme disk, etc.

Here's the problem I've seen: I have my two machines flipping state
randomly(?). This bothers me because that means it is breaking people's
downloads. Longest period between flips was less than two weeks.

So ... I cranked up the carp logging to 5 and then 7 to see what it had
to say about why...and it had almost nothing to say.

Here is the info from messages from both machines for the most recent
flip. Past ones look basically the same.

Node 2:
/var/log $ zgrep carp0 messages
May  9 21:51:23 node2 /bsd: carp0: state transition: BACKUP -> MASTER
May  9 21:51:25 node2 /bsd: carp0: state transition: MASTER -> BACKUP
May 11 16:36:04 node2 /bsd: carp0: state transition: BACKUP -> MASTER

Node 1:
/var/log $ zgrep carp messages
May  9 21:51:25 node1 /bsd: carp0: state transition: MASTER -> BACKUP
May  9 21:51:28 node1 /bsd: carp0: state transition: BACKUP -> MASTER
May 11 16:36:07 node1 /bsd: carp0: state transition: MASTER -> BACKUP

hostname.carp0 from both machines:
inet a.b.c.240 255.255.255.0 128.100.17.255 vhid 1 carpdev ixl0 pass censored
inet alias a.b.c.241 255.255.255.255 128.100.17.255
inet alias a.b.c.243 255.255.255.255 128.100.17.255
inet alias a.b.c.246 255.255.255.255 128.100.17.255

verified identical (before slight anonymizing) on both systems.
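As an aside on reading those log excerpts: when the flips are weeks apart it helps to tally them per host instead of eyeballing zgrep output. A small sketch, with `count_flips` as a made-up name and the input lines copied from the excerpts above:

```shell
#!/bin/sh
# count_flips: read /var/log/messages-style lines on stdin and print, for
# each host, how many times carp0 went from BACKUP to MASTER.
count_flips() {
    awk '/carp0: state transition: BACKUP -> MASTER/ { n[$4]++ }
         END { for (h in n) print h, n[h] }' | sort
}

# The six log lines quoted above, from both nodes.
logs='May  9 21:51:23 node2 /bsd: carp0: state transition: BACKUP -> MASTER
May  9 21:51:25 node2 /bsd: carp0: state transition: MASTER -> BACKUP
May 11 16:36:04 node2 /bsd: carp0: state transition: BACKUP -> MASTER
May  9 21:51:25 node1 /bsd: carp0: state transition: MASTER -> BACKUP
May  9 21:51:28 node1 /bsd: carp0: state transition: BACKUP -> MASTER
May 11 16:36:07 node1 /bsd: carp0: state transition: MASTER -> BACKUP'

printf '%s\n' "$logs" | count_flips
```

Against the real files this would be something like `zcat -f /var/log/messages* | count_flips` on each node; the exact file names depend on how log rotation is set up there.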
hostname.ixl0 on node1:
inet a.b.c.248/24

hostname.ixl0 on node2:
inet a.b.c.247 0xff00

pf.conf includes this before any other "quick" statements:
pass quick inet proto carp all

Is there something I'm missing? Incorrect expectations on my part?

Nick.

dmesg:
OpenBSD 7.3-current (GENERIC.MP) #1175: Wed May 3 08:19:33 MDT 2023
    dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
real mem = 50078154752 (47758MB)
avail mem = 48540807168 (46292MB)
random: good seed from bootblocks
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 3.2 @ 0x6f3c3000 (84 entries)
bios0: vendor American Megatrends Inc. version "3.4" date 10/30/2020
bios0: Supermicro X11SPW-TF
efi0 at bios0: UEFI 2.7
efi0: American Megatrends rev 0x5000e
acpi0 at bios0: ACPI 6.2
acpi0: sleep states S0 S4 S5
acpi0: tables DSDT FACP FPDT FIDT SPMI UEFI SSDT MCFG HPET APIC MIGT MSCT PCAT PCCT RASF SLIT SRAT SVOS WDDT OEM4 OEM1 SSDT OEM3 SSDT SSDT DMAR HEST BERT ERST EINJ WSMT
acpi0: wakeup devices XHCI(S4) RP17(S4) PXSX(S4) RP18(S4) PXSX(S4) RP19(S4) PXSX(S4) RP20(S4) PXSX(S4) RP01(S4) PXSX(S4) RP02(S4) PXSX(S4) RP03(S4) PXSX(S4) RP04(S4) [...]
acpitimer0 at acpi0: 3579545 Hz, 24 bits
acpimcfg0 at acpi0
acpimcfg0: addr 0x8000, bus 0-255
acpihpet0 at acpi0: 2399 Hz
acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: Intel(R) Xeon(R) Bronze 3204 CPU @ 1.90GHz, 1900.06 MHz, 06-55-07
cpu0: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,TSC_ADJUST,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,PQM,MPX,AVX512F,AVX512DQ,RDSEED,ADX,SMAP,CLFLUSHOPT,CLWB,PT,AVX512CD,AVX512BW,AVX512VL,PKU,WAITPKG,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,XSAVEC,XGETBV1,XSAVES
cpu0: 32KB 64b/line 8-way D-cache, 32KB 64b/line 8-way I-cache, 1MB 64b/line 16-way L2 cache, 8MB 64b/line 11-way L3 cache
cpu0: smt 0, core 0, package 0
mtrr: Pentium Pro MTRR support, 10 var ranges, 88 fixed ranges
cpu0: apic clock running at 25MHz
cpu0: mwait min=64, max=64, C-substates=0.2.0.2, IBE
cpu1 at mainbus0: apid 2 (application processor)
cpu1: Intel(R) Xeon(R) Bronze 3204 CPU @ 1.90GHz, 1900.09 MHz, 06-55-07
cpu1: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,TSC_ADJUST,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,PQM,MPX,AVX512F,AVX512DQ,RDSEED,ADX,SMAP,CLFLUSHOPT,CLWB,PT,AVX512CD,AVX512BW,AVX512VL,PKU,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,XSAVEC,XGETBV1