i have a pair of openbsd boxes, each running a secondary sshd service on an alternate port on its primary ip address. in addition to their primary addresses, the boxes share a carp address.

the secondary sshd service listens on the primary address and an alternate port (10022). the secondary sshd configuration, users, directories and such are synchronized between the two boxes, so the secondary sshd services are identical except for their listening addresses.

each of these boxes is also running relayd. the relayd configurations are identical, being synchronized the same way as the secondary sshd services. relayd listens on the carp address and relays inbound connections to the two primary addresses on the alternate port.

the overall goal is to provide a clustered sftp service that protects against single points of failure and allows for growth and maintenance (e.g. adding more cluster nodes and/or bringing one down for patching).

overall this appears to be working, but (and here is why i'm posting) relayd has been failing its backend hosts at somewhat random intervals, with complaints of timeouts. we don't always notice these (though they are definitely being logged) because relayd simply relays connections to whichever boxes remain "up". however, we've seen periods where both backend services get marked down at the same time, and then our inbound transmissions fail. at that point it comes to everybody's attention.

initially i thought we were having some kind of issue with our switching or something. but then i realized that relayd is failing the service running on the same host as well as on other hosts on the network.

if relayd is listening on address X on a carp interface and talking to address Y on the physical interface that the carp interface is attached to, do these packets physically leave the box, or does openbsd route them locally? if the former, we could still have network switching issues. if the latter, i probably have an issue with my configuration on openbsd. in case it's the latter, i'll show my configuration below in hopes that someone can point out something i'm doing incorrectly. i'll also include dmesg.
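to take relayd out of the picture, the same kind of check can be approximated by hand against each backend. this is only a rough sketch: banner_ok and probe are names made up for illustration, the 2-second nc timeout is a guess, and it assumes the base-system nc(1), which keeps reading from the network after stdin hits end-of-file.

```sh
#!/bin/sh
# sketch: fetch the ssh banner from a backend by hand, roughly what the
# relayd check in this post expects ('SSH-2*'). banner_ok and probe are
# made-up helper names, not part of relayd or openssh.

# return 0 if the given line looks like an ssh protocol 2 banner
banner_ok() {
    case "$1" in
        SSH-2*) return 0 ;;
        *)      return 1 ;;
    esac
}

# connect to host $1 port $2, read the first line, report up or down.
# nc -w 2 gives up after 2 idle seconds (an assumption, tune as needed).
probe() {
    banner=`nc -w 2 "$1" "$2" </dev/null 2>/dev/null | head -n 1 | tr -d '\r'`
    if banner_ok "$banner"; then
        echo "$1:$2 up ($banner)"
    else
        echo "$1:$2 down"
    fi
}
```

running something like `probe 10.11.12.102 10022` and `probe 10.11.12.103 10022` in a loop on each node, alongside relayd, should show whether a hand-made connection to the local backend stalls at the same moments relayd marks it down.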
we do have cron jobs running at 15-minute intervals, which very roughly lines up with at least some of the relayd messages shown below, so i have to wonder about limits, memory, etc. i haven't seen anything listed in errata that leads me to believe it's a problem with the code on the system, but i'm definitely not running the latest code. at this point i'm more interested in vetting any configuration issues or misconceptions on my part, and possibly ruling out these systems as causes of the problem.

help, as well as constructive criticism, will certainly be appreciated!

----

# w|grep load
11:47AM  up 24 days, 11:04, 2 users, load averages: 0.14, 0.27, 0.55
# uname -a
OpenBSD clusternode1 5.3 GENERIC.MP#58 i386
# vmstat
 procs    memory       page                    disks    traps          cpu
 r b w    avm     fre  flt  re  pi  po  fr  sr cd0 sd0  int   sys   cs us sy id
 1 7 0  24020 2403692  363   0   0   0   0   0   0   0   54  1186   79  1  1 98

======== /etc/hostname.em0
--------
inet 10.11.12.102 255.255.255.0
-inet6
--------

======== /etc/hostname.carp100
--------
inet 10.11.12.100 255.255.255.0 NONE vhid 100 pass mycarppass advbase 1 advskew 1 description sftpcluster
-inet6
--------

======== /usr/local/etc/sftp/run
--------
#!/bin/sh
exec 2>&1
PORT=10022
MYDIR=`pwd`
HOSTKEYS=
for K in ${MYDIR}/etc/*key ; do HOSTKEYS="${HOSTKEYS} -h ${K}" ; done
SSHD=`which sshd`
MYIP=`ifconfig egress | awk '/inet / { print $2 }'`
exec ${SSHD} -f ${MYDIR}/sftp.config \
  ${HOSTKEYS} \
  -o ListenAddress=${MYIP}:${PORT} \
  -D -e
--------

======== /usr/local/etc/sftp/sftp.config
--------
Protocol 2
SyslogFacility AUTH
LogLevel INFO
PermitRootLogin no
StrictModes yes
MaxAuthTries 6
##MaxSessions 10
AuthorizedKeysFile /usr/local/etc/sftp/authorized_keys/%u.pub
PasswordAuthentication yes
PermitEmptyPasswords no
AllowAgentForwarding no
AllowTcpForwarding no
GatewayPorts no
X11Forwarding no
UsePrivilegeSeparation sandbox
PermitUserEnvironment no
Compression delayed
ClientAliveInterval 0
ClientAliveCountMax 3
PidFile /var/run/sftp.pid
##MaxStartups 10:30:100
PermitTunnel no
ChrootDirectory %h
#Port 10022
UseDNS no
Subsystem sftp internal-sftp
ForceCommand internal-sftp -l VERBOSE
--------

======== /usr/local/etc/relayd/run
--------
#!/bin/sh
exec 2>&1
RELAYD=`which relayd`
sleep 1
exec ${RELAYD} -d -v -f ./relayd.conf
--------

======== /usr/local/etc/relayd/relayd.conf
--------
table <sftpcluster> { 10.11.12.102 10.11.12.103 } # sftpnode1/sftpnode2
relay "sftpcluster" {
        listen on carp100 port 10022
        forward to <sftpcluster> port 10022 mode loadbalance check send 'foo' expect 'SSH-2*'
}
--------

output from relayd:

/var/log/authlog:Mar 18 00:00:02 clusternode1 relayd: host 10.10.10.103, check send expect (209ms), state up -> down, availability 99.51%
/var/log/authlog:Mar 18 00:00:02 clusternode1 relayd: host 10.10.10.102, check send expect (210ms), state up -> down, availability 99.59%
/var/log/authlog:Mar 18 00:00:12 clusternode1 relayd: host 10.10.10.102, check send expect (109ms), state down -> up, availability 99.59%
/var/log/authlog:Mar 18 00:00:12 clusternode1 relayd: host 10.10.10.103, check send expect (117ms), state down -> up, availability 99.51%
/var/log/authlog:Mar 18 00:01:02 clusternode1 relayd: host 10.10.10.103, check send expect (209ms), state up -> down, availability 99.51%
/var/log/authlog:Mar 18 00:01:02 clusternode1 relayd: host 10.10.10.102, check send expect (210ms), state up -> down, availability 99.59%
/var/log/authlog:Mar 18 00:01:12 clusternode1 relayd: host 10.10.10.102, check send expect (111ms), state down -> up, availability 99.59%
/var/log/authlog:Mar 18 00:01:12 clusternode1 relayd: host 10.10.10.103, check send expect (118ms), state down -> up, availability 99.51%
/var/log/authlog:Mar 18 00:15:03 clusternode1 relayd: host 10.10.10.102, check send expect (211ms), state up -> down, availability 99.58%
/var/log/authlog:Mar 18 00:15:03 clusternode1 relayd: host 10.10.10.103, check send expect (211ms), state up -> down, availability 99.51%
/var/log/authlog:Mar 18 00:15:13 clusternode1 relayd: host 10.10.10.102, check send expect (109ms), state down -> up, availability 99.58%
/var/log/authlog:Mar 18 00:15:13 clusternode1 relayd: host 10.10.10.103, check send expect (123ms), state down -> up, availability 99.51%
/var/log/authlog:Mar 18 00:30:04 clusternode1 relayd: host 10.10.10.102, check send expect (305ms), state up -> down, availability 99.58%
/var/log/authlog:Mar 18 00:30:04 clusternode1 relayd: host 10.10.10.103, check send expect (306ms), state up -> down, availability 99.51%
/var/log/authlog:Mar 18 00:30:14 clusternode1 relayd: host 10.10.10.103, check send expect (112ms), state down -> up, availability 99.51%
/var/log/authlog:Mar 18 00:30:14 clusternode1 relayd: host 10.10.10.102, check send expect (123ms), state down -> up, availability 99.58%
/var/log/authlog:Mar 18 00:45:05 clusternode1 relayd: host 10.10.10.102, check send expect (209ms), state up -> down, availability 99.58%
/var/log/authlog:Mar 18 00:45:14 clusternode1 relayd: host 10.10.10.102, check send expect (113ms), state down -> up, availability 99.58%
/var/log/authlog:Mar 18 01:00:06 clusternode1 relayd: host 10.10.10.103, check send expect (200ms), state up -> down, availability 99.51%
/var/log/authlog:Mar 18 01:00:15 clusternode1 relayd: host 10.10.10.103, check send expect (128ms), state down -> up, availability 99.51%
/var/log/authlog:Mar 18 01:30:07 clusternode1 relayd: host 10.10.10.103, check send expect (202ms), state up -> down, availability 99.51%
/var/log/authlog:Mar 18 01:30:17 clusternode1 relayd: host 10.10.10.103, check send expect (116ms), state down -> up, availability 99.51%
/var/log/authlog:Mar 18 02:30:01 clusternode1 relayd: host 10.10.10.103, check send expect (209ms), state up -> down, availability 99.51%
/var/log/authlog:Mar 18 02:30:11 clusternode1 relayd: host 10.10.10.103, check send expect (115ms), state down -> up, availability 99.51%

# dmesg
OpenBSD 5.3 (GENERIC.MP) #58: Tue Mar 12 18:43:53 MDT 2013
    dera...@i386.openbsd.org:/usr/src/sys/arch/i386/compile/GENERIC.MP
cpu0: AMD Opteron(tm) Processor 6278 ("AuthenticAMD" 686-class, 2048KB L2 cache) 2.40 GHz
cpu0: FPU,V86,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,NXE,MMXX,FFXSR,LONG,SSE3,CX16,POPCNT,LAHF,EAPICSP,ABM,SSE4A,MASSE,3DNOWP,OSVW,ITSC
real mem = 3220697088 (3071MB)
avail mem = 3157082112 (3010MB)
mainbus0 at root
bios0 at mainbus0: AT/286+ BIOS, date 07/30/13, BIOS32 rev. 0 @ 0xfd780, SMBIOS rev. 2.4 @ 0xe0010 (364 entries)
bios0: vendor Phoenix Technologies LTD version "6.00" date 07/30/2013
bios0: VMware, Inc. VMware Virtual Platform
acpi0 at bios0: rev 2
acpi0: sleep states S0 S1 S4 S5
acpi0: tables DSDT FACP BOOT APIC MCFG SRAT HPET WAET
acpi0: wakeup devices PCI0(S3) USB_(S1) P2P0(S3) S1F0(S3) S2F0(S3) S3F0(S3) S4F0(S3) S5F0(S3) S6F0(S3) S7F0(S3) S8F0(S3) S9F0(S3) S10F(S3) S11F(S3) S12F(S3) S13F(S3) S14F(S3) S15F(S3) S16F(S3) S17F(S3) S18F(S3) S19F(S3) S20F(S3) S21F(S3) S22F(S3) S23F(S3) S24F(S3) S25F(S3) S26F(S3) S27F(S3) S28F(S3) S29F(S3) S30F(S3) S31F(S3) S32F(S3) P2P1(S3) S1F0(S3) S2F0(S3) S3F0(S3) S4F0(S3) S5F0(S3) S6F0(S3) S7F0(S3) S8F0(S3) S9F0(S3) S10F(S3) S11F(S3) S12F(S3) S13F(S3) S14F(S3) S15F(S3) S16F(S3) S17F(S3) S18F(S3) S19F(S3) S20F(S3) S21F(S3) S22F(S3) S23F(S3) S24F(S3) S25F(S3) S26F(S3) S27F(S3) S28F(S3) S29F(S3) S30F(S3) S31F(S3) S32F(S3) P2P2(S3) S1F0(S3) S2F0(S3) S3F0(S3) S4F0(S3) S5F0(S3) S6F0(S3) S7F0(S3) S8F0(S3) S9F0(S3) S10F(S3) S11F(S3) S12F(S3) S13F(S3) S14F(S3) S15F(S3) S16F(S3) S17F(S3) S18F(S3) S19F(S3) S20F(S3) S21F(S3) S22F(S3) S23F(S3) S24F(S3) S25F(S3) S26F(S3) S27F(S3) S28F(S3) S29F(S3) S30F(S3) S31F(S3) S32F(S3) P2P3(S3) S1F0(S3) S2F0(S3) S3F0(S3) S4F0(S3) S5F0(S3) S6F0(S3) S7F0(S3) S8F0(S3) S9F0(S3) S10F(S3) S11F(S3) S12F(S3) S13F(S3) S14F(S3) S15F(S3) S16F(S3) S17F(S3) S18F(S3) S19F(S3) S20F(S3) S21F(S3) S22F(S3) S23F(S3) S24F(S3) S25F(S3) S26F(S3) S27F(S3) S28F(S3) S29F(S3) S30F(S3) S31F(S3) S32F(S3) PE40(S3) S1F0(S3) PE50(S3) S1F0(S3) PE60(S3) S1F0(S3) PE70(S3) S1F0(S3) PE80(S3) S1F0(S3) PE90(S3) S1F0(S3) PEA0(S3) S1F0(S3) PEB0(S3) S1F0(S3) PEC0(S3) S1F0(S3) PED0(S3) S1F0(S3) PEE0(S3) S1F0(S3) PE41(S3) S1F0(S3) PE42(S3) S1F0(S3) PE43(S3) S1F0(S3) PE44(S3) S1F0(S3) PE45(S3) S1F0(S3) PE46(S3) S1F0(S3) PE47(S3) S1F0(S3) PE51(S3) S1F0(S3) PE52(S3) S1F0(S3) PE53(S3) S1F0(S3) PE54(S3) S1F0(S3) PE55(S3) S1F0(S3) PE56(S3) S1F0(S3) PE57(S3) S1F0(S3) PE61(S3) S1F0(S3) PE62(S3) S1F0(S3) PE63(S3) S1F0(S3) PE64(S3) S1F0(S3) PE65(S3) S1F0(S3) PE66(S3) S1F0(S3) PE67(S3) S1F0(S3) PE71(S3) S1F0(S3) PE72(S3) S1F0(S3) PE73(S3) S1F0(S3) PE74(S3) S1F0(S3) PE75(S3) S1F0(S3) PE76(S3) S1F0(S3) PE77(S3) S1F0(S3) PE81(S3) S1F0(S3) PE82(S3) S1F0(S3) PE83(S3) S1F0(S3) PE84(S3) S1F0(S3) PE85(S3) S1F0(S3) PE86(S3) S1F0(S3) PE87(S3) S1F0(S3) PE91(S3) S1F0(S3) PE92(S3) S1F0(S3) PE93(S3) S1F0(S3) PE94(S3) S1F0(S3) PE95(S3) S1F0(S3) PE96(S3) S1F0(S3) PE97(S3) S1F0(S3) PEA1(S3) S1F0(S3) PEA2(S3) S1F0(S3) PEA3(S3) S1F0(S3) PEA4(S3) S1F0(S3) PEA5(S3) S1F0(S3) PEA6(S3) S1F0(S3) PEA7(S3) S1F0(S3) PEB1(S3) S1F0(S3) PEB2(S3) S1F0(S3) PEB3(S3) S1F0(S3) PEB4(S3) S1F0(S3) PEB5(S3) S1F0(S3) PEB6(S3) S1F0(S3) PEB7(S3) S1F0(S3) SLPB(S4) LID_(S4)
acpitimer0 at acpi0: 3579545 Hz, 24 bits
acpimadt0 at acpi0 addr 0xfee00000: PC-AT compat
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: AMD erratum 721 detected and fixed
cpu0: apic clock running at 65MHz
cpu1 at mainbus0: apid 1 (application processor)
cpu1: AMD Opteron(tm) Processor 6278 ("AuthenticAMD" 686-class, 2048KB L2 cache) 2.41 GHz
cpu1: FPU,V86,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,NXE,MMXX,FFXSR,LONG,SSE3,CX16,POPCNT,LAHF,EAPICSP,ABM,SSE4A,MASSE,3DNOWP,OSVW,ITSC
cpu2 at mainbus0: apid 2 (application processor)
cpu2: AMD Opteron(tm) Processor 6278 ("AuthenticAMD" 686-class, 2048KB L2 cache) 2.41 GHz
cpu2: FPU,V86,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,NXE,MMXX,FFXSR,LONG,SSE3,CX16,POPCNT,LAHF,EAPICSP,ABM,SSE4A,MASSE,3DNOWP,OSVW,ITSC
cpu3 at mainbus0: apid 3 (application processor)
cpu3: AMD Opteron(tm) Processor 6278 ("AuthenticAMD" 686-class, 2048KB L2 cache) 2.41 GHz
cpu3: FPU,V86,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,NXE,MMXX,FFXSR,LONG,SSE3,CX16,POPCNT,LAHF,EAPICSP,ABM,SSE4A,MASSE,3DNOWP,OSVW,ITSC
ioapic0 at mainbus0: apid 4 pa 0xfec00000, version 11, 24 pins
acpimcfg0 at acpi0 addr 0xf0000000, bus 0-127
acpihpet0 at acpi0: 14318179 Hz
acpiprt0 at acpi0: bus 0 (PCI0)
acpicpu0 at acpi0
acpicpu1 at acpi0
acpicpu2 at acpi0
acpicpu3 at acpi0
acpibat0 at acpi0: BAT1 not present
acpibat1 at acpi0: BAT2 not present
acpiac0 at acpi0: AC unit online
acpibtn0 at acpi0: SLPB
acpibtn1 at acpi0: LID_
bios0: ROM list: 0xc0000/0x8000 0xc8000/0x1e00! 0xca000/0x1000 0xdc000/0x4000! 0xe0000/0x8000!
vmt0 at mainbus0
pci0 at mainbus0 bus 0: configuration mode 1 (bios)
pci12 at ppb11 bus 12
ppb12 at pci0 dev 22 function 2 "VMware Virtual PCIE-PCIE" rev 0x01
pci13 at ppb12 bus 13
ppb13 at pci0 dev 22 function 3 "VMware Virtual PCIE-PCIE" rev 0x01
pci14 at ppb13 bus 14
ppb14 at pci0 dev 22 function 4 "VMware Virtual PCIE-PCIE" rev 0x01
pci15 at ppb14 bus 15
ppb15 at pci0 dev 22 function 5 "VMware Virtual PCIE-PCIE" rev 0x01
pci16 at ppb15 bus 16
ppb16 at pci0 dev 22 function 6 "VMware Virtual PCIE-PCIE" rev 0x01
pci17 at ppb16 bus 17
ppb17 at pci0 dev 22 function 7 "VMware Virtual PCIE-PCIE" rev 0x01
pci18 at ppb17 bus 18
ppb18 at pci0 dev 23 function 0 "VMware Virtual PCIE-PCIE" rev 0x01
pci19 at ppb18 bus 19
ppb19 at pci0 dev 23 function 1 "VMware Virtual PCIE-PCIE" rev 0x01
pci20 at ppb19 bus 20
ppb20 at pci0 dev 23 function 2 "VMware Virtual PCIE-PCIE" rev 0x01
pci21 at ppb20 bus 21
ppb21 at pci0 dev 23 function 3 "VMware Virtual PCIE-PCIE" rev 0x01
pci22 at ppb21 bus 22
ppb22 at pci0 dev 23 function 4 "VMware Virtual PCIE-PCIE" rev 0x01
pci23 at ppb22 bus 23
ppb23 at pci0 dev 23 function 5 "VMware Virtual PCIE-PCIE" rev 0x01
pci24 at ppb23 bus 24
ppb24 at pci0 dev 23 function 6 "VMware Virtual PCIE-PCIE" rev 0x01
pci25 at ppb24 bus 25
ppb25 at pci0 dev 23 function 7 "VMware Virtual PCIE-PCIE" rev 0x01
pci26 at ppb25 bus 26
ppb26 at pci0 dev 24 function 0 "VMware Virtual PCIE-PCIE" rev 0x01
pci27 at ppb26 bus 27
ppb27 at pci0 dev 24 function 1 "VMware Virtual PCIE-PCIE" rev 0x01
pci28 at ppb27 bus 28
ppb28 at pci0 dev 24 function 2 "VMware Virtual PCIE-PCIE" rev 0x01
pci29 at ppb28 bus 29
ppb29 at pci0 dev 24 function 3 "VMware Virtual PCIE-PCIE" rev 0x01
pci30 at ppb29 bus 30
ppb30 at pci0 dev 24 function 4 "VMware Virtual PCIE-PCIE" rev 0x01
pci31 at ppb30 bus 31
ppb31 at pci0 dev 24 function 5 "VMware Virtual PCIE-PCIE" rev 0x01
pci32 at ppb31 bus 32
ppb32 at pci0 dev 24 function 6 "VMware Virtual PCIE-PCIE" rev 0x01
pci33 at ppb32 bus 33
ppb33 at pci0 dev 24 function 7 "VMware Virtual PCIE-PCIE" rev 0x01
pci34 at ppb33 bus 34
isa0 at piixpcib0
isadma0 at isa0
com0 at isa0 port 0x3f8/8 irq 4: ns16550a, 16 byte fifo
com1 at isa0 port 0x2f8/8 irq 3: ns16550a, 16 byte fifo
pckbc0 at isa0 port 0x60/5
pckbd0 at pckbc0 (kbd slot)
pckbc0: using irq 1 for kbd slot
wskbd0 at pckbd0: console keyboard, using wsdisplay0
pms0 at pckbc0 (aux slot)
pckbc0: using irq 12 for aux slot
wsmouse0 at pms0 mux 0
pcppi0 at isa0 port 0x61
spkr0 at pcppi0
lpt0 at isa0 port 0x378/4 irq 7
npx0 at isa0 port 0xf0/16: reported by CPUID; using exception 16
fdc0 at isa0 port 0x3f0/6 irq 6 drq 2
fd0 at fdc0 drive 0: 1.44MB 80 cyl, 2 head, 18 sec
mtrr: Pentium Pro MTRR support
vscsi0 at root
scsibus2 at vscsi0: 256 targets
softraid0 at root
scsibus3 at softraid0: 256 targets
root on sd0a (116ad8073d3d50b2.a) swap on sd0b dump on sd0b
cpu1: AMD erratum 721 detected and fixed
cpu2: AMD erratum 721 detected and fixed
cpu3: AMD erratum 721 detected and fixed
wsdisplay0: screen 5 deleted
wsdisplay0: screen 5 added (80x50, vt100 emulation)
carp100: state transition: BACKUP -> MASTER
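
to test the cron theory against the relayd output above rather than eyeballing it, the state changes can be bucketed by minute of the hour and the check durations compared per transition. this is a rough sketch only: down_by_minute and ms_by_transition are names made up for illustration, and the parsing assumes exactly the log line format shown in this post.

```sh
#!/bin/sh
# sketch: summarize relayd host state changes from syslog lines on stdin.
# down_by_minute and ms_by_transition are made-up names; the parsing
# assumes the log format shown in this post.

# count "up -> down" events per minute-of-hour. field 3 is HH:MM:SS,
# with or without the "/var/log/authlog:" prefix that grep adds.
down_by_minute() {
    awk '/state up -> down/ {
             split($3, t, ":")
             n[t[2]]++
         }
         END { for (m in n) print m, n[m] }' | sort
}

# print the min and max check duration (ms) for each kind of transition
ms_by_transition() {
    awk 'match($0, /\([0-9]+ms\)/) {
             ms = substr($0, RSTART + 1, RLENGTH - 4) + 0
             if ($0 ~ /state up -> down/)      k = "up->down"
             else if ($0 ~ /state down -> up/) k = "down->up"
             else next
             if (!(k in min) || ms < min[k]) min[k] = ms
             if (!(k in max) || ms > max[k]) max[k] = ms
         }
         END { for (k in min) print k, min[k], max[k] }' | sort
}
```

usage would be along the lines of `grep relayd /var/log/authlog | down_by_minute` and `grep relayd /var/log/authlog | ms_by_transition`. in the excerpt above, every "up -> down" line reports 200ms or more and every "down -> up" line reports under 130ms. as far as i can tell from relayd.conf(5), the check timeout defaults to 200 milliseconds, so these may be slow checks tripping that timeout rather than hard outages; raising the global `timeout` value in relayd.conf would be one thing to try.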