Broadcast behavior in 4.7 [Was: Re: Trying to set diskless(8) -- hanging in "RPC timeout for server"]
I just happened to run into the same issue right after upgrading to 4.7 (though you mention 4.6, so I'm not certain we're dealing with the same cause). Basically, the issue I'm seeing is that portmap/rpc.bootparamd don't see the incoming packets for 172.16.255.255 (my own network being 172.16.5.0/25, so the broadcast is 172.16.5.127).

There were some changes made to sys/netinet/in.c, especially rev 1.56. As far as I know, the diskless machine cannot learn its netmask through RARP, so it will assume a netmask based on the class of the network it is in, hence the 172.16.255.255 broadcast. Before rev 1.56 of netinet/in.c, it seems the kernel would accept broadcasts for the broadcast address associated with your network "class"; or at least that's the behavior I observe when running "portmap -d". After updating to 1.56 and later, portmap/rpc.bootparamd don't see the requests for 172.16.255.255.

As a workaround, I succeeded by either keeping a 4.6 kernel around to answer the bootparam requests, or forcing a broadcast address of 172.16.255.255 on the bootparamd server. Not particularly clean, but it did the trick. As for a permanent fix, I am unsure. I don't know of any way other than RARP to do diskless in OpenBSD, at least on i386/amd64.

Any thoughts?

-- Pascal

On Wed, May 12, 2010 at 12:30:39AM +0200, Stefan Unterweger wrote:
> * Fred Crowson on Tue, May 11, 2010 at 10:43:09PM +0100:
> > What does your dhcpd.conf look like on your server?
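The classful fallback the diskless client applies can be sketched as follows. This is a hypothetical illustration of the rule (the function name is mine, not kernel code): an address whose first octet is in 128-191 is class B, so without a real netmask the client assumes /16 and derives the all-ones broadcast from that, which is why a 172.16.x.y host ends up broadcasting to 172.16.255.255.

```python
import ipaddress

def classful_broadcast(addr: str) -> str:
    """Broadcast address a host would assume from the historical
    address class alone, without knowing the real netmask."""
    first = int(addr.split(".")[0])
    if first < 128:        # class A -> /8
        prefix = 8
    elif first < 192:      # class B -> /16
        prefix = 16
    elif first < 224:      # class C -> /24
        prefix = 24
    else:
        raise ValueError("class D/E address has no classful broadcast")
    net = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
    return str(net.broadcast_address)

# A 172.16.5.x client that can't learn its /25 mask via RARP falls
# back to the class B mask, hence the odd broadcast address:
print(classful_broadcast("172.16.5.138"))   # -> 172.16.255.255
```

This is also why forcing 172.16.255.255 on the bootparamd server works: it makes the server accept the broadcast the client computed under its classful assumption.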
>
> I have several subnets served via DHCP, so I have reported only
> the relevant one together with the global options:
>
> | server-name "Neu-Sorpigal";
> | option domain-name "intranet.aleturo.com";
> | default-lease-time 86400;
> |
> | shared-network wired {
> |     option domain-name "wired.intranet.aleturo.com";
> |     option domain-name-servers 172.23.12.2;
> |     option netbios-name-servers 172.23.12.2;
> |     option routers 172.23.12.2;
> |
> |     filename "pxeboot";
> |     next-server 172.23.12.2;
> |     option root-path "/export/client/";
> |
> |     subnet 172.23.0.0 netmask 255.255.0.0 {
> |         allow unknown-clients;
> |         range 172.23.13.128 172.23.13.254;
> |     }
> | }
>
> I've added the options "next-server" and "root-path" just now,
> since I've seen mention of it in pxeboot(8). Prior to that, only
> the "filename" directive was there. Everything else however,
> including the tcpdumps, is not impressed by that.
>
> > It might be worth having -vv and -X on your tcpdump it might provide
> > more info as to the problem.
>
> I didn't include the dump from phase 2, where pxeboot and the
> kernel are served by tftp and whatelse, since that's an insane
> amount of data. This tcpdump was started just before the kernel
> tried to connect to NFS, that is, before the second burst.
>
> | $ tcpdump -X -vv -n -s 160 -i em0 host 172.23.13.138
> | tcpdump: listening on em0, link-type EN10MB
> | 00:19:48.612571 rarp reply 00:00:e2:87:e8:76 at 172.23.13.138
> |       : 0001 0800 0604 0004 000e 0c06 be26 ac17
> |   0010: 0c02 e287 e876 ac17 0d8a
> |
> | 00:19:48.613207 arp who-has 172.23.13.138 tell 172.23.13.138
> |       : 0001 0800 0604 0001 e287 e876 ac17
> |   0010: 0d8a ac17 0d8a
> |   0020:
> |
> | 00:19:48.630322 172.23.13.138.718 > 172.23.255.255.111: [udp sum ok] udp 96 (ttl 64, id 65499, len 124)
> |       : 4500 007c ffdb 4011 14dd ac17 0d8a
> |   0010: ac17 02ce 006f 0068 eac4 90ad 0bca
> |   0020: 0002 0001 86a0 0002
> |   0030: 0005 0001 0014
> |   0040:
> |   0050: 0001 86ba 0001
> |   0060: 0001 0014 0001 00ac
> |   0070: 0017 000d 008a
> |
> | 00:19:49.620480 172.23.13.138.718 > 172.23.255.255.111: [udp sum ok] udp 96 (ttl 64, id 60019, len 124)
> |       : 4500 007c ea73 4011 2a45 ac17 0d8a
> |   0010: ac17 02ce 006f 0068 eac4 90ad 0bca
> |   0020: 0002 0001 86a0 0002
> |   0030: 0005 0001 0014
> |   0040:
> |   0050: 0001 86ba 0001
> |   0060: 0001 0014 0001 00ac
> |   0070: 0017 000d 008a
> |
> | 00:19:51.620513 172.23.13.138.718 > 172.23.255.255.111: [udp sum ok] udp 96 (ttl 64, id 63711, len 124)
> |       : 4500 007c f8df 4011 1bd9 ac17 0d8a
> |   0010: ac17 02ce 006f 0068 eac4 90ad 0bca
> |   0020: 0002
Re: Removing pf_pool
On Wed, Jan 13, 2010 at 01:58:30PM +0900, Ryan McBride wrote:
>
> My first thought is to wonder why you're not running with a symmetrical
> cluster. But I realise that we are not always in control of such things,
> and one of PF's functions is to help people work around bad network
> design.

Right on. We depend heavily on "weights". We have a site that receives many hits/sec, with a bunch of dual quad-cores behind it, processing heavy pages (which we have no control over ;-). Even though most have the same amount of RAM and cores, a difference in processor model will require such a weight adjustment to keep a machine from going overboard. We're tight on resources, both computing and monetary... a common story, I suppose.

> There are a few things you can do here to get a similar effect.
>
> 2) Use the 'probability' keyword
>
>    pass quick on em0 inet proto tcp from any to 192.168.100.100 \
>        probability 50% rdr-to 10.0.0.1
>    pass quick on em0 inet proto tcp from any to 192.168.100.100 \
>        probability 70% rdr-to 10.0.0.2
>    pass quick on em0 inet proto tcp from any to 192.168.100.100 \
>        rdr-to 10.0.0.3

I hadn't thought of this one. It might be a good solution for us. Thanks for the tip.

> The changes just committed are actually cleanup that needs to happen if
> you want to see some more intelligent weighted load balancing in PF than
> these hacks. But that is still a fair ways off, definitely after 4.7.

Still, I'm very glad to hear that the idea has been floating around.

Thanks a lot,
-- Pascal
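For anyone wanting to translate real weights into such a probability cascade: since each "quick" rule only sees the traffic that fell through the rules before it, the per-rule probability is the node's weight divided by the weights still remaining. A small sketch (helper names are mine, not part of pf):

```python
def weights_to_probabilities(weights):
    """Convert desired traffic shares into the 'probability' values
    for a cascade of 'pass quick ... probability P% rdr-to' rules.
    Each rule only sees traffic that fell through the previous ones,
    so its probability is weight / (sum of remaining weights).
    The last rule carries no probability (it takes all the rest)."""
    probs = []
    remaining = sum(weights)
    for w in weights[:-1]:
        probs.append(w / remaining)
        remaining -= w
    return probs

def probabilities_to_shares(probs):
    """Effective share of total traffic each rule in the cascade gets."""
    shares, leftover = [], 1.0
    for p in probs:
        shares.append(leftover * p)
        leftover *= (1 - p)
    shares.append(leftover)
    return shares

# Ryan's example: 50% then 70% then catch-all works out to 50/35/15.
print(weights_to_probabilities([50, 35, 15]))   # -> [0.5, 0.7]
print(probabilities_to_shares([0.5, 0.7]))      # -> [0.5, 0.35, 0.15]
```

This makes explicit why the second rule says 70% and not 35%: it only sees the half of the traffic that the first rule let through.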
Removing pf_pool
I just caught the following from openbsd-cvs:

http://marc.info/?l=openbsd-cvs&m=126326657232193&w=2

If my understanding is correct, this means that it will become impossible to emulate weighted round robin with constructs like the one below, since duplicate IPs will be "flattened" once converted to a standard PF table?

rdr on em0 inet proto tcp \
	from any to 192.168.100.100 port = www -> { 10.0.0.1, 10.0.0.1, 10.0.0.1, \
	                                            10.0.0.2, 10.0.0.2, \
	                                            10.0.0.3 \
	                                          } round-robin

Is this right?

-- Pascal
Re: ifstated with carp0
On Mon, Sep 28, 2009 at 08:06:36AM +0200, Laurent CARON wrote:
> On 28/09/2009 04:28, Steven Surdock wrote:
>> ...
>> HERE IS IFSTATED DETECTING THE FAILOVER, WHICH SHOULD HAVE HAPPENED ON
>> SEP 25, BUT DIDN'T
>> Sep 26 14:19:03 fw2 ifstated[16189]: changing state to normal
>> Sep 26 14:19:03 fw2 ifstated[16189]: running date|mail -s 'FW2 is now
>> the backup firewall' root
<...>
>
> I feel happy not to be the only one experiencing this behavior, although
> this might be a config error on both sides ;)

This looks quite familiar to me as well. Have a look here:

http://marc.info/?l=openbsd-misc&m=124942995116023&w=2

Could you try testing CARP failover and monitoring with "route -n monitor"? If it's really a bug in ifstated, route monitor should catch all the state changes, I suppose. But in my case it didn't. It would be nice if someone else could confirm the behavior I'm getting.

Thanks,
-- Pascal
Re: ifstated with multiple CARP interfaces
On Tue, Aug 04, 2009 at 01:20:17AM +, Stuart Henderson wrote:
> I don't understand what you mean by "VLAN on carp1", can you explain it
> a bit more please?

My bad. I confused things a little. It's as you say: carpdevs set to vlan interfaces. In this case, carp1010 and carp1011 have vlan1010 and vlan1011 respectively as carpdevs, and those vlans both have em3 as their parent interface. There is also carp3, which has em3 as its carpdev. So:

carp0    (index 11) carpdev em0
carp1    (index 12) carpdev em1
carp1010 (index 13) carpdev vlan1010 (a vlan on em3)
carp1011 (index 14) carpdev vlan1011 (another vlan on em3)
carp2    (index 15) carpdev em2
carp3    (index 16) carpdev em3

> Do you see the same result from other software e.g. "route -n monitor"?
> (use recent -current [or, if your dns is ok, remove the -n option] to
> display interface names rather than index numbers)

Neat. I didn't know about route monitor. So I tried again with "route -n monitor", with 4.5 GENERIC.MP. Here's the output after setting it to master:

got message of size 208 on Tue Aug  4 19:45:06 2009
RTM_IFINFO: iface status change: len 208, if# 12, link: master, flags:
got message of size 104 on Tue Aug  4 19:45:06 2009
RTM_DELETE: Delete Route: len 104, priority 0, table 0, pid: 0, seq 0, errno 3
   flags: locks: inits:
   sockaddrs: 10.0.1.250
got message of size 120 on Tue Aug  4 19:45:06 2009
RTM_ADD: Add Route: len 120, priority 0, table 0, pid: 0, seq 0, errno 0
   flags: locks: inits:
   sockaddrs: 10.0.1.250 10.0.1.250
got message of size 100 on Tue Aug  4 19:45:06 2009
RTM_NEWADDR: address being added to iface: len 100, metric 0, flags:
   sockaddrs: ::::: 00:00:5e:00:01:78 fe80::200:5eff:fe00:178%carp1
got message of size 136 on Tue Aug  4 19:45:06 2009
RTM_ADD: Add Route: len 136, priority 0, table 0, pid: 0, seq 0, errno 0
   flags: locks: inits:
   sockaddrs: fe80::200:5eff:fe00:178%carp1 00:00:5e:00:01:78
got message of size 104 on Tue Aug  4 19:45:06 2009
RTM_DELETE: Delete Route: len 104, priority 0, table 0, pid: 0, seq 0, errno 3
   flags: locks: inits:
   sockaddrs: 10.0.1.15
got message of size 132 on Tue Aug  4 19:45:06 2009
RTM_ADD: Add Route: len 132, priority 0, table 0, pid: 0, seq 0, errno 17
   flags: locks: inits:
   sockaddrs: 10.0.1.0 10.0.1.15 255.255.255.0 default
got message of size 208 on Tue Aug  4 19:45:06 2009
RTM_IFINFO: iface status change: len 208, if# 13, link: master, flags:
got message of size 104 on Tue Aug  4 19:45:06 2009
RTM_DELETE: Delete Route: len 104, priority 0, table 0, pid: 0, seq 0, errno 3
   flags: locks: inits:
   sockaddrs: 10.0.10.250
got message of size 120 on Tue Aug  4 19:45:06 2009
RTM_ADD: Add Route: len 120, priority 0, table 0, pid: 0, seq 0, errno 0
   flags: locks: inits:
   sockaddrs: 10.0.10.250 10.0.10.250
got message of size 104 on Tue Aug  4 19:45:06 2009
RTM_NEWADDR: address being added to iface: len 104, metric 0, flags:
   sockaddrs: ::::: 00:00:5e:00:01:6e fe80::200:5eff:fe00:16e%carp1010
got message of size 136 on Tue Aug  4 19:45:06 2009
RTM_ADD: Add Route: len 136, priority 0, table 0, pid: 0, seq 0, errno 0
   flags: locks: inits:
   sockaddrs: fe80::200:5eff:fe00:16e%carp1010 00:00:5e:00:01:6e
got message of size 208 on Tue Aug  4 19:45:06 2009
RTM_IFINFO: iface status change: len 208, if# 11, link: master, flags:
got message of size 104 on Tue Aug  4 19:45:06 2009
RTM_DELETE: Delete Route: len 104, priority 0, table 0, pid: 0, seq 0, errno 3
   flags: locks: inits:
   sockaddrs: 10.137.16.192
got message of size 120 on Tue Aug  4 19:45:06 2009
RTM_ADD: Add Route: len 120, priority 0, table 0, pid: 0, seq 0, errno 0
   flags: locks: inits:
   sockaddrs: 10.137.16.192 10.137.16.192
got message of size 100 on Tue Aug  4 19:45:06 2009
RTM_NEWADDR: address being added to iface: len 100, metric 0, flags:
   sockaddrs: ::::: 00:00:5e:00:01:6e fe80::200:5eff:fe00:16e%carp0
got message of size 136 on Tue Aug  4 19:45:06 2009
RTM_ADD: Add Route: len 136, priority 0, table 0, pid: 0, seq 0, errno 0
   flags: locks: inits:
   sockaddrs: fe80::200:5eff:fe00:16e%carp0 00:00:5e:00:01:6e
got message of size 208 on Tue Aug  4 19:45:06 2009
RTM_IFINFO: iface status change: len 208, if# 14, link: master, flags:

And back to slave:

got message of size 208 on Tue Aug  4 19:45:32 2009
RTM_IFINFO: iface status change: len 208, if# 11, link: backup, flags:
got message of size 104 on Tue Aug  4 19:45:32 2009
RTM_DELETE: Delete Route: len 104, priority 0, table 0, pid: 0, seq 0, errno 0
   flags: locks: inits:
   sockaddrs: 10.137.16.192
got message of size 136 on Tue Aug  4 19:45:32 2009
RTM_DELETE: Delete Route: len 136, priority 0, table 0, pid: 0, seq 0, errno 0
   flags: locks: inits:
   sockaddrs: fe80::200:5eff:fe00:16e%carp0 00:00:5e:00:01:6e
got message of size 100 on Tue Aug  4 19:45:32 2009
RTM_DELADDR: address being removed from iface: len 100, metric 0, flags:
   sockaddrs: ::::: 00:00:5e:00:01:6e f
ifstated with multiple CARP interfaces
Hello,

we have a problem with ifstated detecting state changes on multiple CARP interfaces. After digging deeper, it seems that reading on the routing socket does not give us all the state changes that we'd expect. We tried with the latest snapshot kernel and got the same behavior.

Our CARP interfaces are as follows:

carp0
carp1
carp1010 (VLAN on carp1)
carp1011 (Another VLAN on carp1)
carp2
carp3

The condition we'd like to test is:

carp_up = 'carp0.link.up \
	&& carp1.link.up \
	&& carp1010.link.up \
	&& carp1011.link.up \
	&& carp2.link.up \
	&& carp3.link.up'

Doing a little check using a quick C program (see below), and playing with the demote counter, we can clearly see why the condition is not always met as it should be:

# ./getifinfo &
[1] 20942
# ifconfig -g carp carpdemote 50
carp0 -> LINK_STATE_DOWN
carp2 -> LINK_STATE_DOWN
carp3 -> LINK_STATE_DOWN
carp1010 -> LINK_STATE_DOWN
carp1011 -> LINK_STATE_DOWN
# ifconfig -g carp -carpdemote 50
carp1 -> LINK_STATE_UP
carp1010 -> LINK_STATE_UP
carp0 -> LINK_STATE_UP
carp2 -> LINK_STATE_UP
carp1011 -> LINK_STATE_UP
carp3 -> LINK_STATE_UP
# ifconfig -g carp carpdemote 50
carp0 -> LINK_STATE_DOWN
carp1 -> LINK_STATE_DOWN
carp2 -> LINK_STATE_DOWN
carp3 -> LINK_STATE_DOWN
carp1010 -> LINK_STATE_DOWN
# ifconfig -g carp -carpdemote 50
carp0 -> LINK_STATE_UP
carp1 -> LINK_STATE_UP
carp1010 -> LINK_STATE_UP
carp1011 -> LINK_STATE_UP
#

The question is: is it normal that the routing socket doesn't report all changes? Is anyone else having similar issues?
Thanks in advance,
-- Pascal

getifinfo.c (the angle-bracketed header names were eaten by the list archive; the includes below are the standard ones this code needs):

#include <sys/types.h>
#include <sys/socket.h>
#include <net/if.h>
#include <net/route.h>
#include <err.h>
#include <ifaddrs.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

char *if_states[] = {
	"LINK_STATE_UNKNOWN",
	"LINK_STATE_DOWN",
	"LINK_STATE_UP",
	"LINK_STATE_HALF_DUPLEX",
	"LINK_STATE_FULL_DUPLEX"
};

int
main(int argc, char **argv)
{
	int rt_fd;
	char msg[2048];
	struct ifaddrs *ifap, *ifa;
	struct rt_msghdr *rtm = (struct rt_msghdr *)&msg;
	struct if_msghdr *ifm = (struct if_msghdr *)&msg;
	int len;
	char ifs[64][16];

	/* Map interface indexes to names. */
	if (getifaddrs(&ifap))
		err(1, "getifaddrs");
	for (ifa = ifap; ifa != NULL; ifa = ifa->ifa_next)
		strlcpy(ifs[if_nametoindex(ifa->ifa_name)], ifa->ifa_name, 16);
	freeifaddrs(ifap);

	if ((rt_fd = socket(PF_ROUTE, SOCK_RAW, 0)) < 0)
		err(1, "no routing socket");

	/* Print every RTM_IFINFO link state change seen on the socket. */
	while ((len = read(rt_fd, msg, sizeof(msg))) > 0) {
		if (len < sizeof(struct rt_msghdr)) {
			warnx("len < sizeof(struct rt_msghdr)");
			continue;
		}
		if (rtm->rtm_version != RTM_VERSION)
			continue;
		if (rtm->rtm_type != RTM_IFINFO)
			continue;
		printf("%s -> %s\n", ifs[ifm->ifm_index],
		    if_states[ifm->ifm_data.ifi_link_state]);
	}
	return 0;
}
Re: state key linking mismatch w/GRE, since 4.5
On Fri, Jun 12, 2009 at 05:56:43AM +0200, Henning Brauer wrote:
> * Pascal Lalonde [2009-06-12 00:28]:
> > Jun 11 18:08:19 celeborn /bsd: pf: state key linking mismatch! dir=OUT,
> > if=bge0, stored af=2, a0: 10.136.192.199:30285, a1: 10.216.8.1:22,
> > proto=6, found af=2, a0: AAA.AAA.AAA.AAA, a1: BBB.BBB.BBB.BBB, proto=47.
> > Jun 11 18:08:21 celeborn /bsd: pf: state key linking mismatch! dir=OUT,
> > if=bge0, stored af=2, a0: 10.136.248.119:42137, a1: 10.137.0.130:993,
> > proto=6, found af=2, a0: AAA.AAA.AAA.AAA, a1: BBB.BBB.BBB.BBB, proto=47.
>
> fixed in -current and no need to worry really

A small followup on this, for people who happen to run into the same problem. We were just bitten by this issue. With our smaller VPN gateways (<10 flows with ESP/GRE), the extra logging didn't cause any issues. But once we upgraded our main VPN endpoint (roughly 176 flows) to 4.5, it seems it didn't like the amount of printf()s generated; the load would make it unusable, causing CARP flapping, with very high (>80%) interrupt time. Fortunately we still had our other node on 4.4 to fall back to.

I can confirm that on our test setup with a -current kernel, those messages don't show up anymore. In the meantime, we applied the following to let us control whether we wish to see those warnings or not:

--- sys/net/pf.c.orig	Tue Jun 30 18:13:34 2009
+++ sys/net/pf.c	Tue Jun 30 18:44:00 2009
@@ -860,19 +860,22 @@
 			return (0);
 		else {
 			/* mismatch. must not happen. */
-			printf("pf: state key linking mismatch! dir=%s, "
-			    "if=%s, stored af=%u, a0: ",
-			    dir == PF_OUT ? "OUT" : "IN", kif->pfik_name, a->af);
-			pf_print_host(&a->addr[0], a->port[0], a->af);
-			printf(", a1: ");
-			pf_print_host(&a->addr[1], a->port[1], a->af);
-			printf(", proto=%u", a->proto);
-			printf(", found af=%u, a0: ", b->af);
-			pf_print_host(&b->addr[0], b->port[0], b->af);
-			printf(", a1: ");
-			pf_print_host(&b->addr[1], b->port[1], b->af);
-			printf(", proto=%u", b->proto);
-			printf(".\n");
+			if (pf_status.debug >= PF_DEBUG_MISC) {
+				printf("pf: state key linking mismatch! dir=%s, "
+				    "if=%s, stored af=%u, a0: ",
+				    dir == PF_OUT ? "OUT" : "IN", kif->pfik_name,
+				    a->af);
+				pf_print_host(&a->addr[0], a->port[0], a->af);
+				printf(", a1: ");
+				pf_print_host(&a->addr[1], a->port[1], a->af);
+				printf(", proto=%u", a->proto);
+				printf(", found af=%u, a0: ", b->af);
+				pf_print_host(&b->addr[0], b->port[0], b->af);
+				printf(", a1: ");
+				pf_print_host(&b->addr[1], b->port[1], b->af);
+				printf(", proto=%u", b->proto);
+				printf(".\n");
+			}
 			return (-1);
 		}
 	}

-- Pascal
Weighted round robin
Hello,

I was wondering how to achieve some kind of weighted round-robin with OpenBSD. So far, we can achieve limited weighted round-robin by using rdr's with lists, repeating the stronger nodes in the list. (Is there a limit to the number of nodes in a list?) This is what we use on our webserver pool. Having this kind of control is necessary, since pretty much every unused machine we have ends up in the webserver pool to meet growing demand.

The only sad thing is that we can't use relayd to automate removal of failed nodes, because we don't use tables. We're planning to resort to a scripted solution that generates a ruleset based on our own host checks (using rdr lists, anchors, etc.). But before taking this path, I was wondering if people on misc@ had found clever ways of achieving the same result without too much scripting.

Thanks in advance,
-- Pascal
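To make the duplicate-entry trick concrete, here is a small simulation (illustrative only; pf's own round-robin walks the address list the same way): a node listed N times in the rdr list receives N shares of each rotation, giving a 3:2:1 split for the pool below.

```python
from itertools import cycle, islice

# Emulating weighted round-robin by repeating stronger nodes in the
# rdr list: a node listed N times gets N slots per rotation.
pool = ["10.0.0.1", "10.0.0.1", "10.0.0.1",
        "10.0.0.2", "10.0.0.2",
        "10.0.0.3"]

# Hand out 60 consecutive connections round-robin over the list.
assignments = list(islice(cycle(pool), 60))
counts = {h: assignments.count(h) for h in set(pool)}
print(counts)   # -> {'10.0.0.1': 30, '10.0.0.2': 20, '10.0.0.3': 10}
```

This is also why flattening duplicates into a table destroys the weighting: with one entry per address, the same 60 connections would split 20/20/20.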
Re: state key linking mismatch w/GRE, since 4.5
On Fri, Jun 12, 2009 at 05:56:43AM +0200, Henning Brauer wrote:
> * Pascal Lalonde [2009-06-12 00:28]:
> > Jun 11 18:08:19 celeborn /bsd: pf: state key linking mismatch! dir=OUT,
> > if=bge0, stored af=2, a0: 10.136.192.199:30285, a1: 10.216.8.1:22,
> > proto=6, found af=2, a0: AAA.AAA.AAA.AAA, a1: BBB.BBB.BBB.BBB, proto=47.
> > Jun 11 18:08:21 celeborn /bsd: pf: state key linking mismatch! dir=OUT,
> > if=bge0, stored af=2, a0: 10.136.248.119:42137, a1: 10.137.0.130:993,
> > proto=6, found af=2, a0: AAA.AAA.AAA.AAA, a1: BBB.BBB.BBB.BBB, proto=47.
>
> fixed in -current and no need to worry really

Good to hear! Many thanks,
-- Pascal
state key linking mismatch w/GRE, since 4.5
Hello,

recently we upgraded some of our firewalls from OpenBSD 4.4 to 4.5. Since then, we've been getting loads of the following message (external addresses substituted with AAA's and BBB's):

Jun 11 18:08:19 celeborn /bsd: pf: state key linking mismatch! dir=OUT,
if=bge0, stored af=2, a0: 10.136.192.199:30285, a1: 10.216.8.1:22,
proto=6, found af=2, a0: AAA.AAA.AAA.AAA, a1: BBB.BBB.BBB.BBB, proto=47.
Jun 11 18:08:21 celeborn /bsd: pf: state key linking mismatch! dir=OUT,
if=bge0, stored af=2, a0: 10.136.248.119:42137, a1: 10.137.0.130:993,
proto=6, found af=2, a0: AAA.AAA.AAA.AAA, a1: BBB.BBB.BBB.BBB, proto=47.

Relevant states, taken right after the errors showed up in syslog:

all gre BBB.BBB.BBB.BBB <- AAA.AAA.AAA.AAA           MULTIPLE:MULTIPLE
all tcp 10.216.8.1:22 <- 10.136.192.199:30285        ESTABLISHED:ESTABLISHED
all tcp 10.136.192.199:30285 -> 10.216.8.1:22        ESTABLISHED:ESTABLISHED
all tcp 10.137.0.130:993 <- 10.136.248.119:42137     FIN_WAIT_2:FIN_WAIT_2
all tcp 10.136.248.119:42137 -> 10.137.0.130:993     FIN_WAIT_2:FIN_WAIT_2

gre25: flags=9011 mtu 1476
	description: TUNNELING-10/8
	priority: 0
	groups: gre
	physical address inet BBB.BBB.BBB.BBB --> AAA.AAA.AAA.AAA
	inet6 fe80::204:23ff:feb1:73c4%gre25 -> prefixlen 64 scopeid 0x12
	inet 192.168.253.136 --> 192.168.136.253 netmask 0x

Internet:
Destination         Gateway            Flags   Refs      Use  Mtu  Prio Iface
default             BBB.BBB.BBB.CCC    UGS        4  1317018    -     8 bge0
10/8                192.168.136.253    UGS        0   769241    -     8 gre25
10.136.248/21       link#4             UC        14        0    -     4 em3
BBB.BBB.BBB.0/27    link#9             UC        11        0    -     4 bge0
...
Status: Enabled for 0 days 02:24:21             Debug: Urgent

State Table                          Total             Rate
  current entries                     6281
  searches                        14179937         1637.2/s
  inserts                           586841           67.8/s
  removals                          580560           67.0/s
Counters
  match                             498717           57.6/s
  bad-offset                             0            0.0/s
  fragment                               0            0.0/s
  short                                  0            0.0/s
  normalize                              0            0.0/s
  memory                                 0            0.0/s
  bad-timestamp                          0            0.0/s
  congestion                             0            0.0/s
  ip-option                              0            0.0/s
  proto-cksum                            0            0.0/s
  state-mismatch                        28            0.0/s
  state-insert                           5            0.0/s
  state-limit                            0            0.0/s
  src-limit                              0            0.0/s
  synproxy                               0            0.0/s

This is happening only on firewalls where we use GRE tunnels. I guess that rev. 1.618 of pf.c, which was added in 4.5, is causing those messages to appear. But we're not experiencing any network problems despite the errors. The ruleset being a bit lengthy, I left it out, but I can send it on demand.

Is there any need to worry about those errors?

Thanks,
-- Pascal
Re: relayd - Hosts flapping unexpectedly
On Thu, May 21, 2009 at 11:05:40AM +0100, Dan Carley wrote:
>
> We've been playing with relayd recently - both from 4.5 and the latest
> snapshot.
>
> Approximately every hour we are seeing one or two state changes logged. But
> I can't see reason for the change of state and there doesn't appear to be a
> pattern in the way that the hosts are failed.

We just happened to notice the same thing here. Here's the info I could gather on this, but I suspect the problem might not be relayd itself. My relayd configuration is as follows (the table name shows up as "floods" in the relayctl output):

relayd.conf:

interval 5
log updates
timeout 3000

table <floods> { 10.0.1.10 10.0.2.10 10.0.10.10 }

redirect test2 {
	listen on 10.0.1.15 port 30099
	forward to <floods> check tcp
}

redirect test {
	listen on 10.137.16.192 port 30100
	forward to <floods> check tcp
}

# relayctl show summary
Id	Type		Name			Avlblty		Status
1	redirect	test2				active
1	table		floods:30099			active (3 hosts)
1	host		10.0.1.10	100.00%		up
2	host		10.0.2.10	100.00%		up
3	host		10.0.10.10	100.00%		up
2	redirect	test				active
2	table		floods:30100			active (3 hosts)
4	host		10.0.1.10	100.00%		up
5	host		10.0.2.10	100.00%		up
6	host		10.0.10.10	100.00%		up

Now, at random times (one or two per hour on average), we get the following errors in the logs:

May 26 18:00:31 testfw1 relayd[25554]: host 10.0.1.10, check tcp (0ms), state up -> down, availability 99.92%
May 26 18:00:36 testfw1 relayd[25554]: host 10.0.1.10, check tcp (0ms), state down -> up, availability 99.92%

But we can confirm that the service does not actually go down. The firewalls are redundant with the same relayd config, and they don't see the service going down at the same time (they do, however, both exhibit the same up/down behavior). Adding some debugging code to relayd, I found that connect() returns EADDRINUSE at check_tcp.c:87. This seemed strange at first, since a few lines above, SO_REUSEPORT is set on the socket. Also, the firewalls used to test this are almost idle, with fewer than 100 sockets at a time, mostly used by relayd performing TCP checks.
So we're clearly not running out of ephemeral ports. Just for the sake of trying, I took the CVS source for relayd, commented out the SO_REUSEPORT option, recompiled and restarted it. Strangely, the up/downs are now gone. I would expect SO_REUSEPORT to prevent EADDRINUSE errors, so I'm a bit puzzled... Could anyone help shed light on this?

Thanks,
-- Pascal
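If it helps anyone reason about this: SO_REUSEPORT relaxes the bind()-time check, but it cannot make connect() succeed when the resulting 4-tuple (src addr, src port, dst addr, dst port) would duplicate an existing connection. A quick loopback sketch (generic POSIX sockets in Python, not relayd's code; the exact errno varies by system):

```python
import socket

# A local listener to connect to.
lis = socket.socket()
lis.bind(("127.0.0.1", 0))
lis.listen(8)
dst = lis.getsockname()

def client(lport=0):
    s = socket.socket()
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", lport))
    return s

c1 = client()
lport = c1.getsockname()[1]
c1.connect(dst)

# SO_REUSEPORT lets this bind() to the same local port succeed...
c2 = client(lport)
try:
    # ...but connecting to the same destination would duplicate the
    # 4-tuple of c1's connection, so the kernel must refuse it.
    c2.connect(dst)
    failed = False
except OSError:
    failed = True
print("second connect refused:", failed)
```

So repeatedly checking the same backend address:port from a reused local port can collide with a previous check's connection still sitting in TIME_WAIT, SO_REUSEPORT or not; that would be consistent with the intermittent EADDRINUSE seen above, though I can't confirm that is what relayd hits.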
relayctl host disable doesn't loop through all hosts
Hello,

I've been playing with relayd lately. There is a behavior which seems unintuitive, and I was wondering whether it is a bug or the intended behavior. When I try to disable a host (e.g. "relayctl host disable 10.0.1.101"), and that host is part of more than one table, only the first occurrence gets disabled. I'm testing with relayd from the Feb 28th snapshot. I would expect it to disable all occurrences, since disabling by ID already lets you choose specific instances of that host.

# relayctl show summary
Id	Type		Name			Avlblty		Status
1	redirect	test				active
1	table		test:8080			active (3 hosts)
1	host		10.0.1.101	100.00%		up
2	host		10.0.1.102	100.00%		up
3	host		10.0.1.103	100.00%		up
2	redirect	test2				active
2	table		test2:3				active (6 hosts)
4	host		10.0.1.101	100.00%		up
5	host		10.0.1.102	100.00%		up
6	host		10.0.1.103	100.00%		up
7	host		10.0.1.104	100.00%		up
8	host		10.0.1.105	100.00%		up
9	host		10.0.1.106	100.00%		up
# relayctl host disable 10.0.1.101
command succeeded
# relayctl show summary
Id	Type		Name			Avlblty		Status
1	redirect	test				active
1	table		test:8080			active (2 hosts)
1	host		10.0.1.101			disabled
2	host		10.0.1.102	100.00%		up
3	host		10.0.1.103	100.00%		up
2	redirect	test2				active
2	table		test2:3				active (6 hosts)
4	host		10.0.1.101	100.00%		up
5	host		10.0.1.102	100.00%		up
6	host		10.0.1.103	100.00%		up
7	host		10.0.1.104	100.00%		up
8	host		10.0.1.105	100.00%		up
9	host		10.0.1.106	100.00%		up

Thanks in advance!
-- Pascal
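As a stopgap, one could script the per-instance disable by parsing the summary and disabling every matching host ID. A sketch (the helper is hypothetical and assumes the column layout shown above; each returned ID would be fed to "relayctl host disable <id>"):

```python
# Sample "relayctl show summary" output, as in the report above.
summary = """\
1 redirect test active
1 table test:8080 active (3 hosts)
1 host 10.0.1.101 100.00% up
2 host 10.0.1.102 100.00% up
3 host 10.0.1.103 100.00% up
2 redirect test2 active
2 table test2:3 active (6 hosts)
4 host 10.0.1.101 100.00% up
5 host 10.0.1.102 100.00% up
6 host 10.0.1.103 100.00% up
7 host 10.0.1.104 100.00% up
8 host 10.0.1.105 100.00% up
9 host 10.0.1.106 100.00% up
"""

def host_ids(summary: str, address: str):
    """IDs of every 'host' row whose address matches."""
    ids = []
    for line in summary.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == "host" and fields[2] == address:
            ids.append(fields[0])
    return ids

# Both instances of 10.0.1.101, one per table:
print(host_ids(summary, "10.0.1.101"))   # -> ['1', '4']
```

Disabling by ID in a loop sidesteps the by-address lookup that stops at the first hit, though fixing the lookup itself to iterate over all tables would obviously be the cleaner answer.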