On 2009-03-22 at 13:45 +0000, Kim Minh Kaplan wrote:
> I see what you mean here... Except that periodic DNS lookups are *not*
> The Right Thing. This is one area where I think SKS got it wrong: it
> should call out to the resolver each time it needs to connect to a
> server and let the caching happen in normal ways (DNS TTL). Please have
> a look at my other message "Keep DNS mappings fresh"[1]

Oh, right, sorry -- I forgot about that, because it was incompatible
as-is with the IPv6 work and so I didn't apply it.

> With my patch the "additional load" is bigger but it will still be
> minuscule when compared to the rest of the traffic needed for the
> reconciliation protocol anyway. This is not where we should look for
> optimization.

You're quite right -- relying on a functioning DNS cache is the correct
way to go (but see below) -- I was going for the quick and easy solution
(and feeling uncomfortable while doing so).

However, either the membership_reload_interval option needs to be
completely removed or the reconserver.ml needs to support it -- leaving
it as a dbserver-only option seems sub-optimal. Call it paranoia
resulting from maintaining mail-server code in previous employment,
where mtime collisions in a cluster were possible so relying on
mtime-changed was a bad plan.

So, "sks-mshp-timed2.patch" should probably go in -- it fixes the
mailsync reload (as both those patches do) and adds the event handler
for reload. With the default reload interval of 5 hours, the load
addition is minimal (understatement) and the benefit is that you gain
assurance that the file change *will* be picked up. Eventually.

> OTOH if the membership reload takes more than the gossip_interval and
> reconciliation_config_timeout setting (typically one minute) then the
> loading never finishes and the server never reconciles. It happened to
> me when three of my partners' nameservers went out of service. Making
> the lookup as needed solves this problem.

Yes, the reliance upon functional DNS is good. But not the looking up
in the main flow of control. This is where things get very sticky very
quickly.

As is, doesn't your patch lead to a recon connection from a non-peer
while one of your peers is without DNS being a mini-DoS attack? So once
you have a peer with bad DNS, you become susceptible to recon service
DDoS?

When there's no DNS for a peer, and you try to find the DNS, then the
local DNS cache will respond quickly for a period of time which is the
negative cache TTL imposed by that server for SERVFAIL caching. So you
go from the old scenario, where you're hung up every
membership_reload_interval/mtime-changed period, which is O(hours) to
hung up every negative cache-entry TTL expiry, which is O(minutes).

Provided you only get recon connections from peers, this only bites
when you get gossip from a peer with bad DNS. Which isn't going to be
too often, but still more often than the old reload interval. If you
also get recon connections from non-peers, suddenly your recon thread
is hung up at the whim of anyone willing to issue a connection every
few minutes. Fortunately, the level of impact only scales up with the
number of peers with bad DNS, so you'll still *mostly* be serving.

Thus while your patch is clearly trying to do the right thing, I think
it's a step backwards in resilience. (One more than offset by your
memory usage stability patch, but still ...)

The clearest way out of this is to require dbserver/reconserver to have
event handler callbacks for DNS, use asynchronous DNS callback
resolution; populate membership with None entries and at load/reload
fire off lookup for these. During connection check, if an IP entry is
None and the last reload was more than N seconds ago (!Settings knob,
default to 3. ?) then (1) fire off another async DNS resolution and
then (2) return failure immediately, so that the peer gets penalised
for flaky DNS and your server isn't hanging in the main flow of control.

The gotcha here is async DNS support in O'Caml.

I found an announcement for an O'Caml async DNS library called netdns:
  http://groups.google.com/group/fa.caml/browse_thread/thread/7bd2ae0a9415340d?pli=1
  http://oss.wink.com/netdns/
which is BSD-licensed and at version 0.1.
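
To make the connection-check idea above concrete, here is a minimal
sketch of the control flow only. It is not SKS code: the names peer,
check_connection, resolved and dns_retry_interval are all hypothetical,
the per-peer last_lookup timestamp stands in for "time since last
reload", and start_lookup is a placeholder for whatever asynchronous
resolver ends up being wrapped. The point it tries to show is that the
check never blocks on DNS.

```ocaml
(* Hypothetical sketch; none of these names exist in SKS. *)

(* A peer's address stays None until an asynchronous lookup fills it in. *)
type peer = {
  host : string;
  mutable addr : string option;   (* resolved IP as text, or None *)
  mutable last_lookup : float;    (* time the last lookup was started *)
}

(* Assumed settings knob: minimum seconds between lookups for one peer. *)
let dns_retry_interval = ref 3.0

type verdict =
  | Accept   (* caller matches a resolved peer *)
  | Reject   (* caller matches no resolved peer *)

(* [start_lookup] must return immediately; the event loop is expected to
   call [resolved] later, when (and if) an answer arrives. *)
let check_connection ~now ~start_lookup peers caller_ip =
  if List.exists (fun p -> p.addr = Some caller_ip) peers then Accept
  else begin
    (* (1) re-fire lookups for unresolved peers, rate-limited per peer *)
    List.iter
      (fun p ->
         if p.addr = None && now -. p.last_lookup > !dns_retry_interval
         then begin
           p.last_lookup <- now;
           start_lookup p
         end)
      peers;
    (* (2) fail at once instead of waiting in the main flow of control *)
    Reject
  end

(* Callback for the event loop to invoke when a lookup completes. *)
let resolved p ip = p.addr <- Some ip
```

With this shape, a connection from a non-peer costs at most one
non-blocking lookup request per unresolved peer per retry interval,
rather than a hung recon thread.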

netdns's main documented incompleteness is that it requires a full
resolver -- which is what we want here anyway. In addition, from
looking at it:
 * it doesn't support AAAA records
 * it uses incremental xids
 * and I think it's using a constant source port, so you'd really want
   to be using a localhost resolver; even then, since it's not matching
   source port, you're vulnerable.

I don't think this library is ready for use.

There's then "adns", which is GPL'd, with bindings in some languages
but not O'Caml (does include Haskell); "c-ares" which is BSD licensed,
does include IPv6 support, is widely used and I'm pretty sure it will
have dealt with the xid/port attacks. There are various smaller
libraries too, which don't manage to keep their websites working.

But all of these options will require wrapping the C calls with O'Caml
and tying into the event system -- I frankly lack the knowledge in
O'Caml to even estimate how much work this is.

So, in short, you've bitten off a bigger problem.

-Phil