Hi Aaron,
On 03/23/2010 04:05 PM, Aaron Hopkins wrote:
> On Tue, 23 Mar 2010, W.C.A. Wijngaards wrote:
>
>> The performance scales up fairly neatly as multi-threading goes. For
>> every configuration a slower-than-linear speedup is observed, indicating
>> locks in the underlying operating system network stack.
>
> There was no lock contention within unbound? I don't know how to measure
> this on Solaris, but did you?

Yes, it is visible. The no-threads version of unbound has no lock code in
it (it is macroed away), and thus has no lock contention. It has a
slightly better graph than the versions with locks (maybe a 5% difference
at 4 cores), so there is contention in unbound. In this example, with all
queries hitting the same cache element, the contention should be about as
high as it gets, I think.

>> There is only one network card, after all, and the CPUs have to lock and
>> synchronise with it.
>
> This should be true even with multiple processes, however.

Yes, this is what we see in the no-threads results. Those use processes,
but they still bind to the same port 53 socket.

> This may not be true for Solaris, but you might try having unbound listen
> on multiple ports and spread requests across them and see if it matters.

Yes, I have tried this. I got 2 more test machines to send queries from,
and modified unbound to open (num_threads)x UDP ports, with every Nth
worker listening on UDP port N.

A control check, with four perfs running towards unbound:

evport, forked, 4 senders:   9619  15860  19010  21979
evport, forked, 2 senders:   9700  17300  19600  22300

Similar, slightly slower.

The special version, where every process listens on its own UDP port and
the perfs each run towards one port: process0 and perf0 use port 30053,
process1 and perf1 use port 30054, process2 and perf2 use port 30055,
process3 and perf3 use port 30056.

evport, forked, special:    10000  18783  23461  25797

This is faster. It is not linear.
In this test, unbound has forked processes that do not lock mutexes or use
any pthread machinery. They all have a copy of the same file-descriptor
table, but the list of fds passed to evport is different (same TCP, but
different UDP) for every process. There are also some pipes in the
background for interprocess communication, but those are silent during
the test.

> The last time I looked, recent-ish Linux 2.6 still had per-socket locking
> even in the face of multiple network cards. This means that multiple
> threads or even multiple processes sharing a UDP socket can't really
> exceed one CPU's worth of raw sendto() performance sourced from the same
> socket. You can get much closer to linear scalability by binding to a
> different port or IP per CPU.

I am not sure it is worth it. Maybe some modifications can be made to the
UDP stack to make it more linear, but I do not know how.

Best regards, Wouter
