Nice piece, Sam. In addition, the OS will likely have cached spamdyke's config file(s) anyhow, so I expect any real performance gain would be negligible.
Bottom line to me is that there are a host of other inefficiencies (pardon the pun) that would bring a mail server to its knees long before optimization of spamdyke's config files could provide any relief.

Sam Clippinger wrote:
> Personally, I like the second option (adding options with "-cdb" for CDB
> files) rather than the first one (requiring a specific naming scheme).
>
> I've already implemented CDB support in the code for the next version,
> so spamdyke can read some of qmail's control files for recipient
> validation. Adding CDB support to other options wouldn't take much
> extra effort. The big question, of course, is whether it's worth it.
>
> I know DJB says CDB files are the bee's knees but I must say (after
> reading his docs, his source code and writing my own code for spamdyke)
> that I'm not impressed. I'm sure they're more efficient than text files
> for large amounts of data (hundreds of thousands of entries). But for
> small data sets (hundreds of entries) I don't believe they're any more
> efficient and for tiny data sets (ten entries) they are hugely
> wasteful. When you consider the additional headache of having to keep
> the CDB file in sync with the ASCII source, I really don't see the point.
>
> Of course I haven't benchmarked anything, so I could be way off base.
> DJB has a PhD and teaches computer science, I don't. He probably
> analyzed his hash functions to minimize collisions and compared
> operational complexities and so forth... academics do that kind of stuff
> for fun.
> In a nutshell, here's how a CDB file is accessed:
>     Calculate hash
>     Seek to position within CDB, read 64 bytes of data (primary hash table)
>     A few more calculations
>     Seek to another position within CDB, read another 64 bytes of data (secondary hash table)
>     A few more calculations
>     Seek to a third position within the CDB, read another 64 bytes of data (header entry)
>     Compare the header entry to the desired data
>     If it matches, seek to a fourth position within the CDB, read the data record
>     If it does not match, go back to the secondary hash table and look in the next "slot" for your data. Repeat until your data is found.
>
> Except for the secondary hash table, which I don't see a need for, this
> describes a textbook hash table from freshman computer science classes.
> The seek/read operations are the most expensive operations (the math
> takes no time at all) because they require the program to wait for
> access to a spinning disk. If everything goes well and there are no
> hash collisions, reading a single entry from a CDB file requires 4
> separate seek/read operations within the file. If things go badly and
> there are hash collisions, reading an entry from a CDB file may take
> many more read/seek operations (theoretically it could read the entire
> file). By comparison, when spamdyke reads a text file, it loads 64 KB
> at a time (if possible) and parses the lines in memory. This is a win
> when the file is small or the entry is near the beginning. It's a huge
> win when the file is tiny (like most /etc/tcp.smtp files).
>
> So I said all that to say this: I don't personally believe CDB files
> live up to the hype, nor do I believe they solve any real-world problems
> (they're still binary formats, they can't be shared between servers,
> etc) but if people want them I can support them.
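
For anyone who wants to count the seek/read operations for themselves, here is a rough Python sketch of the lookup sequence described above. It is not spamdyke's code. It assumes the on-disk layout from DJB's published cdb format (a 2048-byte pointer table, then the records, then the hash tables) and uses 8-byte reads rather than the 64-byte reads mentioned above. The names cdb_build, cdb_get and read_at are made up for illustration.

```python
import struct

def cdb_hash(key: bytes) -> int:
    # cdb's hash: start at 5381, then h = (h * 33) XOR byte, kept to 32 bits
    h = 5381
    for c in key:
        h = (((h << 5) + h) ^ c) & 0xFFFFFFFF
    return h

def cdb_build(items: dict) -> bytes:
    """Build a cdb image in memory (illustrative writer, not DJB's cdbmake)."""
    records = b""
    tables = [[] for _ in range(256)]
    pos = 2048  # records start right after the 256-entry pointer table
    for key, data in items.items():
        h = cdb_hash(key)
        tables[h & 255].append((h, pos))
        rec = struct.pack("<II", len(key), len(data)) + key + data
        records += rec
        pos += len(rec)
    header = b""
    body = b""
    for entries in tables:
        nslots = len(entries) * 2  # half-empty tables keep probe chains short
        slots = [(0, 0)] * nslots
        for h, p in entries:
            i = (h >> 8) % nslots
            while slots[i][1] != 0:  # linear probing on collision
                i = (i + 1) % nslots
            slots[i] = (h, p)
        header += struct.pack("<II", pos, nslots)
        for h, p in slots:
            body += struct.pack("<II", h, p)
        pos += 8 * nslots
    return header + records + body

def cdb_get(f, key: bytes):
    """Look up key in an open cdb file; return (data or None, seek/read count)."""
    reads = 0

    def read_at(offset, n):
        nonlocal reads
        reads += 1
        f.seek(offset)
        return f.read(n)

    h = cdb_hash(key)
    # 1st read: pointer-table entry (position and size of the hash table)
    table_pos, nslots = struct.unpack("<II", read_at((h & 255) * 8, 8))
    if nslots == 0:
        return None, reads
    start = (h >> 8) % nslots
    for step in range(nslots):
        slot = (start + step) % nslots
        # 2nd read (and one more per collision): a hash-table slot
        slot_hash, rec_pos = struct.unpack("<II", read_at(table_pos + slot * 8, 8))
        if rec_pos == 0:
            return None, reads  # empty slot: the key is not present
        if slot_hash != h:
            continue
        # 3rd read: record header (key length, data length)
        klen, dlen = struct.unpack("<II", read_at(rec_pos, 8))
        # 4th read: key and data together
        rec = read_at(rec_pos + 8, klen + dlen)
        if rec[:klen] == key:
            return rec[klen:], reads
    return None, reads
```

Calling cdb_get(io.BytesIO(cdb_build({b"user@example.com": b"1"})), b"user@example.com") returns the data along with a count of 4 reads, which lines up with the "4 separate seek/read operations" figure for a clean hit. Every collision adds at least one more slot read. Whether any of those reads actually waits on a disk depends on whether the file is already in the OS cache.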
>
> -- Sam Clippinger
>
> lenn...@wu-wien.ac.at wrote:
>> Dear all,
>>
>> I have been reading up on the discussions on this list as well as the
>> concerns about databases in the FAQ. Whilst I concur with most of the
>> points wrt. to a fully fledged SQL database, I think that CDBs are
>> ideally suited for the purposes of spamdyke. Sam states in the FAQ
>> that speed, memory, concurrency, portability and availability are not
>> a concern with CDBs and I agree, especially on the speed issue. After
>> all, that was what the hash file format was designed for.
>>
>> That leaves accessibility and safety for CDBs. It is true that the
>> database itself is in binary form (that is where the speed comes
>> from), which means that they cannot be easily viewed and checked for
>> errors. At the same time, they are read only and are usually generated
>> from a plain text file as input. There is no reason to not have that
>> text file sitting next to the actual database file, which means we
>> have all the advantages of a plain text file plus the speed benefit of
>> CDBs, which can be substantial for a lot of entries. The only
>> additional step required (by the admin) would be to convert the text
>> file into the CDB. We could also have the best of both worlds like
>> this. Suppose we have this entry in the configuration file:
>>
>> recipient-blacklist-file=/etc/spamdyke/recipient-blacklist
>>
>>
>> First, we look for a file with the name
>> /etc/spamdyke/recipient-blacklist.cdb. If it exists, we assume it is a
>> CDB version of /etc/spamdyke/recipient-blacklist and look up whatever
>> we need there. If recipient-blacklist.cdb has an earlier modification
>> time than recipient-blacklist (we get that for free anyway with a
>> stat() on both files), we could help the admin by printing a warning
>> that the CDB is probably out of date and read from recipient-blacklist
>> instead. If recipient-blacklist.cdb does not exist, we use
>> recipient-blacklist in ASCII format like before.
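
The lookup-with-fallback rule proposed above is small enough to sketch. This is only an illustration in Python, not anything spamdyke does today. The function name pick_lookup_file and the wording of the warning are hypothetical, and it assumes the CDB is always named by appending ".cdb" to the configured path.

```python
import os

def pick_lookup_file(text_path: str):
    """Return (path_to_read, warning_or_None) for a configured text file.

    Hypothetical helper: prefers <text_path>.cdb when it exists and is not
    older than the text file, otherwise falls back to the text file.
    """
    cdb_path = text_path + ".cdb"
    try:
        cdb_mtime = os.stat(cdb_path).st_mtime
    except FileNotFoundError:
        # No CDB next to the text file: behave exactly as before.
        return text_path, None
    try:
        text_mtime = os.stat(text_path).st_mtime
    except FileNotFoundError:
        # Only the CDB exists, so there is nothing to compare it against.
        return cdb_path, None
    if cdb_mtime < text_mtime:
        # The text file was edited after the CDB was built.
        warning = "%s is older than %s; reading the text file instead" % (
            cdb_path, text_path)
        return text_path, warning
    return cdb_path, None
```

With recipient-blacklist-file=/etc/spamdyke/recipient-blacklist, the call would be pick_lookup_file("/etc/spamdyke/recipient-blacklist"). The cost is two stat() calls per configured file, and an admin who forgets to rebuild the CDB gets the slower but correct text lookup rather than stale data.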
>>
>>
>> Another version of this would be to have lots of new configuration
>> options like:
>>
>> recipient-blacklist-file-cdb=/etc/spamdyke/recipient-blacklist.cdb
>>
>> That makes it possible to name the database file arbitrarily. If we
>> want the safety checks like in the example above we could make it
>> mandatory to name the ASCII input file for the CDB database file:
>>
>> recipient-blacklist-file=/etc/spamdyke/recipient-blacklist
>> recipient-blacklist-file-cdb=/etc/spamdyke/recipient-blacklist.cdb
>>
>> That way all the fallbacks to ASCII plus warnings can be implemented at
>> the cost of more configuration entries.
>>
>>
>> What do you think?
>>
>>

-- 
-Eric 'shubes'

_______________________________________________
spamdyke-users mailing list
spamdyke-users@spamdyke.org
http://www.spamdyke.org/mailman/listinfo/spamdyke-users