Re: [spamdyke-users] Databases revisited

Eric Shubert Thu, 22 Oct 2009 12:26:28 -0700

Nice piece, Sam.

In addition, the OS will likely have cached spamdyke's config file(s) 
anyhow, so I expect any real performance gain would be negligible.


The bottom line to me is that there are a host of other inefficiencies
(pardon the pun) that would bring a mail server to its knees long before 
optimization of spamdyke's config files could provide any relief.

Sam Clippinger wrote:
> Personally, I like the second option (adding options with "-cdb" for CDB 
> files) rather than the first one (requiring a specific naming scheme).
> 
> I've already implemented CDB support in the code for the next version, 
> so spamdyke can read some of qmail's control files for recipient 
> validation.  Adding CDB support to other options wouldn't take much 
> extra effort.  The big question, of course, is whether it's worth it.
> 
> I know DJB says CDB files are the bee's knees but I must say (after 
> reading his docs, his source code and writing my own code for spamdyke) 
> that I'm not impressed.  I'm sure they're more efficient than text files 
> for large amounts of data (hundreds of thousands of entries).  But for 
> small data sets (hundreds of entries) I don't believe they're any more 
> efficient and for tiny data sets (ten entries) they are hugely 
> wasteful.  When you consider the additional headache of having to keep 
> the CDB file in sync with the ASCII source, I really don't see the point.
> 
> Of course I haven't benchmarked anything, so I could be way off base.  
> DJB has a PhD and teaches computer science; I don't.  He probably 
> analyzed his hash functions to minimize collisions and compared 
> operational complexities and so forth... academics do that kind of stuff 
> for fun.  In a nutshell, here's how a CDB file is accessed:
>     1. Calculate hash
>     2. Seek to position within CDB, read 64 bytes of data
>        (primary hash table)
>     3. A few more calculations
>     4. Seek to another position within CDB, read another 64 bytes of
>        data (secondary hash table)
>     5. A few more calculations
>     6. Seek to a third position within the CDB, read another 64 bytes
>        of data (header entry)
>     7. Compare the header entry to the desired data
>     8. If it matches, seek to a fourth position within the CDB, read
>        the data record
>     9. If it does not match, go back to the secondary hash table and
>        look in the next "slot" for your data. Repeat until your data
>        is found.
> 
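
The lookup steps above can be sketched in Python. This is a rough illustration and not spamdyke's code: the helper names cdb_hash, cdb_build and cdb_get are made up for this example, and the layout follows the published cdb format as I understand it (a 2048-byte table of 256 pointers, then records, then hash tables), where each fixed-size read is 8 bytes (two 32-bit little-endian integers). The "reads" counter is only there to make the number of seek/read round trips visible.

```python
import io
import struct

def cdb_hash(key: bytes) -> int:
    # cdb hash: start at 5381, then h = (h * 33) XOR byte, kept to 32 bits.
    h = 5381
    for c in key:
        h = (((h << 5) + h) ^ c) & 0xFFFFFFFF
    return h

def cdb_build(pairs):
    """Build an in-memory cdb image from (key, value) byte pairs (hypothetical helper)."""
    records = bytearray()
    buckets = [[] for _ in range(256)]
    pos = 2048  # records start right after the 256-entry pointer table
    for key, val in pairs:
        h = cdb_hash(key)
        buckets[h & 255].append((h, pos))
        rec = struct.pack("<II", len(key), len(val)) + key + val
        records += rec
        pos += len(rec)
    header = bytearray()
    tables = bytearray()
    for bucket in buckets:
        nslots = 2 * len(bucket)  # each table has twice as many slots as entries
        header += struct.pack("<II", pos, nslots)
        slots = [(0, 0)] * nslots
        for h, p in bucket:
            i = (h >> 8) % nslots
            while slots[i][1] != 0:  # linear probing on collision
                i = (i + 1) % nslots
            slots[i] = (h, p)
        for h, p in slots:
            tables += struct.pack("<II", h, p)
        pos += 8 * nslots
    return bytes(header + records + tables)

def cdb_get(f, key: bytes):
    """Return (value or None, number of read operations performed)."""
    h = cdb_hash(key)
    # Read 1: pointer table entry (the "primary hash table" above).
    f.seek((h & 255) * 8)
    table_pos, nslots = struct.unpack("<II", f.read(8))
    reads = 1
    if nslots == 0:
        return None, reads
    start = (h >> 8) % nslots
    for n in range(nslots):
        # Read 2 (and more on collisions): one slot of the second-level table.
        f.seek(table_pos + 8 * ((start + n) % nslots))
        slot_hash, rec_pos = struct.unpack("<II", f.read(8))
        reads += 1
        if rec_pos == 0:       # empty slot: the key is not in the file
            return None, reads
        if slot_hash != h:     # different hash: try the next slot
            continue
        # Read 3: record header (key length, data length).
        f.seek(rec_pos)
        klen, dlen = struct.unpack("<II", f.read(8))
        reads += 1
        # Read 4: the key and data that follow the header.
        body = f.read(klen + dlen)
        reads += 1
        if body[:klen] == key:
            return body[klen:], reads
    return None, reads

image = io.BytesIO(cdb_build([(b"alice@example.com", b"1")]))
value, reads = cdb_get(image, b"alice@example.com")
```

With a single entry and no collisions, the lookup in this sketch costs four reads, which lines up with the count in the description above. The addresses and values are placeholders.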
> Except for the secondary hash table, which I don't see a need for, this 
> describes a textbook hash table from freshman computer science classes.  
> The seek/read operations are the most expensive operations (the math 
> takes no time at all) because they require the program to wait for 
> access to a spinning disk.  If everything goes well and there are no 
> hash collisions, reading a single entry from a CDB file requires 4 
> separate seek/read operations within the file.  If things go badly and 
> there are hash collisions, reading an entry from a CDB file may take 
> many more seek/read operations (theoretically it could read the entire 
> file).  By comparison, when spamdyke reads a text file, it loads 64 KB 
> at a time (if possible) and parses the lines in memory.  This is a win 
> when the file is small or the entry is near the beginning.  It's a huge 
> win when the file is tiny (like most /etc/tcp.smtp files).
> 
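
For comparison, the text-file approach described above (load up to 64 KB at a time, parse the lines in memory) might look roughly like this. It is a sketch under assumptions and not spamdyke's actual parser: the function name find_entry is hypothetical, entries are assumed to be one per line, and matching is a plain exact comparison with no wildcard handling.

```python
import io

CHUNK = 64 * 1024  # 64 KB per read, as described above

def find_entry(f, wanted: bytes):
    """Scan a line-oriented file for an exact match.

    Returns (found, reads), where reads is the number of chunks loaded.
    """
    reads = 0
    tail = b""  # partial line carried over from the previous chunk
    while True:
        chunk = f.read(CHUNK)
        if not chunk:
            break
        reads += 1
        lines = (tail + chunk).split(b"\n")
        tail = lines.pop()  # last piece may be an incomplete line
        for line in lines:
            if line.strip() == wanted:
                return True, reads
    # A final line without a trailing newline ends up in tail.
    return tail.strip() == wanted, reads

# A tiny file (ten entries) is fully covered by a single read.
tiny = b"".join(b"user%d@example.com\n" % i for i in range(10))
found, reads = find_entry(io.BytesIO(tiny), b"user9@example.com")
```

For a file that fits in one chunk, every lookup costs one read regardless of where the entry sits; the cost only grows once the file spans several chunks and the entry is far from the beginning.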
> So I said all that to say this: I don't personally believe CDB files 
> live up to the hype, nor do I believe they solve any real-world problems 
> (they're still binary formats, they can't be shared between servers, 
> etc.), but if people want them I can support them.
> 
> -- Sam Clippinger
> 
> lenn...@wu-wien.ac.at wrote:
>> Dear all,
>>
>> I have been reading up on the discussions on this list as well as the
>> concerns about databases in the FAQ. Whilst I concur with most of the
>> points with respect to a fully fledged SQL database, I think that CDBs are
>> ideally suited for the purposes of spamdyke. Sam states in the FAQ
>> that speed, memory, concurrency, portability and availability are not
>> a concern with CDBs and I agree, especially on the speed issue. After
>> all, that was what the hash file format was designed for. 
>>
>> That leaves accessibility and safety for CDBs. It is true that the
>> database itself is in binary form (that is where the speed comes
>> from), which means that they cannot be easily viewed and checked for
>> errors. At the same time, they are read-only and are usually generated
>> from a plain text file as input. There is no reason not to have that
>> text file sitting next to the actual database file, which means we
>> have all the advantages of a plain text file plus the speed benefit of
>> CDBs, which can be substantial for a lot of entries. The only
>> additional step required (by the admin) would be to convert the text
>> file into the CDB. We could also have the best of both worlds like
>> this. Suppose we have this entry in the configuration file:
>>
>> recipient-blacklist-file=/etc/spamdyke/recipient-blacklist
>>
>>
>> First, we look for a file with the name
>> /etc/spamdyke/recipient-blacklist.cdb. If it exists, we assume it is a
>> CDB version of /etc/spamdyke/recipient-blacklist and look up whatever
>> we need there. If recipient-blacklist.cdb has an earlier modification
>> time than recipient-blacklist (we get that for free anyway with a
>> stat() on both files), we could help the admin by printing a warning
>> that the CDB is probably out of date and read from recipient-blacklist
>> instead. If recipient-blacklist.cdb does not exist, we use
>> recipient-blacklist in ASCII format like before.
>>
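
The fallback logic proposed above could be sketched as follows. This is only an illustration of the idea, not anything spamdyke implements: the function name choose_source and the warning text are made up, and the case where the .cdb exists but the text file is missing is not covered by the proposal, so using the CDB there is my assumption.

```python
import os

def choose_source(text_path: str):
    """Pick which file to read for a given option value.

    Returns (path_to_use, warning_or_None).
    """
    cdb_path = text_path + ".cdb"
    try:
        cdb_mtime = os.stat(cdb_path).st_mtime
    except FileNotFoundError:
        # No CDB next to the text file: use the ASCII file as before.
        return text_path, None
    try:
        text_mtime = os.stat(text_path).st_mtime
    except FileNotFoundError:
        # Assumption: with no text source present, fall back to the CDB.
        return cdb_path, None
    if cdb_mtime < text_mtime:
        # The CDB is older than its source, so it is probably out of date.
        warning = ("warning: %s is older than %s; reading the text file instead"
                   % (cdb_path, text_path))
        return text_path, warning
    return cdb_path, None
```

Called with "/etc/spamdyke/recipient-blacklist", it would look for "/etc/spamdyke/recipient-blacklist.cdb" and compare modification times with one stat() per file.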
>>
>> Another version of this would be to have lots of new configuration
>> options like:
>>
>> recipient-blacklist-file-cdb=/etc/spamdyke/recipient-blacklist.cdb
>>
>> That makes it possible to name the database file arbitrarily. If we
>> want the safety checks like in the example above we could make it
>> mandatory to name the ASCII input file for the CDB database file:
>>
>> recipient-blacklist-file=/etc/spamdyke/recipient-blacklist
>> recipient-blacklist-file-cdb=/etc/spamdyke/recipient-blacklist.cdb
>>
>> That way all the fallbacks to ASCII plus warnings can be implemented at
>> the cost of more configuration entries.
>>
>>
>> What do you think?
>>
>>   


-- 
-Eric 'shubes'

_______________________________________________
spamdyke-users mailing list
spamdyke-users@spamdyke.org
http://www.spamdyke.org/mailman/listinfo/spamdyke-users
