Andy Dills wrote:
On Mon, 8 Jan 2007, Jorge Valdes wrote:
I do understand that in large environments, optimizations have to be made in
order not to kill server performance, and expiration is probably something
that could be done at "more convenient times". I will commit a script that
can safely be run as a cronjob soon.
Excellent.
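The expiry cron job mentioned above might look something like the sketch below. The table name `Hash` and the `check` timestamp column are borrowed from later in this thread and may not match your actual schema; SQLite is used here only to keep the demo self-contained, since with MySQL the DELETE statement is identical and only the connection setup differs.

```python
#!/usr/bin/env python3
"""Cron-style expiry sketch for an image-hash database.

Assumes a table named ``Hash`` with a ``check`` column holding the
Unix time of the last match -- adjust names to your real schema.
"""
import sqlite3
import time

MAX_AGE_DAYS = 21  # expire hashes not matched in this many days


def expire(conn, max_age_days=MAX_AGE_DAYS):
    """Delete rows whose last match is older than the cutoff."""
    cutoff = time.time() - max_age_days * 86400
    cur = conn.execute('DELETE FROM Hash WHERE "check" < ?', (cutoff,))
    conn.commit()
    return cur.rowcount  # number of rows expired


if __name__ == "__main__":
    # Self-contained demo with one stale and one fresh row.
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE Hash (digest TEXT, "check" INTEGER)')
    now = int(time.time())
    conn.execute("INSERT INTO Hash VALUES (?, ?)", ("old", now - 30 * 86400))
    conn.execute("INSERT INTO Hash VALUES (?, ?)", ("new", now - 86400))
    print(expire(conn))  # the 30-day-old row is removed
```

Run from cron at an off-peak hour so the DELETE does not compete with mail-time lookups.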
I understand that the "order" keyword in the select is potentially expensive, but it is necessary because matches generally occur towards the most recent entries, which increases the chance of a match early in the scan. When your hash count is in the thousands, earlier matches mean fewer queries to the database, and potentially faster results.
It's not just the order directive, it's the iteration throughout the
entire database.
Consider when the database grows to >50k records. For a new image that
doesn't have a hash, that's 50k records that must be sorted then sent from
the DB server to the mail server, then all 50k records must be checked
against the hash before we decide that we haven't seen this image before.
That just isn't a workable algorithm. If iteration throughout the entire
database is a requirement, hashing is a performance hit rather than a
performance gain.
A better solution might be a separate daemon that holds the hashes in
memory, to which you submit the hash being considered.
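A minimal sketch of the daemon Andy describes: the mail server submits a hash over a socket, and the daemon answers whether it has seen it before while recording it. The line-based protocol, the port handling, and the plain in-memory set are all assumptions; a real deployment would add expiry and periodic persistence.

```python
"""In-memory hash daemon sketch: one line in (a hash), one line out
(SEEN or NEW). All state lives in a set guarded by a lock."""
import socket
import socketserver
import threading


class HashDaemon(socketserver.ThreadingTCPServer):
    allow_reuse_address = True
    daemon_threads = True

    def __init__(self, addr):
        super().__init__(addr, HashHandler)
        self.seen = set()            # all hashes held in memory
        self.lock = threading.Lock()


class HashHandler(socketserver.StreamRequestHandler):
    def handle(self):
        digest = self.rfile.readline().strip()
        with self.server.lock:
            known = digest in self.server.seen
            self.server.seen.add(digest)  # remember for next time
        self.wfile.write(b"SEEN\n" if known else b"NEW\n")


def query(addr, digest):
    """Client side: send one hash, read back SEEN or NEW."""
    with socket.create_connection(addr) as s:
        s.sendall(digest + b"\n")
        return s.makefile().readline().strip()


if __name__ == "__main__":
    srv = HashDaemon(("127.0.0.1", 0))  # port 0 = pick a free port
    threading.Thread(target=srv.serve_forever, daemon=True).start()
    addr = srv.server_address
    print(query(addr, b"deadbeef"))  # NEW
    print(query(addr, b"deadbeef"))  # SEEN
    srv.shutdown()
```

The point of the design is that membership in a set is O(1), so the lookup cost no longer grows with the size of the database the way a full-table scan does.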
Honestly, I have been extremely impressed with having hashing turned
completely off.
Andy
---
Andy Dills
Xecunet, Inc.
www.xecu.net
301-682-9972
---
Right now my DB is ~21K records, and I expire after 21 days... I could
always reduce the size of the DB by expiring sooner. The default value
is set to 35 days, a little over one month (5 full weeks), so tuning
this value could help you out. After looking at my logs, ~2/3 of the
matches happen within 24hrs, so keeping matches for just 24 hours would
get me 2/3 of the way there; you can always rescan the images from the
other 1/3 of the messages, which in your case will probably be faster
than looking for the database match.
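The back-of-the-envelope reasoning above can be checked against your own logs: given the age of each hash at the moment it matched, compute what fraction of matches a shorter expiry would still have caught. The ages below are made-up sample values; feed in numbers parsed from your logs.

```python
"""What fraction of matches would survive a shorter expiry window?"""


def coverage(match_ages_hours, expiry_hours):
    """Fraction of matches whose hash was younger than the expiry."""
    kept = sum(1 for age in match_ages_hours if age < expiry_hours)
    return kept / len(match_ages_hours)


if __name__ == "__main__":
    ages = [2, 5, 11, 20, 23, 30, 70, 200, 400]  # hypothetical sample
    for days in (1, 3, 21):
        pct = coverage(ages, days * 24)
        print(f"{days:2d}-day expiry keeps {pct:.0%} of matches")
```

If most of the coverage is already there at 1-3 days, the records beyond that window are mostly dead weight in every lookup.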
Remember, when working in large environments, optimization of resources
is key, so here are a few suggestions:
+ expiring the DB after only 1-3 days may be the optimal setting for
you, since this will reduce the number of records in the DB while still
reaping the benefits of saving the hashes. Check your logs...
+ using BerkeleyDB on a ramdisk will certainly be faster; just make sure
the ramdisk will not run out of free space (not generally a problem).
Also, remember to copy the DB from ramdisk to hard disk periodically
so you have a backup in case of a system restart, or you will lose
the DB.
+ tune your MySQL setup, including but not limited to adding another
index on 'Hash.check' to reduce sorting times, allocating more RAM for
sorting, etc.
+ if you use the MySQL solution to share the database among several
SMTP servers, use a dedicated MySQL server, possibly using that machine
for other common tasks as well.
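The index suggestion above can be sketched as follows. An index on the timestamp column lets the database walk the index in order instead of sorting the whole result set on every lookup. The demo uses SQLite so it runs self-contained; the MySQL equivalent is simply `CREATE INDEX idx_check ON Hash (`check`);`. Table and column names are taken from this thread and may differ in your schema.

```python
"""Show that an index on "check" satisfies ORDER BY without a sort."""
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE Hash (digest TEXT, "check" INTEGER)')
conn.execute('CREATE INDEX idx_check ON Hash ("check")')

# With the index in place, the planner scans idx_check in order
# rather than building a temporary B-tree to sort the rows.
plan = conn.execute(
    'EXPLAIN QUERY PLAN SELECT digest FROM Hash ORDER BY "check" DESC'
).fetchall()
print(plan)
```

On MySQL, `EXPLAIN` on the same query should stop showing "Using filesort" once the index exists.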
Remember that the solution you implement depends largely on the
resources you have available. There are other ways to reduce the
amount of work sent to the plugin: the Botnet plugin helps a lot, and
so does setting *focr_autodisable_score* to a value better suited to
your situation (default: 10), since most people raise it in order to
test the plugin and never reset it afterwards.
Jorge.