From: "Ole Kasper Olsen" <[EMAIL PROTECTED]>

Hi,

I am a developer on a fairly large community site (30,000-50,000 active users) with blogs, photo albums and forums.

I spent yesterday tinkering with a spam prevention system which runs each new comment on a blog post or image in a photo album through SpamAssassin. I take the provided comment, assemble an RFC822-compliant message based on the user's IP address and the sender's and receiver's registered email addresses, and then run it through Mail::SpamAssassin (the Perl module) with default settings.
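
Simplified, the scanning code looks something like this (untested excerpt; the hostname, addresses and date are illustrative):

  use strict;
  use warnings;
  use Mail::SpamAssassin;

  # Assemble an RFC822-style message from a comment. The Received:
  # header records the commenter's IP; the "by" host is illustrative.
  sub comment_to_mail {
      my ($ip, $from, $to, $body) = @_;
      return join("\n",
          "Received: from [$ip] by www.example.com with HTTP;"
              . " Thu, 19 Aug 2004 12:00:00 +0200",
          "From: $from",
          "To: $to",
          "Subject: blog comment",
          "",
          $body, "");
  }

  my $sa     = Mail::SpamAssassin->new();
  my $mail   = $sa->parse(comment_to_mail('192.0.2.45',
                   'poster@example.com', 'owner@example.com',
                   'Nice post!'));
  my $status = $sa->check($mail);
  printf "score %.1f, spam: %s\n",
      $status->get_hits(), ($status->is_spam() ? "yes" : "no");
  $status->finish();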

First off I'd join the spamassassin-users list at spamassassin.apache.org.
Then I'd post this message to the list.

I think this is a good basic idea, although the tool is not really
designed for this sort of thing. I suspect you will have trouble with
ALL_TRUSTED and a few other things if you do not include proper
Received: headers that would track the "path" via which the message
was received. Since you do not know the poster's ISP's smarthost
in all cases, you can end up falsely triggering a lot of rules that
are based on things like dialup address ranges. I believe, but am
not sure, that this phenomenon intrudes on the Bayes results, too.
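
If you do synthesize a Received: header carrying the poster's IP, also
tell SpamAssassin which relays to trust, or the whole path looks internal
and ALL_TRUSTED fires on every comment. One line in local.cf (address
illustrative):

  # treat only the web server itself as a trusted relay
  trusted_networks 192.0.2.1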

In the best of all possible worlds you'd need a very carefully pruned
set of rules and may end up having to manually train Bayes. (You might
want to make provisions for this in your scripting setup.) This can lead
to considerable processing time per message. SpamAssassin is hungry for
CPU cycles. (I run here with a very large number of rule sets on a 1.8GHz
Athlon system with a gigabyte of RAM. An average spam takes over 3 seconds
to get scanned. About half to three quarters of this time is CPU cycles.
This is highly dependent on rule sets chosen, of course.)

This seems to work. At least it intercepts the test message provided in the SpamAssassin documentation.

This system requires me to have a utility where people can mark spam as ham in case SpamAssassin wrongly identifies a valid comment as spam. I was planning on having this utility teach the Bayesian filter on a community-wide basis, i.e. for all users. Therefore, people cannot mark their own messages as ham; this is to guard against spammers teaching the filter wrongly.

 - Is learning a good idea at all in this setting?

I'm shooting from the hip on this one and have a bias for carefully
manually trained Bayes, at least at first. Also, in general, the learning
thresholds for both spam and ham need to be adjusted carefully.

Of course, "learning" is required not just a good idea. It's how you do
the learning that is at issue. Automated learning can be risky with bad
threshold values and inadequate initial training. Over time you could
probably move to automated training (and automated expires) safely
enough.
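
For the manual training, the module you are already using can do the
learning directly; an untested sketch, method names as in the
Mail::SpamAssassin documentation:

  use Mail::SpamAssassin;

  my $sa = Mail::SpamAssassin->new();
  $sa->init_learner();

  # $raw is the same RFC822 text you scanned; $is_spam is the
  # moderator's verdict (1 = spam, 0 = ham).
  sub moderate_learn {
      my ($raw, $is_spam) = @_;
      my $mail   = $sa->parse($raw);
      my $status = $sa->learn($mail, undef, $is_spam, 0);
      my $ok     = $status->did_learn();
      $status->finish();
      return $ok;
  }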

This leaves the real potential poison for your system, the auto-whitelist.
Turn it off. You probably cannot afford its misfires. Manually whitelist
those who must be whitelisted.
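
That is one line in local.cf:

  use_auto_whitelist 0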

- If so, what are the advantages and more importantly disadvantages of having community-wide learning?

For a blog I'd break with my other strong bias for per-user Bayes and
choose site-wide. All users should have the same "experience" in the
blogs. Otherwise you'll get a large "Hunh?" factor from people not
seeing the original post in a discussion chain.
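
Site-wide Bayes just means pointing every scan at one database in
local.cf (path illustrative):

  bayes_path      /var/spamassassin/bayes/bayes
  # must be writable by whatever runs the scans
  bayes_file_mode 0770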

   - Should I use autolearning?

See above. If you use it be very careful. Set thresholds wider than stock.
And do not even consider using auto-whitelisting.
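
In local.cf terms, something like this (option names per SpamAssassin
3.x; values illustrative, stock is 0.1 for ham and 12.0 for spam):

  bayes_auto_learn                   1
  # learn ham only from very low scores, spam only from very high ones
  bayes_auto_learn_threshold_nonspam 0.01
  bayes_auto_learn_threshold_spam    15.0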

- Is there anything else I should be aware of when implementing SpamAssassin in this setting?
   - Settings
   - Thresholds
   - &c?

Do not use the same SpamAssassin setup for both email and blog. If they
must run on the same machine, check the man pages and use alternate
configpath and siteconfigpath settings.

I'd be sure to use spamd/spamc rather than spamassassin itself. This
cuts down CPU requirements considerably. If mail runs on the same
machine, run two spamd's with different pid files and port numbers.
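
For instance (ports and paths illustrative):

  # mail scanning, stock config
  spamd -d -p 783 -r /var/run/spamd-mail.pid
  # blog/forum scanning, its own rules and Bayes
  spamd -d -p 784 -r /var/run/spamd-blog.pid \
        --siteconfigpath=/etc/spamassassin-blog

  # from the web app, feed comments to the second instance
  spamc -p 784 < comment.rfc822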

After testing this a bit on comments, I hope to expand to blog posts and forum posts as well, so that moderators get a heads-up when people post spam.

This may work. It's not what SpamAssassin is designed to do, but
perverting it to blog and forum use may prove a significant aid. In both
cases sitewide rules and Bayes are required, IMAO. However, you MAY want
different learning and custom rules for SOME forums and blogs if the
machine is running multiple blogs. In that case you have an interesting
setup challenge facing you. I believe it can be done if each "entity"
for which different rules are needed is made a "user" with a
"/home/<entity>" directory into which its ".spamassassin" files can be
placed. (I'd set the default shells for these users to /bin/nologin or
some such.)
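
Roughly the layout I have in mind (names illustrative):

  /home/blog-alpha/.spamassassin/user_prefs   # per-entity rules
  /home/blog-alpha/.spamassassin/bayes_toks   # per-entity Bayes db
  /home/forum-main/.spamassassin/user_prefs

  # then select the right "user" per request
  spamc -p 784 -u blog-alpha < comment.rfc822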

As mentioned above, spamassassin-users is probably your best shot for
some good thought and help. But you might also get the authors saying
"This can't be done!" Emphasize you have it partially working and need
to fine-tune the concept, ideally without having to spawn off a special
blog version of spamassassin with a different name.

{^_^}   Joanne
