Hi,
I am a developer on a fairly large community site (30,000-50,000 active users)
with blogs, photo albums and forums.
I spent yesterday tinkering with a spam prevention system which runs each
new comment on a blog post or photo album image through SpamAssassin.
I take the submitted comment and assemble an RFC 822-compliant message based
on the user's IP address and the sender's and receiver's registered email
addresses, and then run it through Mail::SpamAssassin (the Perl module)
with default settings.
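
A minimal sketch of that assembly step, in Python for illustration only.
The function name, the placeholder Subject, the community.example domain
and the synthetic Received header used to carry the IP address are all
assumptions made for the example, not a description of the production code.

```python
from email.message import EmailMessage
from email.utils import formatdate, make_msgid


def comment_to_message(comment_text, sender_addr, recipient_addr, client_ip):
    """Wrap a site comment in a minimal RFC 822-style message string."""
    msg = EmailMessage()
    msg["From"] = sender_addr        # commenter's registered address
    msg["To"] = recipient_addr       # blog or album owner's registered address
    msg["Subject"] = "Comment"       # placeholder subject (assumption)
    msg["Date"] = formatdate(localtime=False)
    msg["Message-ID"] = make_msgid(domain="community.example")
    # Assumption: the commenter's IP is exposed through a synthetic
    # Received header; the exact header layout is hypothetical.
    msg["Received"] = f"from [{client_ip}] by community.example"
    msg.set_content(comment_text)
    return msg.as_string()


if __name__ == "__main__":
    print(comment_to_message("Nice photo!", "alice@example.org",
                             "bob@example.org", "203.0.113.7"))
```

The resulting string is the kind of input that would then be handed to
Mail::SpamAssassin's parse() and check() methods on the Perl side.
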
This seems to work; at least it catches the test message provided in
the SpamAssassin documentation.
This system requires me to have a utility where people can re-mark a
comment as ham when SpamAssassin wrongly identifies a valid comment as
spam. I am planning to have this utility train the Bayesian filter on a
community-wide basis, i.e. for all users. Therefore, people cannot mark
their own messages as ham. This is to guard against spammers training the
filter wrongly.
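
The "no self-marking" rule could be sketched roughly as below (Python
again, for illustration only). The function name and the user-ID
parameters are hypothetical; only reports that pass this check would be
fed to the shared Bayesian database, for example via sa-learn --ham.

```python
def may_report_ham(reporter_id, comment_author_id):
    """Return True if this user's ham report may train the shared filter."""
    # Authors cannot vouch for their own comments; this keeps a spammer
    # from teaching the community-wide filter that their own spam is ham.
    return reporter_id != comment_author_id


if __name__ == "__main__":
    print(may_report_ham(reporter_id=17, comment_author_id=42))
    print(may_report_ham(reporter_id=42, comment_author_id=42))
```
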
- Is learning a good idea at all in this setting?
- If so, what are the advantages and, more importantly, the disadvantages
of community-wide learning?
- Should I use autolearning?
- Is there anything else I should be aware of when implementing
SpamAssassin in this setting?
  - Settings
  - Thresholds
  - &c?
After testing this a bit on comments, I hope to expand to blog posts and
forum posts as well, so that moderators get a heads-up when people post
spam.
--
Ole Kasper Olsen
Information Systems Developer
Opera Software ASA