Re: Site-wide bayes and individual bayes

Ted Mittelstaedt Sun, 12 Oct 2014 13:56:14 -0700


On 10/12/2014 9:59 AM, LuKreme wrote:

On 10 Oct 2014, at 06:49 , RW<rwmailli...@googlemail.com>  wrote:

And, if not, is it generally better to do sitewide?


It's hard to say, there are advantages and disadvantages either
way.


OK, so specific example then.

Small server with a few dozen email users spread over several
domains. Almost none of these users does any spam training at all,
the rest just delete unwanted messages (not even marking them as
junk) or even worse, just ignore them. One user is very aggressive in
marking Spam and in keeping the Inbox clear of all spam.

I am of two minds. First, that everyone else would benefit from this
user’s actions or, alternatively, that the user’s aggressive tagging
will actually ‘poison’ the bayes db for the other users who maybe do
not think that endless emails from pinterest or some political
candidate are actually spam.


For starters, your problem isn't SPAM, it's HAM.

You can get all the spam you want.  Just parse the mail log file every
day for a few weeks, looking for delivery attempts to nonexistent
mailboxes. When you see repeated delivery attempts to a specific
mailbox, create an email address for that nonexistent mailbox and
redirect all the email sent to it into a spam box.
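
A minimal sketch of that log scan, assuming Postfix-style "User unknown"
rejection lines; the pattern, the function names and the threshold of 3
attempts are illustrative assumptions, so adjust them for your own MTA
and log format:

```python
import re
from collections import Counter

# Assumed Postfix-style rejection line; other MTAs log this differently.
REJECT_RE = re.compile(
    r"reject: RCPT from .*?<(?P<rcpt>[^<>@\s]+@[^<>\s]+)>: "
    r"Recipient address rejected: User unknown",
    re.IGNORECASE,
)


def count_unknown_recipients(lines):
    """Count rejected delivery attempts per nonexistent recipient."""
    counts = Counter()
    for line in lines:
        match = REJECT_RE.search(line)
        if match:
            counts[match.group("rcpt").lower()] += 1
    return counts


def honeypot_candidates(lines, min_attempts=3):
    """Return addresses probed at least min_attempts times, busiest first."""
    counts = count_unknown_recipients(lines)
    return [addr for addr, n in counts.most_common() if n >= min_attempts]
```

You could feed it an open log file, for example
honeypot_candidates(open("/var/log/mail.log")), where the path is
whatever your system actually uses.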


My experience is that once spammers think they have "discovered" an
email address they will never leave it alone; they will send increasing
amounts of spam to that address.

If you are lucky enough to never have spammers trying to probe your
server, you can create your honeypot email addresses, just make them up,
and then take these email addresses and post them into the Unsubscribe
links on spam. That is a good way to contaminate spammers' mailing lists
with honeypot addresses.  A legitimate mail sender will ignore these; a
spammer will happily pull addresses out of unsubscribe replies.

That's your centralized spam source. Do this for a couple dozen
nonexistent email addresses on your server domains and you will have
all the input you want for the Bayes learner.

By definition ANY email to a nonexistent address (not an old address
that was closed down years ago) is unsolicited, AKA SPAM.

As for desired political mail, on my servers I classify all of it as
spam, I can think of maybe only 2 users over the last decade who have
complained about not getting it and for those it's easy to do an
all_spam_to to them and then tell them they will have to do their own
spam filtering.

Since overwhelmingly the political email I have seen coming in is the
offensive conservative anti-women, anti-blacks, anti-latinos, beg for
more money email, I have to say that I'm not particularly concerned
about the wishes of customers who WANT that kind of mail - I'm quite
happy if they go find another provider.

And, naturally, that kind of email is never ever appropriate for a
business and no employee in a business is ever going to dare complain to
their bosses that they aren't getting it.


If the politicos want to drown people in hate mail, they have paper
mail to do it - might as well make them help reduce my taxes by
subsidizing the US Post Office with their hate mail, that's about the
only thing that's good about it.

Anyway, as I said, HAM is the problem. If you don't have large
quantities of ham, Bayes won't work. Of course, nothing is preventing
you from copying people's folders (if they are using IMAP) into one
giant mailbox and using that as a HAM source.  You can probably assume
that if a user has gone to the trouble of saving mail to a folder that
it is ham.
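
A rough sketch of that folder-copying idea, assuming the IMAP server
stores mail as Maildir with one subfolder per IMAP folder; the function
name, the paths and the list of skipped folder names are assumptions for
illustration, not anything your server necessarily uses:

```python
import mailbox

# Folder names assumed NOT to hold ham; adjust to your users' clients.
SKIP_FOLDERS = frozenset({"Junk", "Spam", "Trash", "Drafts"})


def collect_ham(maildir_path, mbox_path, skip=SKIP_FOLDERS):
    """Copy mail from a user's saved folders into one mbox file.

    Only subfolders are read, on the assumption that mail a user
    filed by hand is ham. The top-level inbox is left alone.
    Returns the number of messages copied.
    """
    source = mailbox.Maildir(maildir_path, create=False)
    dest = mailbox.mbox(mbox_path)
    copied = 0
    try:
        for name in source.list_folders():
            if name in skip:
                continue
            for message in source.get_folder(name):
                dest.add(message)
                copied += 1
    finally:
        dest.close()  # close() also flushes pending writes to disk
    return copied
```

Run it once per user against the same mbox file and you end up with the
one giant mailbox, which can then be handed to the Bayes learner (for
example with sa-learn --ham --mbox, if that is how you train).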

Ted
