On 01/20/16 10:36, John Hardin wrote:
On Wed, 20 Jan 2016, Marc Perkel wrote:

So it still needs to be trained, at least initially, with a manually-vetted corpus. If not, how do you propose to do the initial classification of messages for training?

Do you envision it being self-training past that point? What if it goes off the rails? How would you keep it from going off the rails?

If it's not self-training then you have the same issues with the reliability of the people feeding the training corpus.

On my system I have a long list of good email sources that are 100% whitelisted, and I also have hackerbot traps that are 100% spam. I use these for training to keep it on the rails. Good question, though.
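Very roughly, the training-side selection amounts to something like this (a Python sketch only -- the message layout and names like whitelisted_senders / trap_addresses are illustrative, not the real code):

    def build_training_sets(messages, whitelisted_senders, trap_addresses):
        """Only train on mail we are 100% sure about: whitelisted
        senders count as ham, anything delivered to a hackerbot trap
        address counts as spam. Everything else is ignored."""
        ham, spam = [], []
        for msg in messages:
            sender = msg["from"].lower()
            recipients = {r.lower() for r in msg["to"]}
            if sender in whitelisted_senders:
                ham.append(msg)                   # 100% good source
            elif recipients & trap_addresses:
                spam.append(msg)                  # hit a hackerbot trap
            # ambiguous mail is never trained on, which is what keeps
            # the self-training from going off the rails
        return ham, spam

Ambiguous mail simply never gets trained on, so the only way it drifts is if the whitelist or the traps themselves go bad.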


So I'm not just tokenizing the subject; I'm also tokenizing the first 25 words of the message.

OK, good. I was thinking it would be *really* sensitive to "Bayes poisoning". Ignoring all but the first part of the body helps.

I assume you're only considering the portion that would render as visible to the recipient. Of course, that brings in all the logic regarding "what is visible to the recipient?" and all the HTML obfuscation we're already seeing to get around Bayes and "only scan the first part of the message".


Actually it's very insensitive to poisoning. Yes, a spammer might cancel out some good phrases every now and then, but since my system only matches phrases that are exclusive to one side, it's not as sensitive as Bayes. If they poison with the same phrases twice, I have them.
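To make that concrete, here's a stripped-down Python sketch of the idea -- subject plus the first 25 body words, broken into two-word phrases, with only phrases exclusive to one corpus ever scoring. The two-word phrase size and the pure-set approach are my simplifications for the sketch; the real system presumably tracks counts, which is why a poison phrase that keeps showing up in spam eventually flips to the spam side.

    import re

    def tokenize(msg):
        """Subject plus the first 25 words of the visible body,
        broken into two-word phrases (phrase length is a guess)."""
        words = re.findall(r"[a-z0-9']+", msg["subject"].lower())
        words += re.findall(r"[a-z0-9']+", msg["body"].lower())[:25]
        return {" ".join(pair) for pair in zip(words, words[1:])}

    def build_phrase_sets(ham_msgs, spam_msgs):
        """Keep only phrases exclusive to one side. A poison phrase
        lifted from ham just knocks that phrase out of both sets; it
        never pushes good mail toward spam."""
        ham_phrases = set().union(*(tokenize(m) for m in ham_msgs))
        spam_phrases = set().union(*(tokenize(m) for m in spam_msgs))
        return ham_phrases - spam_phrases, spam_phrases - ham_phrases

    def score(msg, ham_only, spam_only):
        """Count how many phrases in the message hit each exclusive set."""
        phrases = tokenize(msg)
        return len(phrases & spam_only), len(phrases & ham_only)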


--
Marc Perkel - Sales/Support
supp...@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400
