On 01/20/16 10:36, John Hardin wrote:
> On Wed, 20 Jan 2016, Marc Perkel wrote:
> So it still needs to be trained, at least initially, with a
> manually-vetted corpus. If not, how do you propose to do the initial
> classification of messages for training?
> Do you envision it being self-training past that point? What if it
> goes off the rails? How would you keep it from going off the rails?
> If it's not self-training then you have the same issues with the
> reliability of the people feeding the training corpus.
On my system I have a long list of good email sources that are 100%
whitelisted, and I also have hackerbot traps that are 100% spam. I use
these for training to keep it on the rails. Good question though.
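The idea of self-training only from messages with certain provenance could be sketched as follows. The function and set names here (select_training_label, the whitelist and trap sets) are hypothetical illustrations, not the actual junkemailfilter.com implementation:

```python
def select_training_label(sender, recipient, whitelist, traps):
    """Return 'ham', 'spam', or None (don't train) for a message.

    Only messages whose origin is certain are fed back into training,
    which is what keeps a self-training filter "on the rails".
    """
    if sender in whitelist:   # 100% trusted good sources -> train as ham
        return "ham"
    if recipient in traps:    # hackerbot/spamtrap hit -> train as spam
        return "spam"
    return None               # uncertain provenance: never self-train on it
```

Anything that is neither whitelisted nor a trap hit is classified normally but never used for training, so a misclassification can't feed back into the model.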
>> So I'm not just tokenizing the subject; I'm also tokenizing the
>> first 25 words of the message.
> OK, good. I was thinking it would be *really* sensitive to "Bayes
> poisoning". Ignoring all but the first part of the body helps.
> I assume you're only considering the portion that would render as
> visible to the recipient. Of course, that brings in all the logic
> regarding "what is visible to the recipient?" and all the HTML
> obfuscation we're already seeing to get around Bayes and "only scan
> the first part of the message".
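The "subject plus first 25 body words" tokenization could be sketched as below. This is a minimal illustration assuming simple word splitting; the real filter's handling of HTML and of what actually renders as visible is not shown:

```python
import re

def tokenize(subject, body, n_body_words=25):
    """Lowercase word tokens from the subject plus the first
    n_body_words words of the body; the rest of the body is ignored."""
    subject_words = re.findall(r"[a-z0-9']+", subject.lower())
    body_words = re.findall(r"[a-z0-9']+", body.lower())
    return subject_words + body_words[:n_body_words]
```

Because everything past the first 25 body words is discarded, random word salad appended to the end of a message never reaches the classifier.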
Actually it's very insensitive to poisoning. Yes, a spammer might cancel
out some good phrases every now and then, but since my system does NOT
do matching on one side it's not as sensitive as Bayes. If they poison
with the same phrases twice I have them.
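One way to read the phrase-matching claim is sketched below, under the assumption that only word pairs seen exclusively in one training corpus count as evidence, with shared pairs ignored. This is an interpretation for illustration, not the actual junkemailfilter.com implementation:

```python
def classify(tokens, ham_phrases, spam_phrases):
    """Vote using bigrams (word pairs) unique to exactly one corpus.

    A bigram that appears in both corpora carries no evidence, so a
    spammer injecting common "good" words gains little, and any poison
    phrase they reuse soon lands in the spam corpus itself.
    """
    bigrams = set(zip(tokens, tokens[1:]))
    ham_hits = len(bigrams & (ham_phrases - spam_phrases))
    spam_hits = len(bigrams & (spam_phrases - ham_phrases))
    if spam_hits > ham_hits:
        return "spam"
    if ham_hits > spam_hits:
        return "ham"
    return "unknown"
```

Unlike per-token Bayes, isolated good words don't subtract from the spam score here; they only matter if they form whole word pairs already seen in ham and never in spam.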
--
Marc Perkel - Sales/Support
supp...@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400