On 08/22/16 07:28, Dianne Skoll wrote:
> On Mon, 22 Aug 2016 07:16:41 -0700
> Marc Perkel <supp...@junkemailfilter.com> wrote:
>> Anthony, Yes - I don't store Set B. I store Set A. B is defined by
>> what's NOT in A. So I test A and if it's not matched it's set B. Set
>> B is just a negative match on A.
> Let me ask you a question. As far as I understand your algorithm, if
> an email contains at least one token in the "ham" set and zero tokens in
> the "spam" set, you classify it as ham. And conversely, if it contains
> at least one spam token but zero ham tokens, you classify it as spam.
YES! YES! YES!
Although I look at some thousand "fingerprints" to get a more
significant result.
> The other two possibilities (no tokens in either or some tokens in both)
> are undecidable.
Exactly!
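
For readers following the thread, the four cases agreed on above can be written as a minimal sketch. This is only an illustration of the rule as Dianne states it, not the actual filter code; the function name `classify` and the set names `ham_only` and `spam_only` are hypothetical.

```python
from typing import Iterable, Set


def classify(tokens: Iterable[str], ham_only: Set[str], spam_only: Set[str]) -> str:
    """Apply the rule from the thread to one message's tokens.

    ham_only and spam_only are assumed to be disjoint sets of tokens
    seen exclusively in ham or exclusively in spam (hypothetical names).
    """
    seen = set(tokens)
    has_ham = bool(seen & ham_only)    # at least one ham-only token
    has_spam = bool(seen & spam_only)  # at least one spam-only token

    if has_ham and not has_spam:
        return "ham"
    if has_spam and not has_ham:
        return "spam"
    # No tokens in either set, or tokens in both: the rule cannot decide.
    return "undecided"
```

A message whose tokens hit only `spam_only` comes back as "spam", one that hits both sets comes back as "undecided".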
> So. What percentage of emails using your algorithm are actually decidable?
Almost 100% if you look at a wide variety of tokens from multiple
attributes: subject, body, content flags, header structure, combinations
of all domains referenced, PHP scripts, the name part of From addresses,
and behavior flags.
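
A rough sketch of how tokens from several attributes could be pooled, and how the decidable share could be measured on a sample, is below. It assumes messages are plain dicts with `subject`, `body` and `from` keys and uses only three of the attributes listed above; the helper names `fingerprints` and `decidable_fraction` are made up for illustration and are not taken from the real filter.

```python
from typing import Dict, List, Set


def fingerprints(msg: Dict[str, str]) -> Set[str]:
    """Build attribute-tagged tokens so the same word in different
    places counts as a different fingerprint (assumed scheme)."""
    fp: Set[str] = set()
    for word in msg.get("subject", "").lower().split():
        fp.add("subject:" + word)
    for word in msg.get("body", "").lower().split():
        fp.add("body:" + word)
    sender = msg.get("from", "")
    if "@" in sender:
        # Name part of the From address, i.e. everything before the "@".
        fp.add("from-name:" + sender.split("@", 1)[0].lower())
    return fp


def decidable_fraction(msgs: List[Dict[str, str]],
                       ham_only: Set[str],
                       spam_only: Set[str]) -> float:
    """Share of messages that hit exactly one of the two sets."""
    if not msgs:
        return 0.0
    decided = 0
    for msg in msgs:
        fp = fingerprints(msg)
        has_ham = bool(fp & ham_only)
        has_spam = bool(fp & spam_only)
        if has_ham != has_spam:  # exactly one side matched
            decided += 1
    return decided / len(msgs)
```

Running `decidable_fraction` over a labelled sample would be one way to check the "almost 100%" figure against real mail; adding more attributes gives each message more chances to hit one of the sets.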
> Regards,
> Dianne.
--
Marc Perkel - Sales/Support
supp...@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400