Just to follow up on this: I'm still in the process of improving the filter, but I have filed my provisional patent, so I'm going to give you an overview of how it works.

Most spam filters work by matching things: matching ham and spam, matching rules. The important point is that this new system, which I'm calling the Evolution filter, is about NOT matching.

Suppose I sent you an email with the subject line "Let's get dinner". You can tell instantly this is good email. How? Because spammers never say "Let's get dinner".

There are millions of phrases used in good email every day that are never used in spam. And there are millions of phrases used every day in spam that are never used in good email. So if I get an email that matches phrases used in good email and never used in spam, it's a good message. And if the message contains words and phrases used in spam and never used in ham, it's spam.

So how do I get a list of all phrases never used in ham or never used in spam? I make a list of all words and phrases used in ham and another of all words and phrases used in spam, then test whether a given phrase is NOT in one of those lists. To illustrate my point:

Here is a list of 5505874 words and phrases used in the subject line of HAM and never seen in the subject line of SPAM

http://www.junkemailfilter.com/data/subject-ham.txt

Here is a list of 3494938 words and phrases used in the subject line of SPAM and never seen in the subject line of HAM

http://www.junkemailfilter.com/data/subject-spam.txt
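
As a rough illustration (not the actual implementation), lists like the two above could be built along the following lines. This is a minimal Python sketch; the function names, the lowercasing and whitespace tokenization, and the three-word phrase limit are all assumptions made for the example.

```python
def phrases(subject, max_words=3):
    """Return every run of 1..max_words consecutive words in a subject line.

    Lowercasing, whitespace splitting and max_words=3 are assumptions
    for this sketch, not details taken from the real filter.
    """
    words = subject.lower().split()
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_words + 1)
        for i in range(len(words) - n + 1)
    }


def build_exclusive_lists(ham_subjects, spam_subjects):
    """Return (ham_only, spam_only): phrases seen in one corpus and never in the other."""
    ham_seen = set().union(*(phrases(s) for s in ham_subjects))
    spam_seen = set().union(*(phrases(s) for s in spam_subjects))
    return ham_seen - spam_seen, spam_seen - ham_seen


# Tiny made-up corpora, just to show the shape of the result.
ham_only, spam_only = build_exclusive_lists(
    ["Let's get dinner", "Meeting notes for Monday"],
    ["Get cheap pills now", "Cheap pills for you"],
)
```

Here "get dinner" ends up in the ham-only list and "cheap pills" in the spam-only list, while "get" and "for" appear in neither because both corpora have used them.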

The thing about not matching is that matching involves finite sets. Not matching involves infinite sets. And infinite sets are always bigger than finite sets.

Here is a link to my patent.

http://www.junkemailfilter.com/patent/

What I intend to do is to give it away to the little guys and charge the big guys a small license fee. The process of implementing this is fairly easy. I'm hoping to encourage the open source world to take this idea and do it right. My code is cobbled together and uses 4 different languages. But the concept is enough to get you going.

One thing you will need to implement this is Redis. Redis is extremely fast at set comparisons, and set comparisons are how this works. It can be expressed as one formula.

score = card(SpamCorpus intersect TestMessage diff HamCorpus) - card(HamCorpus intersect TestMessage diff SpamCorpus)
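
To make the formula concrete, here is a minimal sketch using plain Python sets in place of Redis. The function name and the toy data are assumptions for illustration only. In Redis the same steps would presumably map onto the SINTER, SDIFF and SCARD commands. Reading a positive score as "leans spam" and a negative one as "leans ham" is an inference from the formula.

```python
def score(spam_corpus, ham_corpus, test_message):
    """card(Spam intersect Test diff Ham) - card(Ham intersect Test diff Spam).

    All three arguments are sets of words and phrases.
    """
    spam_hits = (spam_corpus & test_message) - ham_corpus
    ham_hits = (ham_corpus & test_message) - spam_corpus
    return len(spam_hits) - len(ham_hits)


# Toy corpora (made up). "get" and "for" occur in both, so they never count.
spam_corpus = {"get", "for", "cheap", "pills", "cheap pills"}
ham_corpus = {"get", "for", "let's", "dinner", "get dinner"}

dinner_score = score(spam_corpus, ham_corpus,
                     {"let's", "get", "dinner", "get dinner"})
pills_score = score(spam_corpus, ham_corpus,
                    {"get", "cheap", "pills", "cheap pills"})
```

With these toy sets, the "Let's get dinner" message scores -3 and the pills message scores 3.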

I'm seeing an accuracy level that is so close to 100% it's scary. It is especially good at actively identifying good email to prevent false positives.

I will post more soon as it all comes together.




_______________________________________________
mailop mailing list
mailop@mailop.org
https://chilli.nosignal.org/cgi-bin/mailman/listinfo/mailop
