On 01/20/16 10:44, Antony Stone wrote:
On Wednesday 20 January 2016 at 17:52:05, Marc Perkel wrote:

Suppose I get an email with the subject line "Let's get some lunch". I
know it's a good email because spammers never say "Let's get some lunch".
In fact, there are an infinite number of words and phrases that are used
in good email that are never, ever used in spam.
Surely this is going to change as soon as enough people implement your
filtering system - spammers will use legitimate phrases from ham, both in the
subject line and the body of their emails, and thereby get classified as ham?
Matching ham doesn't get you classified as ham; it's about not matching spam. Matching ham is neutral if spammers use it too.

At some point the spammer wants you to do something, and if they imitate ham perfectly then they don't have a message and it's no longer spam. (Except that I tokenize behavior as well.)

So, you're identifying ham by checking that it does not contain words or
phrases which you have previously seen in spam...

Sounds very much like Bayes to me.

Bayes compares the new email to what's inside the ham and spam boxes. What I do is compare what's inside the box on one side and what's outside the box on the other. Bayes is about matching; the Evolution filter is about NOT matching.


What I do is tokenize the spammiest parts of the email, like the subject
line
How do you identify "the spammiest parts" of an email?

The subject line, the first few words of the email, the header structure, behavior, file extensions of attached files, the name part of the From address, and the text inside links.
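
A rough sketch in Python of what that tokenizing could look like (not my actual code, which is bash/Pascal/PHP; the token prefixes and the five-word subject cutoff are just made up for the example):

import email
import email.utils
import os
import re

def tokenize(raw_message: str) -> set[str]:
    msg = email.message_from_string(raw_message)
    tokens = set()

    # Subject line: the first few words, lowercased.
    subject = msg.get("Subject", "")
    tokens.update("subj:" + w.lower() for w in subject.split()[:5])

    # Name part of the From address.
    name, _addr = email.utils.parseaddr(msg.get("From", ""))
    if name:
        tokens.add("from:" + name.lower())

    # File extensions of attached files.
    for part in msg.walk():
        filename = part.get_filename()
        if filename:
            tokens.add("ext:" + os.path.splitext(filename)[1].lower())

    # Text inside links (very rough HTML handling, for illustration only).
    body = msg.get_payload(decode=False)
    if isinstance(body, str):
        for link_text in re.findall(r"<a[^>]*>(.*?)</a>", body, re.I | re.S):
            tokens.update("link:" + w.lower() for w in link_text.split())

    return tokens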


I'd like to see SA implement this.
I'm not going to share my code because my code is specific to my system and
it's a combination of bash scripts, Redis, Pascal, PHP, and Exim rules. And
the open source programmers are likely to implement it better than I have.
Given that you have *some* source code, no matter how bad / buggy / specific it
is, I think you'll get much greater take-up (and also comprehension of exactly
what your technique is) if you at least publish that and invite people to
improve on it, rather than say "here's a method idea - you guys code it".

The heart of the code is what I do with Redis. It's just set operations.

Intersect the message tokens with Ham, diff out Spam, and you have the ham matches.
Intersect the message tokens with Spam, diff out Ham, and you have the spam matches.

Count the lines in each result, subtract one from the other, and you have a score.
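
With redis-py that boils down to something like this (again just a sketch, not the Evolution filter itself; the "ham", "spam" and "msg" key names, the tokenize() helper, and the sign of the score are my own placeholders - the real ham and spam sets are built from known-good and known-bad mail):

import redis

r = redis.Redis()

def score(raw_message: str) -> int:
    # Load the message's tokens into a temporary set.
    tokens = tokenize(raw_message)   # e.g. the sketch earlier in this thread
    r.delete("msg")
    if tokens:
        r.sadd("msg", *tokens)

    # Ham-only and spam-only vocabularies: tokens seen in one corpus, never the other.
    r.sdiffstore("ham_only", ["ham", "spam"])
    r.sdiffstore("spam_only", ["spam", "ham"])

    # Intersect the message with each to get the ham matches and the spam matches.
    ham_matches = r.sinter(["msg", "ham_only"])
    spam_matches = r.sinter(["msg", "spam_only"])

    # Count each result and subtract; here negative leans ham, positive leans spam.
    return len(spam_matches) - len(ham_matches)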



I'm seeing close to 100% accuracy.
1. How close?
Fewer than 10 mistakes a day while filtering 5000 domains.


2. On what volume of email?

1.3 million good emails last week.


3. What proportion of spam / ham?

About 10 spam to one ham. But I have a spam-baiting system, so I get more spam than normal.


4. What % false positives / negatives?
It's especially good at identifying ham.


5. How many different domains' email are you feeding in to it?

6. How long have you been testing it (ie: how much have you seen of how it
adapts to new spam patterns)?

About 4 weeks now.


--
Marc Perkel - Sales/Support
supp...@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400
