On Wednesday 21 January 2004 12:09 am, John August wrote:
> This is just an idea, in the tradition of 'I've got a good idea and hope
> someone else will carry it through'. I don't expect it, but thought I'd
> throw it in :)
>
> I've noticed a lot of spam which tries to dilute scanners by including a
> lot of strings of random characters put together as words, or real words
> strung together.
>
> For example:
>
> ibkcd ngf dfvfjq
>
> While generated randomly, this has some unphonemic bits: 'fv', 'jq', etc.
> Even though it's generated randomly, known unphonemic two-letter sequences
> must turn up quite frequently (while you wouldn't actually trigger on
> whole random words such as 'dfvfjq').
>
> Presumably, Bayesian approaches might pick these up automatically, but
> an intelligent approach would probably be more efficient. Perhaps
> even using ideas about phonemes and how they fit together (which I'm not
> familiar with).

SA already has a list of all letter triplets found in a common list of English
words.  It's currently used to detect gibberish words at the end of subjects
(the kind added as unique IDs), but I suppose it could also be used to detect
anti-Bayesian gibberish.
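
To make the triplet idea concrete, here is a rough sketch in Python (not SA's
actual Perl code or data). The tiny word list, the function names and the 0.5
threshold are all made up for illustration; a real list would come from a
proper dictionary.

```python
def trigrams(word):
    """Return the set of 3-letter sequences in a word (lowercased)."""
    w = word.lower()
    return {w[i:i + 3] for i in range(len(w) - 2)}


def build_known_trigrams(words):
    """Collect every trigram seen in a list of real words."""
    known = set()
    for w in words:
        known |= trigrams(w)
    return known


def unknown_fraction(token, known):
    """Fraction of a token's trigrams that never occur in the word list."""
    tris = trigrams(token)
    if not tris:
        # Tokens shorter than 3 letters carry no evidence either way.
        return 0.0
    return sum(1 for t in tris if t not in known) / len(tris)


def looks_like_gibberish(token, known, threshold=0.5):
    """Flag a token when at least `threshold` of its trigrams are unseen.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    return unknown_fraction(token, known) >= threshold


# Hypothetical stand-in for "a common list of English words".
SAMPLE_WORDS = ["spam", "filter", "message", "random", "letter",
                "english", "words", "detect", "subject"]
KNOWN = build_known_trigrams(SAMPLE_WORDS)

if __name__ == "__main__":
    for token in ["ibkcd", "ngf", "dfvfjq", "filter"]:
        print(token, looks_like_gibberish(token, KNOWN))
```

With a realistic word list, a message could be scored by the share of its
tokens that get flagged, rather than by any single token.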

-- 
Give a man a match, and he'll be warm for a minute, but set him on
fire, and he'll be warm for the rest of his life.

Advanced SPAM filtering software: http://spamassassin.org



_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
