On Wednesday 21 January 2004 12:09 am, John August wrote:
> This is just an idea, in the tradition of 'I've got a good idea and hope
> someone else will carry it through'. I don't expect it, but thought I'd
> throw it in :)
>
> I've noticed a lot of spam which tries to dilute scanners by including a
> lot of strings of random characters put together as words, or real words
> strung together.
>
> for example :
>
> ibkcd ngf dfvfjq
>
> While generated randomly, this has some unphonemic bits : 'fv' , 'jq' etc;
> even though it's generated randomly, known unphonemic two-letter sequences
> must turn up quite frequently (while you wouldn't actually trigger from
> whole random words such as 'dfvfjq').
>
> Presumably, Bayesian approaches might pick these up automatically, but
> an intelligent approach would probably be more efficient. Perhaps
> even using ideas about phonemes and how they fit together (which I'm not
> familiar with).

SA already has a list of all letter triplets found in a common list of
English words. It's currently used to detect the gibberish words that get
tacked onto the end of subjects as unique IDs, but I suppose it could also
be used to detect anti-Bayesian gibberish.
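
To make the triplet idea concrete, here is a rough sketch in Python. It is
not SpamAssassin's actual code (SA is written in Perl and its real triplet
list is not reproduced here); the tiny word list, the function names, and
the 0.5 threshold are all assumptions made up for illustration. It builds
the set of letter triplets seen in known words, then flags any word where
at least half of its triplets never appear in that set.

```python
import re

# Hypothetical stand-in for a real dictionary of common English words.
SAMPLE_WORDS = [
    "the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog",
    "message", "filter", "random", "letter", "english", "subject",
]


def build_triplets(words):
    """Collect every three-letter sequence that occurs in the given words."""
    triplets = set()
    for word in words:
        w = word.lower()
        for i in range(len(w) - 2):
            triplets.add(w[i:i + 3])
    return triplets


def unknown_ratio(word, triplets):
    """Return the fraction of a word's triplets missing from the known set."""
    w = word.lower()
    grams = [w[i:i + 3] for i in range(len(w) - 2)]
    if not grams:
        # Words shorter than three letters have no triplets to judge.
        return 0.0
    missing = sum(1 for g in grams if g not in triplets)
    return missing / len(grams)


def gibberish_words(text, triplets, threshold=0.5):
    """List the alphabetic words (3+ letters) whose unknown ratio meets the threshold."""
    words = re.findall(r"[A-Za-z]{3,}", text)
    return [w for w in words if unknown_ratio(w, triplets) >= threshold]


KNOWN_TRIPLETS = build_triplets(SAMPLE_WORDS)

if __name__ == "__main__":
    print(gibberish_words("ibkcd ngf dfvfjq", KNOWN_TRIPLETS))
    print(gibberish_words("the quick brown fox", KNOWN_TRIPLETS))
```

With a word list this small, plenty of ordinary words would be flagged too,
so a real test would need a full dictionary and a threshold tuned against
actual ham and spam.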