> I'm going to propose another great idea which will probably
> radically change spam-detection techniques.
>
> No, come on: I'm just kidding. :) I think this "idea" could
> eventually help in better detecting the kind of spam in which
> some words are "garbled" in order to evade detection.
>
> Some of you probably already know that there are algorithms
> devoted to detecting the language in which a text is written.
> I just discovered the paper at
> http://www.sfs.uni-tuebingen.de/iscl/Theses/kranig.pdf ,
> which by the way says that such detectors are already
> available as Perl modules on CPAN (see chapter 7).
>
> The idea is that, by applying these algorithms to the text of a
> message, one could obtain the probability that the text is
> written in a given language. Let's say that a text is written
> in English; then these Perl routines should yield a high
> probability that the text is English. Now, say that some of
> the words in that text are somehow "scrambled". The language
> detectors would probably decrease the probability that the
> text is in English but, assuming the words are randomly
> scrambled, the probability that the text is in another
> language wouldn't increase either. Now, we could apply some
> thresholding to the language scores: when the score of the
> most probable language does not exceed the mean of the
> language scores by at least a given threshold, we could say
> that the message contains some "scrambled words" and apply a
> penalty score to it.
>
> I know there are scores for scrambled versions of words like
> "cialis", but this method would be more robust with respect to
> non-English languages: I'm from Italy, and I'm used to seeing
> some FPs on Italian phrases like "via galileo" being taken as
> a scrambled version of "viagra". Also, attempting to collect
> all the good versions of spam words is expensive in terms of
> effort.
>
> Please note that:
>
> - language detection doesn't (currently) work for ideographic
>   languages (Chinese, Japanese, Korean and such);
>
> - I haven't even tried running the language detection modules;
>
> - a message written in many (> 3, 4?) languages would probably
>   trigger the penalty score.
>
> I'm just trying to see whether such an idea seems definitely
> "broken" to you, and whether anybody has already tried
> something like this.

What happens with computer lingo and things like URLs that aren't
really language? I guess the idea would be to write it and see what
such a rule would hit.

Bret
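
For anyone who wants to try it, the thresholding step from the quoted proposal might look roughly like the sketch below. It is only an illustration: the function name `looks_scrambled`, the `margin` value and the example scores are all made up here, and the per-language scores are assumed to come from whatever language detector you plug in (none is called in this snippet).

```python
from statistics import mean


def looks_scrambled(scores, margin=0.2):
    """Return True when no language stands out clearly from the rest.

    scores: mapping of language name -> detector score (assumed to be
            produced by some external language detector, not shown).
    margin: hypothetical threshold; how far the best score must sit
            above the mean of all language scores.
    """
    if len(scores) < 2:
        raise ValueError("need scores for at least two languages")
    best = max(scores.values())
    # Penalize when the top language is not far enough above the mean.
    return best - mean(scores.values()) < margin


# Invented example scores, for illustration only.
clean_text = {"english": 0.90, "italian": 0.05, "german": 0.05}
garbled_text = {"english": 0.40, "italian": 0.32, "german": 0.28}

clean_flag = looks_scrambled(clean_text)
garbled_flag = looks_scrambled(garbled_text)
```

Running something like this over a corpus that includes URLs and technical jargon would be one way to see what the rule actually hits; the margin would need tuning against real mail.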