> I'm going to propose another great idea which will 
> probably radically change spam-detection techniques.
>       
> No, come on: I'm just kidding. :) I think this "idea" could 
> possibly help in better detecting the kind of spam in which 
> some words are "garbled" in order to evade detection.
> 
> Some of you probably already know that there exist 
> algorithms devoted to detecting the language in which a text 
> is written. I just discovered the paper at 
> http://www.sfs.uni-tuebingen.de/iscl/Theses/kranig.pdf , 
> which by the way says that such detectors are already 
> available as Perl modules on CPAN (see chapter 7).
> 
> The idea is that, by applying these algorithms to the text of a 
> message, one could obtain the probability that the text is 
> written in a given language. Let's say that a text is written 
> in English; then these Perl routines should yield a high 
> probability that the text is English. Now, say that some of 
> the words in that text are somehow "scrambled". The language 
> detectors would probably decrease the probability that the 
> text is in English but, assuming the words are randomly 
> scrambled, the probability that the text is in another 
> language wouldn't increase either. Now, we could apply some 
> thresholding to the language scores: when the score of the 
> most probable language is less than a given threshold above 
> the mean of the language scores, we could say that the 
> message contains some "scrambled words" and apply a penalty 
> score to it.
> 
> I know there are scores for scrambled versions of words like 
> "cialis", but this method would be more robust with respect to 
> non-English languages: I'm from Italy, and I'm used to seeing 
> some FPs on Italian words like "via galileo" being taken as a 
> scrambled version of "viagra". Also, attempting to collect 
> all the scrambled versions of spam words is expensive in terms 
> of effort.
> 
> Please note that:
> 
>  - language detection doesn't (currently) work for ideographic 
> languages (Chinese, Japanese, Korean and such);
> 
>  - I haven't even tried running the language detection modules;
> 
>  - a message written in many (> 3, 4?) languages would probably 
> trigger the penalty score.
> 
> I'm just trying to see whether such an idea seems definitely 
> "broken" to you, and whether anybody has already tried 
> something like this.

What happens with computer lingo and things like URLs that aren't really
language? I guess the idea would be to write it and see what such a rule
would hit.
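
For what it's worth, the thresholding step itself is easy to try out.
Below is a minimal Python sketch, not a tested rule. It assumes some
language detector has already produced a score per language (the
`scores` dict is a hypothetical input, no particular CPAN module or
other library is implied), and the `margin` and `penalty` defaults are
made-up values. The `strip_urls` helper is one possible way to keep
URLs out of the text before scoring; it does nothing about computer
lingo in general.

```python
import re
from statistics import mean

# Rough URL pattern; an assumption for illustration, not a full URL grammar.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def strip_urls(text):
    """Replace URL-like tokens with a space before language scoring."""
    return URL_RE.sub(" ", text)


def scramble_penalty(scores, margin=0.2, penalty=1.0):
    """Return `penalty` when the best language score does not stand out.

    `scores` maps a language name to a score in [0, 1], as produced by
    some language detector (hypothetical input). The rule fires when the
    top score is less than `margin` above the mean of all scores.
    """
    # With fewer than two languages the mean equals the top score,
    # so the comparison would be meaningless; do not penalize.
    if len(scores) < 2:
        return 0.0
    best = max(scores.values())
    avg = mean(scores.values())
    return penalty if best - avg < margin else 0.0


# Illustrative, made-up scores: one clear winner vs. no clear winner.
clean = {"en": 0.9, "it": 0.1, "de": 0.1}
garbled = {"en": 0.35, "it": 0.3, "de": 0.25}
print(scramble_penalty(clean), scramble_penalty(garbled))
```

Running something like this over a corpus of known ham and spam would
show how often the rule fires on legitimate mail.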

Bret
