From: "Jay Sekora [via SpamAssassin]" 

I forgot to comment on this:

> Seems like just normalizing them to U+NNNN might be better than 
> trying to transcribe them.  (And that would let a brave or foolhardy 
> mail administrator write rules to match patterns seen in, say, 
> Chinese-language spam even without knowing Chinese, or even without 
> knowing what language the spam was in.)

You can already do that with UTF-8 normalization. You can write your rules 
directly in Unicode characters, or in numeric codes if you prefer; Perl 
regular expressions accept both without any problems. So you do not need any 
ASCII normalization for that; it already works well with UTF-8 normalization.
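
To illustrate the point about literal characters versus numeric codes, here 
is a minimal sketch. It uses Python's re module rather than Perl, only so the 
example is easy to run on its own; in a Perl regex the numeric form would be 
written as \x{53D1}\x{7968}. The two-character pattern and the sample subject 
line are invented for the example, not taken from any real ruleset.

```python
import re

# The same pattern written two ways: literal characters and \uNNNN escapes.
literal_rule = re.compile("发票")
escaped_rule = re.compile(r"\u53d1\u7968")

# Invented sample text, assumed to be already decoded to Unicode.
subject = "代开发票"


def hits(rule, text):
    """Return True if the compiled rule matches anywhere in the text."""
    return rule.search(text) is not None


print(hits(literal_rule, subject), hits(escaped_rule, subject))
```

Both forms compile to the same pattern, so whoever writes the rule does not 
need to be able to type (or read) the characters.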

Unfortunately, UTF-8 normalization does nothing to solve the problem of the 
different ways people use (or omit) diacritics. In any European language 
(except English and, to a certain extent, Dutch), people can write the same 
word in many variants, with or without diacritics. This makes rule 
development quite difficult. You cannot write a simple rule to match a single 
word. You either have to use plenty of wildcards or put all the variants into 
the regex as alternatives.
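
A small sketch of the difference, again in Python for easy testing. The Czech 
word and its partly stripped variants are only sample data, and 
strip_diacritics is a hypothetical helper written for this example (it is not 
a SpamAssassin function). It decomposes each character and drops the 
combining marks, which is one simple way to fold such variants together.

```python
import re
import unicodedata


def strip_diacritics(text):
    """Decompose to NFD, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# One word, written with all, none, or only some of its diacritics.
variants = ["příliš", "prilis", "přílis", "priliš"]

# Without folding, every affected letter needs its own character class.
raw_rule = re.compile(r"p[rř][ií]l[ií][sš]")

# After folding, a plain literal is enough.
folded_rule = re.compile(r"prilis")

for word in variants:
    print(word, bool(raw_rule.fullmatch(word)),
          bool(folded_rule.fullmatch(strip_diacritics(word))))
```

The character-class rule works, but it has to be written by hand for every 
word, and it grows quickly once more letters and more variants are involved.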

UTF-8 normalization also does not help against text obfuscation through 
diacritics or visually similar Unicode characters. Each Latin character has 
many versions with various diacritics, as well as similar-looking Latin or 
non-Latin characters. I have not compiled exact statistics, but there may 
easily be 20 or more variants of each Latin character. With 20 variants per 
character, you would have over 3 million combinations (20^5 = 3,200,000) for 
a single 5-letter word. Without some rather aggressive reduction of the 
variants (such as 7-bit US-ASCII normalization), you would have a hard time 
writing rules to match those obfuscations.
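
The arithmetic, and the kind of reduction meant here, can be sketched as 
follows. The figure of 20 variants is the rough guess from above, not a 
measured value. The LOOKALIKES table is a tiny hand-made assumption covering 
a few Cyrillic letters; a real table would need to be far larger. 
fold_to_ascii is a hypothetical helper, not an existing SpamAssassin feature.

```python
import unicodedata

# Rough estimate: 20 visual variants per letter, 5-letter word.
VARIANTS_PER_LETTER = 20
WORD_LENGTH = 5
combinations = VARIANTS_PER_LETTER ** WORD_LENGTH  # 3,200,000

# Hand-made sample of look-alike characters mapped to ASCII (assumption).
LOOKALIKES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0456": "i",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
}


def fold_to_ascii(text):
    """Map known look-alikes to ASCII, then strip combining marks."""
    mapped = "".join(LOOKALIKES.get(ch, ch) for ch in text)
    decomposed = unicodedata.normalize("NFD", mapped)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# "Viagra" with a Cyrillic i, an accented a and a Cyrillic a.
obfuscated = "V\u0456\u00e1gr\u0430"
print(combinations, fold_to_ascii(obfuscated))
```

With this kind of folding applied first, one plain ASCII rule covers every 
obfuscated spelling that the table and the decomposition can reduce.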

--
View this message in context: 
http://spamassassin.1065346.n5.nabble.com/Current-best-practices-around-normalize-charset-tp105840p108558.html
Sent from the SpamAssassin - Users mailing list archive at Nabble.com.