From: "Jay Sekora [via SpamAssassin]"

I forgot to comment on this:
> Seems like just normalizing them to U+NNNN might be better than
> trying to transcribe them. (And that would let a brave or foolhardy
> mail administrator write rules to match patterns seen in, say,
> Chinese-language spam even without knowing Chinese, or even without
> knowing what language the spam was in.)

You can already do that with the UTF-8 normalizing. You can write your rules directly in Unicode characters, or in numeric codes if you prefer; Perl regular expressions accept both without any problems. So you do not need any ASCII normalizing for that: it already works well with the UTF-8 normalizing.

Unfortunately, the UTF-8 normalizing does nothing to solve the problem of the different ways people use (or do not use) diacritics. In any European language (except English and, to a certain extent, Dutch), people write the same word in many variants, with or without diacritics. This makes rule development quite difficult. You cannot write a simple rule to match a single word. You either have to use plenty of wildcards or list all the variants as alternatives (joined with |) in the regex.

The UTF-8 normalizing also does not help with text obfuscation through diacritics or through visually similar Unicode characters. Each Latin character has many versions with various diacritics, as well as similar-looking Latin or non-Latin characters. I have not compiled exact statistics, but there may easily be 20 or perhaps even more variants of each Latin character. If there were 20 variants for each character, you would have over 3 million combinations (20^5 = 3,200,000) for every single 5-letter word. Without some kind of rather aggressive reduction of the variants (such as the 7-bit US-ASCII normalizing), you would have a hard time writing rules to match those obfuscations.

-- 
View this message in context: http://spamassassin.1065346.n5.nabble.com/Current-best-practices-around-normalize-charset-tp105840p108558.html
Sent from the SpamAssassin - Users mailing list archive at Nabble.com.
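
To make the arithmetic and the folding idea above concrete, here is a minimal sketch in Python. It is not SpamAssassin code and not how normalize_charset is implemented. The helper names (fold_to_ascii, variant_count), the sample word, and the sample rule are all hypothetical, chosen only for illustration. The sketch assumes that "ASCII normalizing" means decomposing each character and dropping everything that is not 7-bit ASCII.

```python
import re
import unicodedata


def fold_to_ascii(text: str) -> str:
    """Hypothetical helper: reduce text to 7-bit ASCII.

    NFKD splits an accented letter into its base letter plus combining
    marks; encoding with errors="ignore" then drops the marks (and any
    other character that has no ASCII form).
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def variant_count(variants_per_char: int, word_length: int) -> int:
    """Number of spellings if every position has the same number of variants."""
    return variants_per_char ** word_length


# Hypothetical single-word rule, written against plain ASCII only.
rule = re.compile(r"\bviagra\b", re.IGNORECASE)

# "Vïágrà" written with escapes so the source file encoding does not matter.
obfuscated = "V\u00ef\u00e1gr\u00e0"

if __name__ == "__main__":
    print(variant_count(20, 5))                        # 20 ** 5
    print(bool(rule.search(obfuscated)))               # raw text
    print(bool(rule.search(fold_to_ascii(obfuscated))))  # folded text
```

With 20 variants per letter, a 5-letter word has 20 ** 5 = 3,200,000 possible spellings, which is why a single ASCII rule applied after folding is easier to maintain than a regex enumerating the variants. Note that this kind of folding only handles characters that decompose to an ASCII base letter. Look-alike characters from other scripts (for example a Cyrillic "а") have no such decomposition and are simply dropped here, so handling them would need an explicit mapping table.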