Hello.

From: Motoharu Kubo <[EMAIL PROTECTED]>
Subject: I18n and l10n (Re: Charset normalization issue (report, patch, and 
request))
Date: Sun, 15 Jan 2006 11:00:11 +0900

> I changed Subject.

Justin-san, John-san, and Motoharu-san, thanks a lot.

I think SA's message processing flow is:

raw mail -> [full] -> header part -> mime decoding -> [header] -+
              |(splitting)                                      |
              +-> body part -> mime decoding -> [rawbody] -+    |
                                                           |    |
              +--------------------------------------------+    |
              |                                                 V
              +-> converting html -> [body] ------------------->+
                  to plain text                                 |
              +-------------------------------------------------+
              |
              +-> tokenization -> [bayes]

# Please view this diagram with a fixed-width font.

Is the above flow diagram correct?
And John-san's and Motoharu-san's patches change the lower part as follows:

                                                           |    |
              +--------------------------------------------+    |
              |                           (NEW!)                V
              +-> converting html -> UTF-8 character -> [body]->+
                  to plain text        normalization            |
              +-------------------------------------------------+
              |      (NEW!)
              +-> tokenization -> [bayes]
                   by Mecab

The character normalization step is inserted before [body] testing.
So, with Motoharu-san's patch, rules that match Japanese characters
can be written directly in user_prefs.
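
To illustrate why normalization makes such rules possible, here is a
minimal Python sketch (not SpamAssassin code). The helper
normalize_to_utf8(), the sample sentence, and the keyword are all
hypothetical, and the sketch assumes the charset of each body is already
known, for example from the MIME header. It only shows that one UTF-8
pattern can match text that arrived in different Japanese encodings once
everything is converted to UTF-8.

```python
import re


def normalize_to_utf8(raw: bytes, charset: str) -> bytes:
    """Hypothetical stand-in for the normalization step: decode, then re-encode as UTF-8."""
    return raw.decode(charset, errors="replace").encode("utf-8")


# Hypothetical rule keyword ("free of charge"), written once, in UTF-8.
keyword = "無料"
pattern = re.compile(re.escape(keyword.encode("utf-8")))

# The same sentence as it might arrive in three common Japanese encodings.
sample = "これは無料です"
bodies = {cs: sample.encode(cs) for cs in ("shift_jis", "euc_jp", "iso2022_jp")}

# Without normalization, the UTF-8 pattern is tested against foreign byte sequences.
matches_raw = {cs: bool(pattern.search(raw)) for cs, raw in bodies.items()}

# With normalization, every body is UTF-8 before the pattern is applied.
matches_normalized = {
    cs: bool(pattern.search(normalize_to_utf8(raw, cs))) for cs, raw in bodies.items()
}

if __name__ == "__main__":
    print("raw:       ", matches_raw)
    print("normalized:", matches_normalized)
```

In this sketch the pattern misses all three raw bodies and hits all three
normalized ones, which is the effect described above.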

BTW, I wrote the following rules to detect the character encoding:

body UTF8 /(([\xe0-\xef][\x80-\xbf][\x80-\xbf])(?!([\x81-\x9f\xe0-\xfc][\x40-\x7e\xc0-\xfc]|[\x81-\x9f\xf0-\xfc][\x40-\x7e\xc0-\xfc]|[\xc0-\xfc][\x40-\x7e\xc0-\xfc]))){5,}/

body SJIS_C /(([\x81-\x9f\xe0-\xfc][\x40-\x7e\x80-\xfc])(?!([\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf][\x80-\xbf]|[\xa1-\xfe][\xa1-\xfe]))){7,}/

Many Japanese spams are written in the Shift-JIS encoding,
so a Shift-JIS detection rule is convenient.

But if character normalization is inserted before body testing,
my rules will no longer work, because [body] will always be UTF-8.
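
The concern can be reproduced outside SpamAssassin with a small Python
sketch. It copies the two regular expressions from the UTF8 and SJIS_C
rules into Python's re module, which is an assumption: SA uses Perl
regexes, and its real [body] rendering does more than this. The sample
text and the decode/encode "normalization" are hypothetical stand-ins
for the patched pipeline.

```python
import re

# The two patterns from the rules above, as Python bytes regexes.
UTF8 = re.compile(
    rb"(([\xe0-\xef][\x80-\xbf][\x80-\xbf])"
    rb"(?!([\x81-\x9f\xe0-\xfc][\x40-\x7e\xc0-\xfc]"
    rb"|[\x81-\x9f\xf0-\xfc][\x40-\x7e\xc0-\xfc]"
    rb"|[\xc0-\xfc][\x40-\x7e\xc0-\xfc]))){5,}"
)
SJIS_C = re.compile(
    rb"(([\x81-\x9f\xe0-\xfc][\x40-\x7e\x80-\xfc])"
    rb"(?!([\xc0-\xdf][\x80-\xbf]"
    rb"|[\xe0-\xef][\x80-\xbf][\x80-\xbf]"
    rb"|[\xa1-\xfe][\xa1-\xfe]))){7,}"
)


def rule_hits(body: bytes) -> dict:
    """Report which of the two encoding-detection rules fire on a body."""
    return {"UTF8": bool(UTF8.search(body)), "SJIS_C": bool(SJIS_C.search(body))}


# Hypothetical spam body: nine hiragana characters sent as Shift-JIS.
sjis_body = "こんにちはみなさん".encode("shift_jis")

# Stand-in for the new normalization step: everything becomes UTF-8.
normalized_body = sjis_body.decode("shift_jis").encode("utf-8")

hits_before = rule_hits(sjis_body)
hits_after = rule_hits(normalized_body)

if __name__ == "__main__":
    print("before normalization:", hits_before)
    print("after normalization: ", hits_after)
```

For this sample, SJIS_C fires only on the original bytes and UTF8 fires
only on the normalized bytes, so the original encoding is no longer
visible to a rule that runs after normalization.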

Do I have to rewrite the above two rules from [body] to [rawbody]?
--
Matsuda Yoh-ich(yoh)
mailto:[EMAIL PROTECTED]
http://www.flcl.org/~yoh/diary/
