Hello.
From: Motoharu Kubo <[EMAIL PROTECTED]>
Subject: I18n and l10n (Re: Charset normalization issue (report, patch, and
request))
Date: Sun, 15 Jan 2006 11:00:11 +0900
> I changed Subject.
Justin-san, John-san, and Motoharu-san, thanks a lot.
I think SA's message processing is:
raw mail -> [full] -> header part -> mime decoding -> [header] -+
              |(splitting)                                      |
              +-> body part -> mime decoding -> [rawbody] -+    |
                                                           |    |
              +--------------------------------------------+    |
              |                                                 V
              +-> converting html -> [body] ------------------->+
                  to plain text                                 |
              +-------------------------------------------------+
              |
              +-> tokenization -> [bayes]
# Please view this with a fixed-width font.
Is the above flow diagram correct?
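
To make the drawing concrete, here is a rough Python sketch of the same flow as I understand it. It is not SpamAssassin code: the function name `stages`, the dictionary keys, the crude tag stripping, and the whitespace tokenizer are all my own assumptions for illustration only.

```python
import base64
import email
import re
from email.header import decode_header, make_header


def stages(raw: bytes) -> dict:
    """Return the intermediate forms of one message (illustrative names)."""
    msg = email.message_from_bytes(raw)

    # [header]: header fields after MIME (RFC 2047) decoding
    header = {
        name: str(make_header(decode_header(value)))
        for name, value in msg.items()
    }

    # [rawbody]: text parts after base64 / quoted-printable decoding,
    # still in the original charset and still containing HTML markup
    rawbody = [
        part.get_payload(decode=True)
        for part in msg.walk()
        if part.get_content_maintype() == "text"
    ]

    # [body]: HTML converted to plain text (assumption: crude tag stripping)
    body = [re.sub(rb"<[^>]+>", b" ", chunk) for chunk in rawbody]

    # [bayes]: tokens from [header] and [body] (assumption: whitespace split)
    tokens = (
        " ".join(header.values()).encode("utf-8").split()
        + b" ".join(body).split()
    )

    return {"full": raw, "header": header, "rawbody": rawbody,
            "body": body, "tokens": tokens}


# A tiny hand-made message, for demonstration only.
SAMPLE = (
    b"Subject: =?UTF-8?B?" + base64.b64encode("テスト".encode("utf-8")) + b"?=\r\n"
    b"Content-Type: text/html; charset=UTF-8\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    + base64.b64encode(b"<p>hello world</p>") + b"\r\n"
)

result = stages(SAMPLE)
print(result["header"]["Subject"], result["rawbody"], result["tokens"][-2:])
```

The point is only that [rawbody] still carries the original bytes and markup, while [body] is what is left after rendering.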
And I understand that John-san's and Motoharu-san's patches change it to:
                                                           |    |
              +--------------------------------------------+    |
              |                      (NEW!)                     V
              +-> converting html -> UTF-8 character -> [body]->+
                  to plain text      normalization              |
              +-------------------------------------------------+
              |   (NEW!)
              +-> tokenization -> [bayes]
                  by MeCab
A character normalization step is inserted before [body] testing.
So, with Motoharu-san's patch, rules that match Japanese characters can be
written directly in user_prefs.
BTW, I wrote rules that detect the character encoding:
body UTF8 /(([\xe0-\xef][\x80-\xbf][\x80-\xbf])(?!([\x81-\x9f\xe0-\xfc][\x40-\x7e\xc0-\xfc]|[\x81-\x9f\xf0-\xfc][\x40-\x7e\xc0-\xfc]|[\xc0-\xfc][\x40-\x7e\xc0-\xfc]))){5,}/
body SJIS_C /(([\x81-\x9f\xe0-\xfc][\x40-\x7e\x80-\xfc])(?!([\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf][\x80-\xbf]|[\xa1-\xfe][\xa1-\xfe]))){7,}/
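
To show what the two rules do, here is a small Python sketch that applies the same two patterns to raw bytes. The patterns are copied from the rules above. The helper name `detect` and the sample sentence are my own assumptions, and Python's `re` module only stands in for the Perl regex engine that SA actually uses.

```python
import re

# Patterns copied from the UTF8 and SJIS_C rules, split only for line length.
RULES = {
    "UTF8": re.compile(
        rb"(([\xe0-\xef][\x80-\xbf][\x80-\xbf])"
        rb"(?!([\x81-\x9f\xe0-\xfc][\x40-\x7e\xc0-\xfc]"
        rb"|[\x81-\x9f\xf0-\xfc][\x40-\x7e\xc0-\xfc]"
        rb"|[\xc0-\xfc][\x40-\x7e\xc0-\xfc]))){5,}"
    ),
    "SJIS_C": re.compile(
        rb"(([\x81-\x9f\xe0-\xfc][\x40-\x7e\x80-\xfc])"
        rb"(?!([\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf][\x80-\xbf]"
        rb"|[\xa1-\xfe][\xa1-\xfe]))){7,}"
    ),
}


def detect(data: bytes) -> list:
    """Return the names of the rules whose pattern matches the bytes."""
    return [name for name, pattern in RULES.items() if pattern.search(data)]


# Hypothetical sample text, encoded both ways.
text = "こんにちは世界です"
hits_sjis = detect(text.encode("shift_jis"))
hits_utf8 = detect(text.encode("utf-8"))
print(hits_sjis, hits_utf8)
```

With this sample, only SJIS_C hits the Shift-JIS bytes and only UTF8 hits the UTF-8 bytes. One short sample does not prove the rules are free of false positives on real mail.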
Many Japanese spam messages are written in the Shift-JIS encoding, so a
rule that detects Shift-JIS is convenient.
But if character normalization is inserted before body testing, my rules
will no longer work, because [body] will always be UTF-8.
Do I have to rewrite the above 2 rules from [body] to [rawbody]?
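
My concern can be sketched in Python as follows. I assume here that the normalization simply transcodes the text to UTF-8. `normalize_to_utf8` is a hypothetical stand-in of mine, not the function from the patch, and the sample sentence is made up.

```python
import re

# SJIS_C pattern copied from my rule, split only for line length.
SJIS_C = re.compile(
    rb"(([\x81-\x9f\xe0-\xfc][\x40-\x7e\x80-\xfc])"
    rb"(?!([\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf][\x80-\xbf]"
    rb"|[\xa1-\xfe][\xa1-\xfe]))){7,}"
)


def normalize_to_utf8(data: bytes, charset: str) -> bytes:
    """Hypothetical stand-in for the patch: transcode the text to UTF-8."""
    return data.decode(charset).encode("utf-8")


# What [rawbody] would hold for a Shift-JIS mail (assumed sample text).
rawbody = "こんにちは世界です".encode("shift_jis")
# What [body] would hold once normalization runs before body testing.
body = normalize_to_utf8(rawbody, "shift_jis")

hit_on_rawbody = SJIS_C.search(rawbody) is not None
hit_on_body = SJIS_C.search(body) is not None
print(hit_on_rawbody, hit_on_body)  # True False
```

So the pattern still finds Shift-JIS in the raw bytes, but not in the normalized text.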
--
Matsuda Yoh-ich(yoh)
mailto:[EMAIL PROTECTED]
http://www.flcl.org/~yoh/diary/