Re: I18n and l10n

Motoharu Kubo Mon, 16 Jan 2006 09:23:02 -0800

MATSUDA Yoh-ichi wrote:

Is the above flow diagram correct?
And John-san's and Motoharu-san's patches are:


                                                           |    |
              +--------------------------------------------+    |
              |                           (NEW!)                V
              +-> converting html -> UTF-8 character -> [body]->+
                  to plain text        normalization            |
              +-------------------------------------------------+
              |      (NEW!)
              +-> tokenization -> [bayes]
                   by Mecab


My opinion is to tokenize just after charset normalization.

UTF-8 character -> tokenization -> [body]
normalization

I have explained several times why I insist on this flow.  In short:

(1) to join words split by a line break (e.g. "a\nb" becomes "ab" if
    "ab" is a word)
(2) to clarify word boundaries (e.g. "youwon" -> "you won")
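
The two points above can be illustrated with a toy sketch in Python.  It
is not MeCab: the `segment` function below is a hypothetical greedy
longest-match tokenizer over a tiny hand-made word list, used only to
show why line breaks should be removed before tokenizing.

```python
# Toy stand-in for a morphological analyzer such as MeCab (assumption:
# a tiny hand-made dictionary and greedy longest-match segmentation).
WORDS = {"you", "won", "ab"}
MAX_LEN = max(len(w) for w in WORDS)


def segment(text):
    """Split text into dictionary words; unknown characters become single tokens."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first.
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            if text[i:i + size] in WORDS:
                tokens.append(text[i:i + size])
                i += size
                break
        else:
            tokens.append(text[i])  # not in the dictionary
            i += 1
    return tokens


def tokenize_body(text):
    """Join lines first, then segment, so a word split by a line break is found."""
    joined = text.replace("\r", "").replace("\n", "")
    return " ".join(segment(joined))


print(tokenize_body("a\nb"))     # ab
print(tokenize_body("youwon"))   # you won
```

If the line breaks were kept, "a" and "b" would reach the tokenizer as
separate fragments and the word "ab" would never be seen.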

Many Japanese spams are written in the Shift-JIS charset.
A rule that detects Shift-JIS is convenient.


My opinion is yes and no.

- There are many SJIS spams, but also many iso-2022-jp encoded spams.
- Not all SJIS mails are spam.  A careless alert mail sent from a Windows
  application is also SJIS encoded (without base64/quoted-printable
  encoding).
- There might be some tendency or difference between SJIS spam and
  iso-2022-jp spam, but I think it is not very significant.
- Writing rules in hex notation is troublesome and boring, and it lowers
  productivity.  If we normalized the charset, we could write rules
  directly in a UTF-8 aware editor.
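
To make the last point concrete, here is a minimal Python sketch of
charset normalization.  It assumes the declared charset is correct and
uses Python's built-in `shift_jis` and `iso2022_jp` codecs; the sample
word and the `normalize` helper are illustrative only and are not part
of any patch discussed in this thread.

```python
import re

SAMPLE = "無料"  # "free of charge", chosen only as an illustration


def normalize(raw, charset):
    """Decode raw body bytes to a Unicode string using the declared charset."""
    return raw.decode(charset, errors="replace")


sjis_bytes = SAMPLE.encode("shift_jis")
jis_bytes = SAMPLE.encode("iso2022_jp")

# The same word has different byte sequences, so a byte-level rule
# needs one hex pattern per encoding.
print(sjis_bytes.hex(), jis_bytes.hex())

# After normalization, one rule written as readable text covers both.
rule = re.compile("無料")
for raw, charset in [(sjis_bytes, "shift_jis"), (jis_bytes, "iso2022_jp")]:
    print(charset, bool(rule.search(normalize(raw, charset))))
```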

But if character normalization is inserted before body testing,
my rules will no longer work.

Do I have to rewrite the above 2 rules from [body] to [rawbody]?


There are two possibilities.

(1) rewrite from BODY to RAWBODY, as Matsuda-san says.
(2) invent NBODY (or something else) apart from BODY.  NBODY contains
    a normalized and tokenized version of the body.  I once thought of
    this idea but did not propose it, because BODY has the problems I
    mentioned above and the overhead of executing nbody_test increases.
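
Option (2) could look roughly like the following Python sketch.  It is
only a model of the idea, not SpamAssassin code: the `Message` class, its
`body` and `nbody` properties, and the `tokenize` placeholder are all
hypothetical, and MIME decoding and HTML stripping are ignored.  Building
the extra rendering lazily is one possible way to limit the overhead
mentioned above, since it is computed only when a rule asks for it.

```python
from functools import cached_property


def tokenize(text):
    """Placeholder for a real tokenizer such as MeCab: it only joins lines."""
    return text.replace("\n", "")


class Message:
    def __init__(self, raw, charset):
        self.raw = raw          # body bytes in the original charset
        self.charset = charset

    @cached_property
    def body(self):
        # Existing byte-oriented view, left untouched so old rules keep matching.
        return self.raw

    @cached_property
    def nbody(self):
        # New view: normalized to Unicode and tokenized, built on first use only.
        return tokenize(self.raw.decode(self.charset, errors="replace"))


msg = Message("無\n料".encode("shift_jis"), "shift_jis")
print("無料" in msg.nbody)                       # True
print("無料".encode("shift_jis") in msg.body)    # False: the line break splits the word
```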

--
Motoharu Kubo
[EMAIL PROTECTED]
