This is my first post to the list. I would like to report my test results for the charset normalization patch. In addition, I would like to request a tokenization feature (with an experimental patch) and changes to the Bayes mechanism in order to improve Japanese support.

First, let me introduce myself briefly. I am a native Japanese speaker living in Japan. My company offers commercial support for spam/virus filtering with SpamAssassin, amavisd-new, and Maia Mailguard. I have been using SA for more than two years. It works great, but there are two important problems for Japanese handling.

(1) It is very hard to maintain rules for Japanese words because several
    charsets are in use (iso-2022-jp, shift-jis, utf-8, euc-jp) and
    charset normalization is not built in yet.  So I have to write a hex
    pattern for each individual charset.

    In addition, pattern matches sometimes fail.  Because the regex
    engine knows nothing about 2-byte character boundaries, the pattern
    /$C$$/ matches a certain word as expected, but it also matches
    inside the bytes of a different word such as $$C$$A (see the sketch
    after this list).

    I welcome the normalization patch because I will be able to write
    rules in UTF-8 and many mismatches of this type will be resolved.

    Today I tested the patch and it works great.  It could normalize
    iso-2022-jp, shift-jis, and utf-8 text bodies as well as MIME
    (base64) encoded header text.  In addition, it could normalize an
    incorrectly MIME-encoded header (declared as shift-jis but actually
    iso-2022-jp text).  I rewrote my ruleset for Japanese words in
    UTF-8, and these rules matched as expected.  I think more testing by
    many people is necessary, but I strongly request that this patch be
    officially included in the next release.  Thanks to John!

(2) The Bayes database contains many meaningless tokens from the text
    body.  As a result I feel it is unstable, and new mail tends to be
    classified as spam.

    For example, the line (with iso-2022-jp)

    {ESC}$BM5J!$J?M:J%;%U%lC5$7$N7hDjHG!*{ESC}(B

    is tokenized to

    BM5JJ $bm5j!$j bm5jj $BM5J!$J ...

    o As "{ESC}$B" is an leading escape sequence, this should be
      ignored.  The first meaningful token should begin with "M5".

    o Each Japanese character takes 2 bytes, so a token with an odd
      number of bytes is meaningless.  "BM5JJ" should be avoided.

    o "$bm5j!$j" (converted to lower case) corresponds to different
      characters.

    With Shift-JIS, each Japanese character begins with an 8-bit byte
    followed by a 7-bit or 8-bit byte.  Most of the information (the
    8-bit bytes) is lost.
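
Here is a minimal sketch of the mismatch described in (1).  The byte
strings are taken from the /$C$$/ example above and are only
illustrative; the point is that a raw-byte pattern written for one word
also fires inside the byte stream of an unrelated word, because the
regex engine knows nothing about 2-byte character boundaries.

use strict;
use warnings;

# Byte fragment of the word the rule is meant to catch.
my $wanted = '$C$$';
# Byte fragment of a completely different word.
my $other  = '$$C$$A';

# The raw-byte rule matches inside the wrong word, shifted by one byte.
print "false match\n" if $other =~ /\Q$wanted\E/;

# After charset normalization the body is UTF-8, so the same rule can be
# written with the actual Japanese characters and byte boundaries can no
# longer shift.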

I think two more enhancements are necessary to improve Japanese support.

(1) "split word with space" (tokenization) feature.  There is no space
    between words in Japanese (and Chinese, Korean).  Human can
    understand easily but tokenization is necessary for computer
    processing.  There is a program called kakasi and Text:Kakasi
    (GPLed) which handles tokenization based on special dictionary.  I
    made quick hack to John's patch experimentally and tested.

    As Kakasi does not support UTF-8, we have to convert UTF-8 to
    EUC-JP, process it with kakasi, and then convert back to UTF-8.  It
    is ugly, but it works fine.  Most words are split correctly, and the
    mismatch mentioned above does not occur.  (A standalone sketch of
    this round trip follows the patch below.)

    As spam in Japanese is increasing, adding this kind of native
    language support would be great.

Index: lib/Mail/SpamAssassin/Message/Node.pm
===================================================================
--- Node.pm     2006-01-08 22:31:30.497174000 +0900
+++ Node.pm.new 2006-01-08 22:33:34.000000000 +0900
@@ -363,7 +363,15 @@

   dbg("Converting...");

-  return $converter->decode($data, 0);
+use Text::Kakasi;
+
+  my $res = Encode::encode("euc-jp",$converter->decode($data, 0));
+  my $rc  = Text::Kakasi::getopt_argv('kakasi','-ieuc','-w');
+  my $str = Text::Kakasi::do_kakasi($res);
+
+#dbg( "Kakasi: $str");
+  return Encode::decode("euc-jp",$str);
+#  return $converter->decode($data, 0);
 }

 =item rendered()
===================================================================
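
For reference, here is a standalone sketch of the same round trip
outside SpamAssassin.  It only assumes the classic Text::Kakasi
interface (getopt_argv/do_kakasi) used in the patch above and takes a
UTF-8 sentence on the command line.

#!/usr/bin/perl
use strict;
use warnings;
use Encode;
use Text::Kakasi;

my $utf8_bytes = $ARGV[0] or die "usage: $0 <utf-8 text>\n";

# Kakasi does not understand UTF-8, so go through EUC-JP.
my $euc = Encode::encode('euc-jp', Encode::decode('utf-8', $utf8_bytes));

# "-w" is kakasi's wakati-gaki mode: insert spaces between words.
Text::Kakasi::getopt_argv('kakasi', '-ieuc', '-w');
my $split = Text::Kakasi::do_kakasi($euc);

# Convert the word-split text back to UTF-8 and print it.
print Encode::encode('utf-8', Encode::decode('euc-jp', $split)), "\n";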

(2) The raw text body is passed to the Bayes tokenizer.  This causes
    some difficulties.

    For example, the word "Judo" is represented by two Kanji characters.
    It is encoded as:

      {ESC}$B=@F;{ESC}(B             ISO-2022-JP
      0x8f 0x5f 0x93 0xb9            Shift-JIS
      0xe6 0x9f 0x94 0xe9 0x81 0x93  UTF-8
      0xbd 0xc0 0xc6 0xbb            EUC-JP

    Thus (unless the token is lost) many records are registered for one
    and the same word, which lowers the efficacy.  (A short sketch
    reproducing these byte sequences follows at the end of this item.)

    For the ISO-2022-JP encoding, there is another problem.  We use
    hiragana and katakana very often (about 30 to 70% of the characters
    used are hiragana or katakana).  These characters are mapped to the
    lower area of the 7-bit space.  The following pattern is an example
    that actually exists.

      {ESC}$B$3$N$3$H$K$D$$$F$O!"{ESC}(B

    Each pair of bytes after the starting escape sequence ({ESC}$B)
    corresponds to one Japanese character ($3, $N, $3, $H, and so on).

    As mentioned above, the current Bayes tokenizer cannot handle our
    charsets well.  Only ASCII words such as URIs and some technical
    words (Linux, Windows, ...) are registered; other useful words are
    dropped.  Instead, meaningless tokens are registered, which adds
    noise.

    I think that if Bayes could accept a normalized, dictionary-based
    tokenized body and could handle the 8-bit portion well, we could
    improve its effectiveness for Japanese.
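
As a quick illustration of the "Judo" point above, this sketch uses
only the Encode module to reproduce the byte sequences listed there
from one and the same pair of Kanji (U+67D4 and U+9053, as confirmed by
the UTF-8 bytes above).  It shows why the untranslated body feeds the
Bayes database several unrelated token sets for a single word.

use strict;
use warnings;
use Encode;

# The two Kanji for "Judo" (U+67D4, U+9053).
my $judo = "\x{67D4}\x{9053}";

# For iso-2022-jp the output also contains the 0x1b escape sequences
# written as {ESC} in the table above.
for my $charset (qw(iso-2022-jp shiftjis utf-8 euc-jp)) {
    my $bytes = Encode::encode($charset, $judo);
    printf "%-12s %s\n", $charset,
           join ' ', map { sprintf '0x%02x', ord } split //, $bytes;
}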

Any comments, suggestions, or info are greatly appreciated.

--
Motoharu Kubo
[EMAIL PROTECTED]
