This is my first post to the list. I would like to report my test
results for the charset normalization patch. In addition, I would like
to request a tokenization feature (with an experimental patch) and
changes to the Bayes mechanism in order to improve Japanese support.
First, let me introduce myself briefly. I am a native Japanese, living
in Japan. My company offers commercial support for spam/virus
filtering with SpamAssassin, amavisd-new, and Maia Mailguard. I have
been using SA for more than two years. It works great, but there are
two important problems for Japanese handling.
(1) It is very hard to maintain rules for Japanese words because there
are several charsets in use (iso-2022-jp, shift-jis, utf-8, euc-jp)
and charset normalization is not built in yet, so I have to write
hex patterns for each individual charset.
In addition, pattern matches sometimes fail. The pattern /$C$$/
matches a certain word as expected, but it also matches inside a
different word whose encoded bytes are $$C$$A, because the same
bytes reappear with the character boundaries shifted.
I welcome the normalization patch because I will be able to write
rules in UTF-8, and many mismatches of this type will be resolved.
Today I tested the patch and it works great. It normalized
iso-2022-jp, shift-jis, and utf-8 text bodies as well as MIME (base64)
encoded header text. In addition, it normalized an incorrectly
labelled MIME-encoded header (declared as shift-jis but actually
iso-2022-jp text). I rewrote my ruleset for Japanese words in UTF-8,
and these rules matched as expected (a small sketch of the difference
follows below). I think more testing by many people is still
necessary, but I would strongly request that this patch be included
officially in the next release. Thanks to John!
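
To illustrate the difference, here is a small sketch (not part of the
patch; the word and the byte values are the "Judo" example I list
later in this mail, and the patterns are only for illustration, not
real SA rules):

use strict;
use warnings;
use Encode qw(encode decode);

# The word "Judo" (two Kanji) as a Perl Unicode string.
my $word = "\x{67D4}\x{9053}";

# Without normalization, one byte-level pattern per charset is needed:
my %hex_rule = (
    'iso-2022-jp' => qr/\x1b\x24\x42\x3d\x40\x46\x3b/,   # {ESC}$B=@F;
    'shiftjis'    => qr/\x8f\x5f\x93\xb9/,
    'euc-jp'      => qr/\xbd\xc0\xc6\xbb/,
);

# With normalization, a single UTF-8 pattern covers every charset:
my $utf8_rule = qr/\x{67D4}\x{9053}/;

for my $cs (sort keys %hex_rule) {
    my $raw        = encode($cs, $word);   # what the raw mail body contains
    my $normalized = decode($cs, $raw);    # what the patch hands to the rules
    printf "%-12s raw: %-8s  normalized: %s\n", $cs,
        ($raw        =~ $hex_rule{$cs} ? 'match' : 'no match'),
        ($normalized =~ $utf8_rule     ? 'match' : 'no match');
}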
(2) The Bayes database contains many meaningless tokens from the text
body. As a result I feel it is unstable, and new mail tends to be
classified as spam.
For example, the line (in iso-2022-jp)
{ESC}$BM5J!$J?M:J%;%U%lC5$7$N7hDjHG!*{ESC}(B
is tokenized to
BM5JJ $bm5j!$j bm5jj $BM5J!$J ...
o As "{ESC}$B" is an leading escape sequence, this should be
ignored. The first meaningful token should begin with "M5".
o Each Japanese character needs 2-bytes. Thus odd-byte token is
meaningless. "BM5JJ" should be avoided.
o "$bm5j!$j" (converted to lower case) corresponds to different
characters.
With Shift-JIS, each Japanese character begins with an 8-bit byte
followed by a 7-bit or 8-bit byte. Most of the information (the
8-bit bytes) is lost.
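
Here is a small check of the lower-casing point above (a sketch only;
I lowercase just the two-byte payload and keep the escape sequences
intact so that each string can be decoded on its own):

use strict;
use warnings;
use Encode qw(decode);

binmode STDOUT, ':encoding(utf-8)';

# "M5" and "J!" are the first two JIS characters of the example line.
my $payload = 'M5J!';
for my $p ($payload, lc $payload) {
    my $jis = "\x1b\$B" . $p . "\x1b(B";    # wrap in valid escape sequences
    printf "%-4s -> %s\n", $p, decode('iso-2022-jp', $jis);
}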
I think two more enhancements are necessary to improve Japanese support.
(1) A "split words with spaces" (tokenization) feature. There are no
spaces between words in Japanese (or in Chinese and Korean). Humans
can read such text easily, but tokenization is necessary for
computer processing. There is a program called kakasi, and a GPLed
Perl module Text::Kakasi, which handle tokenization based on a
special dictionary. I made a quick experimental hack to John's
patch and tested it (the diff is below).
As Kakasi does not support UTF-8, we have to convert UTF-8 to
EUC-JP, process it with kakasi, and then convert back to UTF-8. It
is ugly, but it works fine. Most words are split correctly, and the
mismatch mentioned above does not occur.
As spam in Japanese is increasing, this kind of native-language
support would be great.
Index: lib/Mail/SpamAssassin/Message/Node.pm
===================================================================
--- Node.pm 2006-01-08 22:31:30.497174000 +0900
+++ Node.pm.new 2006-01-08 22:33:34.000000000 +0900
@@ -363,7 +363,15 @@
     dbg("Converting...");
-    return $converter->decode($data, 0);
+    use Text::Kakasi;
+
+    my $res = Encode::encode("euc-jp", $converter->decode($data, 0));
+    my $rc  = Text::Kakasi::getopt_argv('kakasi', '-ieuc', '-w');
+    my $str = Text::Kakasi::do_kakasi($res);
+
+    #dbg("Kakasi: $str");
+    return Encode::decode("euc-jp", $str);
+    # return $converter->decode($data, 0);
 }
 =item rendered()
===================================================================
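
For reference, here is the same round trip pulled out of SpamAssassin
into a standalone helper, so it can be tried without patching anything
(just a sketch; split_japanese is a name I made up, and it assumes
kakasi and Text::Kakasi are already installed):

use strict;
use warnings;
use Encode qw(encode decode);
use Text::Kakasi;

# $text is an already-normalized (Unicode) body string.
sub split_japanese {
    my ($text) = @_;
    my $euc = encode('euc-jp', $text);                   # Kakasi cannot read UTF-8
    Text::Kakasi::getopt_argv('kakasi', '-ieuc', '-w');  # -w: wakati (insert spaces)
    my $split = Text::Kakasi::do_kakasi($euc);
    return decode('euc-jp', $split);                     # back to a Unicode string
}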
(2) The raw text body is passed to the Bayes tokenizer. This causes
some difficulties.
For example, the word "Judo" is represented by two Kanji characters.
It is encoded as:
{ESC}$B=@F;{ESC}(B                 ISO-2022-JP
0x8f 0x5f 0x93 0xb9                Shift-JIS
0xe6 0x9f 0x94 0xe9 0x81 0x93      UTF-8
0xbd 0xc0 0xc6 0xbb                EUC-JP
Thus (when the tokens are not lost) many records for the same word
are registered, and this lowers the efficacy.
For the ISO-2022-JP encoding there is another problem. We use hiragana
and katakana very often (about 30 to 70% of the characters used are
hiragana or katakana), and these characters are mapped into the lower
part of the 7-bit space. The following pattern is an example that
actually exists:
{ESC}$B$3$N$3$H$K$D$$$F$O!"{ESC}(B
Every two bytes just after the starting escape sequence ({ESC}$B)
correspond to one Japanese character ($3, $N, $3, $H, and so on); a
quick check of this follows at the end of this item.
As mentioned above, the current Bayes tokenizer cannot handle our
charsets well. Only ASCII words such as URIs and some technical words
(such as Linux, Windows, ...) are registered; other useful words are
dropped, and meaningless tokens are registered instead, which would
be noise.
I think that if Bayes could accept a normalized and tokenized body
(based on a dictionary) and could handle the 8-bit portion well, we
could improve its effectiveness for Japanese.
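
Coming back to the ISO-2022-JP example above, here is the quick check
I mentioned (a sketch only) that every two payload bytes after {ESC}$B
make up one character:

use strict;
use warnings;
use Encode qw(decode);

# The bytes of the example line between {ESC}$B and {ESC}(B.
my $payload = '$3$N$3$H$K$D$$$F$O!"';
my $decoded = decode('iso-2022-jp', "\x1b\$B" . $payload . "\x1b(B");
printf "%d JIS bytes -> %d characters\n", length($payload), length($decoded);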
Any comments, suggestions, or info would be greatly appreciated.
--
Motoharu Kubo
[EMAIL PROTECTED]