John
Thank you very much for your help.
Amazingly, the Bayes score for ham dropped drastically after yesterday's
patch. I tested the same mail text on the old system and the new system:
the old system returns BAYES_99, while the new system returns BAYES_00!
Although my patch is still incomplete and the Bayes db on the new system is
a mixture of old-style and new-style tokens, it is an excellent result.
Today I changed:
o The splitter. I separated the Kakasi processing into its own function
and moved the routine from Message/Node.pm to Message.pm. This makes it
easier to replace Kakasi with another program. The function simply returns
the text unchanged if it contains no UTF-8 data, so the performance loss
for single-byte charsets should be minimal.
The splitter is called from:
get_rendered_body_text_array()
get_visible_rendered_body_text_array()
o Bayes tokenization of long tokens. The original code cuts two bytes at a
time from the start of the token. As the multibyte UTF-8 characters
involved here (Japanese) are at least 3 bytes long, I changed it to cut
one UTF-8 character at a time.
I am not sure whether this change is appropriate.
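To illustrate the second change, here is a minimal sketch of cutting a long
token one UTF-8 character at a time while still working on raw bytes. The
helper name split_utf8_chars is hypothetical and not part of the patch.
Unlike the pattern in the patch, it also accepts 2-byte sequences and falls
back to single bytes for ASCII or stray bytes; that is an assumption about
the desired behaviour, not something the patch does.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the patch): split a raw byte string into
# whole UTF-8 sequences instead of fixed two-byte tuples.
sub split_utf8_chars {
    my ($token) = @_;
    my @pieces;
    # Either a lead byte plus all of its continuation bytes, or one
    # single byte (ASCII, or a stray continuation byte).
    while ($token =~ /\G([\xc0-\xff][\x80-\xbf]*|[\x00-\xbf])/gs) {
        push @pieces, "8:$1";   # same "8:" prefix the tokenizer uses
    }
    return @pieces;
}

# Two hiragana characters as raw UTF-8 bytes (3 bytes each).
my $token = "\xe3\x81\x82\xe3\x81\x84";
my @parts = split_utf8_chars($token);
print scalar(@parts), " pieces\n";
```

Because every byte matches one of the two alternatives, the loop always
consumes the whole token, so nothing is silently dropped when a token mixes
character widths.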
I attached my newest patch.
> The patch you include below includes most of my change, but omits the
> following hunk. Perhaps the lack of that change is your problem?
>
> @@ -385,7 +411,7 @@
> }
> else {
> $self->{rendered_type} = $self->{type};
> - $self->{rendered} = $text;
> + $self->{rendered} = $self->{visible_rendered} = $text;
> }
> }
My mistake. I had not looked at svn. I have included this hunk and removed
my own modification. It works fine.
> The problem here is the "use bytes" pragma at the top of
> Bayes.pm--you'll want to remove that. Removing it will have some
> follow-on consequences--the "use bytes" pragma will probably also have
> to be removed from BayesStore and the other Bayes-related modules. The
> BayesStore subclasses probably will also have to be modified to become
> UTF-8 aware, storing tokens in UTF-8 form.
I did not change this, because I think speed is another important factor
for a mail filter.
I inserted a check for whether the data contains UTF-8 characters, but it
may not be accurate. s/([\x20-\x7f])\xa0+([\x20-\x7f])/$1$2/g would be more
accurate when the "use bytes" pragma is in effect.
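A minimal sketch of that idea, assuming byte semantics: treat 0xA0 as
whitespace only when it sits between printable ASCII bytes, so that a 0xA0
which is really a UTF-8 continuation byte is left alone. The helper name
squash_nbsp_bytes is hypothetical. This variant uses lookarounds and
substitutes a space rather than deleting the byte, which is my assumption
about the intent and differs from the expression above.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse runs of 0xA0 to one space, but only
# where the run is surrounded by printable ASCII bytes.
sub squash_nbsp_bytes {
    my ($text) = @_;
    $text =~ s/(?<=[\x20-\x7f])\xa0+(?=[\x20-\x7f])/ /g;
    return $text;
}

my $latin1 = "foo\xa0bar";      # ISO-8859-1 no-break space between words
my $utf8   = "\xe3\x81\xa0";    # U+3060 in UTF-8; its last byte is 0xA0

print "latin1 changed\n" if squash_nbsp_bytes($latin1) ne $latin1;
print "utf8 untouched\n" if squash_nbsp_bytes($utf8)   eq $utf8;
```

The lookarounds do not consume the neighbouring bytes, so adjacent cases
such as "a\xa0b\xa0c" are both handled in a single pass.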
Motoharu Kubo
[EMAIL PROTECTED]
diff -uNr SpamAssassin.orig/Bayes.pm SpamAssassin/Bayes.pm
--- SpamAssassin.orig/Bayes.pm 2005-08-12 09:38:47.000000000 +0900
+++ SpamAssassin/Bayes.pm 2006-01-11 21:04:36.555264391 +0900
@@ -345,7 +345,7 @@
# include quotes, .'s and -'s for URIs, and [$,]'s for Nigerian-scam strings,
# and ISO-8859-15 alphas. Do not split on @'s; better results keeping it.
# Some useful tokens: "$31,000,000" "www.clock-speed.net" "f*ck" "Hits!"
- tr/-A-Za-z0-9,[EMAIL PROTECTED]'"\$.\241-\377 / /cs;
+ tr/-A-Za-z0-9,[EMAIL PROTECTED]'"\$.\200-\377 / /cs;
# DO split on "..." or "--" or "---"; common formatting error resulting in
# hapaxes. Keep the separator itself as a token, though, as long ones can
@@ -411,11 +411,11 @@
# the domain ".net" appeared in the To header.
#
if ($len > MAX_TOKEN_LENGTH && $token !~ /\*/) {
- if (TOKENIZE_LONG_8BIT_SEQS_AS_TUPLES && $token =~ /[\xa0-\xff]{2}/) {
+ if (TOKENIZE_LONG_8BIT_SEQS_AS_TUPLES && $token =~ /[\xc0-\xff][\x80-\xbf]{2,}/) {
# Matt sez: "Could be asian? Autrijus suggested doing character ngrams,
# but I'm doing tuples to keep the dbs small(er)." Sounds like a plan
# to me! (jm)
- while ($token =~ s/^(..?)//) {
+ while ($token =~ s/^([\xc0-\xff][\x80-\xbf]{2,})//) {
push (@rettokens, "8:$1");
}
next;
diff -uNr SpamAssassin.orig/HTML.pm SpamAssassin/HTML.pm
--- SpamAssassin.orig/HTML.pm 2005-08-12 09:38:47.000000000 +0900
+++ SpamAssassin/HTML.pm 2006-01-10 22:45:26.000000000 +0900
@@ -742,7 +742,12 @@
}
}
else {
- $text =~ s/[ \t\n\r\f\x0b\xa0]+/ /g;
+ if ( $text =~ /[\xc0-\xff][\x80-\xbf][\x80-\xbf]/ ) {
+ $text =~ s/[ \t\n\r\f\x0b]+/ /g;
+ }
+ else {
+ $text =~ s/[ \t\n\r\f\x0b\xa0]+/ /g;
+ }
# trim leading whitespace if previous element was whitespace
if (@{ $self->{text} } &&
defined $self->{text_whitespace} &&
diff -uNr SpamAssassin.orig/Message/Node.pm SpamAssassin/Message/Node.pm
--- SpamAssassin.orig/Message/Node.pm 2005-08-12 09:38:46.000000000 +0900
+++ SpamAssassin/Message/Node.pm 2006-01-11 21:08:33.547919446 +0900
@@ -42,6 +42,8 @@
use Mail::SpamAssassin::HTML;
use Mail::SpamAssassin::Logger;
+our $normalize_supported = ( $] > 5.008004 && eval 'require Encode::Detect::Detector' && eval 'require Encode' );
+
=item new()
Generates an empty Node object and returns it. Typically only called
@@ -342,6 +344,28 @@
return 0;
}
+sub _normalize {
+ my ($data, $charset) = @_;
+ return $data unless $normalize_supported;
+ my $detected = Encode::Detect::Detector::detect($data);
+ dbg("Detected charset ".($detected || 'none'));
+
+ my $converter;
+
+ if ($charset && ($detected || 'none') !~ /^(?:UTF|EUC|ISO-2022|Shift_JIS|Big5|GB)/i) {
+ dbg("Using labeled charset $charset");
+ $converter = Encode::find_encoding($charset);
+ }
+
+ $converter = Encode::find_encoding($detected) unless $converter || !defined($detected);
+
+ return $data unless $converter;
+
+ dbg("Converting...");
+
+ return $converter->decode($data, 0);
+}
+
=item rendered()
render_text() takes the given text/* type MIME part, and attempts to
@@ -359,7 +383,7 @@
return(undef,undef) unless ( $self->{'type'} =~ /^text\b/i );
if (!exists $self->{rendered}) {
- my $text = $self->decode();
+ my $text = _normalize($self->decode(), $self->{charset});
my $raw = length($text);
# render text/html always, or any other text|text/plain part as text/html
@@ -386,7 +410,7 @@
}
else {
$self->{rendered_type} = $self->{type};
- $self->{rendered} = $text;
+ $self->{rendered} = $self->{visible_rendered} = $text;
}
}
@@ -478,7 +502,7 @@
if ( $cte eq 'B' ) {
# base 64 encoded
- return Mail::SpamAssassin::Util::base64_decode($data);
+ $data = Mail::SpamAssassin::Util::base64_decode($data);
}
elsif ( $cte eq 'Q' ) {
# quoted printable
@@ -486,12 +510,13 @@
# the RFC states that in the encoded text, "_" is equal to "=20"
$data =~ s/_/=20/g;
- return Mail::SpamAssassin::Util::qp_decode($data);
+ $data = Mail::SpamAssassin::Util::qp_decode($data);
}
else {
# not possible since the input has already been limited to 'B' and 'Q'
die "message: unknown encoding type '$cte' in RFC2047 header";
}
+ return _normalize($data, $encoding);
}
# Decode base64 and quoted-printable in headers according to RFC2047.
@@ -505,15 +530,15 @@
$header =~ s/\n[ \t]+/\n /g;
$header =~ s/\r?\n//g;
- return $header unless $header =~ /=\?/;
-
# multiple encoded sections must ignore the interim whitespace.
# to avoid possible FPs with (\s+(?==\?))?, look for the whole RE
# separated by whitespace.
1 while ($header =~ s/(=\?[\w_-]+\?[bqBQ]\?[^?]+\?=)\s+(=\?[\w_-]+\?[bqBQ]\?[^?]+\?=)/$1$2/g);
- $header =~
- s/=\?([\w_-]+)\?([bqBQ])\?([^?]+)\?=/__decode_header($1, uc($2), $3)/ge;
+ unless ($header =~
+ s/=\?([\w_-]+)\?([bqBQ])\?([^?]+)\?=/__decode_header($1, uc($2), $3)/ge) {
+ $header = _normalize($header);
+ }
return $header;
}
diff -uNr SpamAssassin.orig/Message.pm SpamAssassin/Message.pm
--- SpamAssassin.orig/Message.pm 2005-09-14 11:07:31.000000000 +0900
+++ SpamAssassin/Message.pm 2006-01-11 21:07:15.045589574 +0900
@@ -760,6 +760,7 @@
# 0: content-type, 1: boundary, 2: charset, 3: filename
my @ct = Mail::SpamAssassin::Util::parse_content_type($part_msg->header('content-type'));
$part_msg->{'type'} = $ct[0];
+ $part_msg->{'charset'} = $ct[2];
# multipart sections are required to have a boundary set ... If this
# one doesn't, assume it's malformed and revert to text/plain
@@ -871,12 +872,17 @@
# whitespace handling (warning: small changes have large effects!)
$text =~ s/\n+\s*\n+/\f/gs; # double newlines => form feed
- $text =~ tr/ \t\n\r\x0b\xa0/ /s; # whitespace => space
+ if ( $text =~ /[\xc0-\xff][\x80-\xbf][\x80-\xbf]/ ) {
+ $text =~ tr/ \t\n\r\x0b/ /s; # whitespace => space
+ }
+ else {
+ $text =~ tr/ \t\n\r\x0b\xa0/ /s; # whitespace => space
+ }
$text =~ tr/\f/\n/; # form feeds => newline
# warn "message: $text";
- my @textary = split_into_array_of_short_lines ($text);
+ my @textary = split_into_array_of_short_lines (splitter($text));
$self->{text_rendered} = \@textary;
return $self->{text_rendered};
@@ -931,10 +937,15 @@
# whitespace handling (warning: small changes have large effects!)
$text =~ s/\n+\s*\n+/\f/gs; # double newlines => form feed
- $text =~ tr/ \t\n\r\x0b\xa0/ /s; # whitespace => space
+ if ( $text =~ /[\xc0-\xff][\x80-\xbf][\x80-\xbf]/ ) {
+ $text =~ tr/ \t\n\r\x0b/ /s; # whitespace => space
+ }
+ else {
+ $text =~ tr/ \t\n\r\x0b\xa0/ /s; # whitespace => space
+ }
$text =~ tr/\f/\n/; # form feeds => newline
- my @textary = split_into_array_of_short_lines ($text);
+ my @textary = split_into_array_of_short_lines (splitter($text));
$self->{text_visible_rendered} = \@textary;
return $self->{text_visible_rendered};
@@ -982,7 +993,13 @@
# whitespace handling (warning: small changes have large effects!)
$text =~ s/\n+\s*\n+/\f/gs; # double newlines => form feed
- $text =~ tr/ \t\n\r\x0b\xa0/ /s; # whitespace => space
+ if ( $text =~ /[\xc0-\xff][\x80-\xbf][\x80-\xbf]/ ) {
+ $text =~ tr/ \t\n\r\x0b/ /s; # whitespace => space
+ }
+ else {
+ $text =~ tr/ \t\n\r\x0b\xa0/ /s; # whitespace => space
+ }
+ $text =~ tr/ \t\n\r\x0b/ /s; # whitespace => space
$text =~ tr/\f/\n/; # form feeds => newline
my @textary = split_into_array_of_short_lines ($text);
@@ -1028,6 +1045,25 @@
# ---------------------------------------------------------------------------
+sub splitter {
+ my ($text) = @_;
+
+ if ( $text !~ /[\xc0-\xff][\x80-\xbf]{2,}/ ) { return $text; }
+
+ $text =~ s/([\xc0-\xff][\x80-\xbf]{2,})[ \n]+([\xc0-\xff][\x80-\xbf]{2,})/$1$2/gs;
+
+ use Text::Kakasi;
+ Text::Kakasi::getopt_argv('kakasi','-ieuc','-w');
+
+ my $res = Encode::encode("euc-jp",Encode::decode("utf8",$text));
+ my $str = Text::Kakasi::do_kakasi($res);
+ my $utf8= Encode::decode("euc-jp",$str);
+
+ return $utf8;
+}
+
+# ---------------------------------------------------------------------------
+
1;
=back
diff -uNr SpamAssassin.orig/Util/DependencyInfo.pm SpamAssassin/Util/DependencyInfo.pm
--- SpamAssassin.orig/Util/DependencyInfo.pm 2005-09-14 11:07:31.000000000 +0900
+++ SpamAssassin/Util/DependencyInfo.pm 2006-01-10 22:45:26.000000000 +0900
@@ -168,6 +168,12 @@
desc => 'The "sa-update" script requires this module to access compressed
update archive files.',
},
+{
+ module => 'Encode::Detect',
+ version => '0.00',
+ desc => 'If this module is installed, SpamAssassin will detect charsets
+ and convert them into Unicode.',
+},
);
###########################################################################