Re: Charset normalization issue (report, patch, and request)

Motoharu Kubo Wed, 11 Jan 2006 05:29:55 -0800

John

Thank you very much for your help.


Amazingly, the Bayes score for ham dropped drastically with yesterday's
patch.  I tested the same mail text on the old system and the new system:
the old system returns BAYES_99, while the new system returns BAYES_00!!

Although my patch is still incomplete and the Bayes db on the new system
is a mixture of old-style and new-style tokens, it is an excellent result.

Today I changed:

o Separated the Kakasi processing into a splitter function and moved the
  routine from Message/Node.pm to Message.pm.  This should make it easier
  to replace Kakasi with another program.  The function simply returns
  the text unchanged if it contains no UTF-8 data, so the performance
  loss is minimized for single-byte charsets.

  splitter is called from:
     get_rendered_body_text_array()
     get_visible_rendered_body_text_array()

o Bayes tokenization of long tokens.  The original code cuts two bytes
  at a time from the start of the token.  As a Japanese character
  encoded in UTF-8 takes at least 3 bytes, I modified it to cut one
  UTF-8 character at a time.

  I am not sure whether this change is appropriate.
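
To illustrate the tokenization change, here is a minimal standalone
sketch.  It is not part of the patch; the helper names split_two_bytes
and split_utf8_chars are hypothetical, and only the two regular
expressions are taken from the diff.  It assumes the token is a raw byte
string, as it would be under the "use bytes" pragma.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration only; not SpamAssassin code.

# Original behaviour: cut the token two bytes at a time.
sub split_two_bytes {
  my ($token) = @_;
  my @out;
  while ($token =~ s/^(..?)//) {
    push @out, "8:$1";
  }
  return @out;
}

# Patched behaviour: cut one UTF-8 character at a time, where a
# character is a lead byte followed by two or more continuation bytes.
sub split_utf8_chars {
  my ($token) = @_;
  my @out;
  while ($token =~ s/^([\xc0-\xff][\x80-\xbf]{2,})//) {
    push @out, "8:$1";
  }
  return @out;
}

# Three Japanese characters as raw UTF-8 bytes (nine bytes in total).
my $token = "\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e";

my @old = split_two_bytes($token);
my @new = split_utf8_chars($token);

printf "two-byte tuples:  %d tokens\n", scalar @old;
printf "UTF-8 characters: %d tokens\n", scalar @new;
```

With two-byte tuples the nine bytes become five tokens, none of which
lines up with a character boundary.  With the character-wise regex they
become three tokens, one per character.  Note that the patched loop
stops at the first byte that does not start a sequence of 3 or more
bytes, so any ASCII or 2-byte characters inside a long token end the
splitting early.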

I attached my newest patch.

> The patch you include below includes most of my change, but omits the
> following hunk. Perhaps the lack of that change is your problem?
> 
> @@ -385,7 +411,7 @@
> }
> else {
> $self->{rendered_type} = $self->{type};
> - $self->{rendered} = $text;
> + $self->{rendered} = $self->{visible_rendered} = $text;
> }
> }

My mistake.  I had not looked at svn.  I have included this hunk and
deleted my own modification.  It works fine.

> The problem here is the "use bytes" pragma at the top of
> Bayes.pm--you'll want to remove that. Removing it will have some
> follow-on consequences--the "use bytes" pragma will probably also have
> to be removed from BayesStore and the other Bayes-related modules. The
> BayesStore subclasses probably will also have to be modified to become
> UTF-8 aware, storing tokens in UTF-8 form.

I did not change this, because I think speed is another important factor
for a mail filter.

I inserted a check for whether the data contains UTF-8 characters, but
it may not be accurate.  s/([\x20-\x7f])\xa0+([\x20-\x7f])/$1$2/g would
be more accurate when the "use bytes" pragma is in effect.
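
The reason \xa0 needs care under byte semantics can be shown with a
small standalone sketch.  This is an illustration only: the helper
names squeeze_whitespace and strip_nbsp_between_ascii are hypothetical,
and the expressions inside them are the ones from the patch and from
the paragraph above.

```perl
use strict;
use warnings;

# U+3060 encoded as UTF-8 is the bytes E3 81 A0, so its last byte is
# the same as a Latin-1 no-break space when the text is read as bytes.
my $da = "\xe3\x81\xa0";

# Unpatched whitespace handling: every \xa0 byte becomes a space,
# which destroys the character.
(my $naive = "a${da}b") =~ tr/ \t\n\r\x0b\xa0/ /s;

# Approach used in the patch: leave \xa0 alone when the text appears
# to contain 3-byte UTF-8 sequences.
sub squeeze_whitespace {
  my ($text) = @_;
  if ($text =~ /[\xc0-\xff][\x80-\xbf][\x80-\xbf]/) {
    $text =~ tr/ \t\n\r\x0b/ /s;
  }
  else {
    $text =~ tr/ \t\n\r\x0b\xa0/ /s;
  }
  return $text;
}

# Alternative from the message: only touch \xa0 bytes that sit between
# two ASCII bytes, so continuation bytes are never affected.
sub strip_nbsp_between_ascii {
  my ($text) = @_;
  $text =~ s/([\x20-\x7f])\xa0+([\x20-\x7f])/$1$2/g;
  return $text;
}

print "naive handling damaged the character\n" if $naive ne "a${da}b";
print "patched handling kept the character\n"
  if squeeze_whitespace("a${da}b") eq "a${da}b";
print "ASCII-guarded regex kept the character\n"
  if strip_nbsp_between_ascii("a${da}b") eq "a${da}b";
```

The ASCII-guarded regex does not depend on guessing whether the whole
text is UTF-8, which is why it may be the more accurate of the two.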

Motoharu Kubo
[EMAIL PROTECTED]

diff -uNr SpamAssassin.orig/Bayes.pm SpamAssassin/Bayes.pm
--- SpamAssassin.orig/Bayes.pm	2005-08-12 09:38:47.000000000 +0900
+++ SpamAssassin/Bayes.pm	2006-01-11 21:04:36.555264391 +0900
@@ -345,7 +345,7 @@
   # include quotes, .'s and -'s for URIs, and [$,]'s for Nigerian-scam strings,
   # and ISO-8859-15 alphas.  Do not split on @'s; better results keeping it.
   # Some useful tokens: "$31,000,000" "www.clock-speed.net" "f*ck" "Hits!"
-  tr/-A-Za-z0-9,\@\*\!_'"\$.\241-\377 / /cs;
+  tr/-A-Za-z0-9,\@\*\!_'"\$.\200-\377 / /cs;
 
   # DO split on "..." or "--" or "---"; common formatting error resulting in
   # hapaxes.  Keep the separator itself as a token, though, as long ones can
@@ -411,11 +411,11 @@
     # the domain ".net" appeared in the To header.
     #
     if ($len > MAX_TOKEN_LENGTH && $token !~ /\*/) {
-      if (TOKENIZE_LONG_8BIT_SEQS_AS_TUPLES && $token =~ /[\xa0-\xff]{2}/) {
+      if (TOKENIZE_LONG_8BIT_SEQS_AS_TUPLES && $token =~ /[\xc0-\xff][\x80-\xbf]{2,}/) {
 	# Matt sez: "Could be asian? Autrijus suggested doing character ngrams,
 	# but I'm doing tuples to keep the dbs small(er)."  Sounds like a plan
 	# to me! (jm)
-	while ($token =~ s/^(..?)//) {
+	while ($token =~ s/^([\xc0-\xff][\x80-\xbf]{2,})//) {
 	  push (@rettokens, "8:$1");
 	}
 	next;
diff -uNr SpamAssassin.orig/HTML.pm SpamAssassin/HTML.pm
--- SpamAssassin.orig/HTML.pm	2005-08-12 09:38:47.000000000 +0900
+++ SpamAssassin/HTML.pm	2006-01-10 22:45:26.000000000 +0900
@@ -742,7 +742,12 @@
     }
   }
   else {
-    $text =~ s/[ \t\n\r\f\x0b\xa0]+/ /g;
+    if ( $text =~ /[\xc0-\xff][\x80-\xbf][\x80-\xbf]/ ) {
+      $text =~ s/[ \t\n\r\f\x0b]+/ /g;
+    }
+    else {
+      $text =~ s/[ \t\n\r\f\x0b\xa0]+/ /g;
+    }
     # trim leading whitespace if previous element was whitespace
     if (@{ $self->{text} } &&
 	defined $self->{text_whitespace} &&
diff -uNr SpamAssassin.orig/Message/Node.pm SpamAssassin/Message/Node.pm
--- SpamAssassin.orig/Message/Node.pm	2005-08-12 09:38:46.000000000 +0900
+++ SpamAssassin/Message/Node.pm	2006-01-11 21:08:33.547919446 +0900
@@ -42,6 +42,8 @@
 use Mail::SpamAssassin::HTML;
 use Mail::SpamAssassin::Logger;
 
+our $normalize_supported = ( $] > 5.008004 && eval 'require Encode::Detect::Detector' && eval 'require Encode' );
+
 =item new()
 
 Generates an empty Node object and returns it.  Typically only called
@@ -342,6 +344,28 @@
   return 0;
 }
 
+sub _normalize {
+  my ($data, $charset) = @_;
+  return $data unless $normalize_supported;
+  my $detected = Encode::Detect::Detector::detect($data);
+  dbg("Detected charset ".($detected || 'none'));
+
+  my $converter;
+
+  if ($charset && ($detected || 'none') !~ /^(?:UTF|EUC|ISO-2022|Shift_JIS|Big5|GB)/i) {
+      dbg("Using labeled charset $charset");
+      $converter = Encode::find_encoding($charset);
+  }
+
+  $converter = Encode::find_encoding($detected) unless $converter || !defined($detected);
+
+  return $data unless $converter;
+
+  dbg("Converting...");
+
+  return $converter->decode($data, 0);
+}
+
 =item rendered()
 
 render_text() takes the given text/* type MIME part, and attempts to
@@ -359,7 +383,7 @@
   return(undef,undef) unless ( $self->{'type'} =~ /^text\b/i );
 
   if (!exists $self->{rendered}) {
-    my $text = $self->decode();
+    my $text = _normalize($self->decode(), $self->{charset});
     my $raw = length($text);
 
     # render text/html always, or any other text|text/plain part as text/html
@@ -386,7 +410,7 @@
     }
     else {
       $self->{rendered_type} = $self->{type};
-      $self->{rendered} = $text;
+      $self->{rendered} = $self->{visible_rendered} = $text;
     }
   }
 
@@ -478,7 +502,7 @@
 
   if ( $cte eq 'B' ) {
     # base 64 encoded
-    return Mail::SpamAssassin::Util::base64_decode($data);
+    $data = Mail::SpamAssassin::Util::base64_decode($data);
   }
   elsif ( $cte eq 'Q' ) {
     # quoted printable
@@ -486,12 +510,13 @@
     # the RFC states that in the encoded text, "_" is equal to "=20"
     $data =~ s/_/=20/g;
 
-    return Mail::SpamAssassin::Util::qp_decode($data);
+    $data = Mail::SpamAssassin::Util::qp_decode($data);
   }
   else {
     # not possible since the input has already been limited to 'B' and 'Q'
     die "message: unknown encoding type '$cte' in RFC2047 header";
   }
+  return _normalize($data, $encoding);
 }
 
 # Decode base64 and quoted-printable in headers according to RFC2047.
@@ -505,15 +530,15 @@
   $header =~ s/\n[ \t]+/\n /g;
   $header =~ s/\r?\n//g;
 
-  return $header unless $header =~ /=\?/;
-
   # multiple encoded sections must ignore the interim whitespace.
   # to avoid possible FPs with (\s+(?==\?))?, look for the whole RE
   # separated by whitespace.
   1 while ($header =~ s/(=\?[\w_-]+\?[bqBQ]\?[^?]+\?=)\s+(=\?[\w_-]+\?[bqBQ]\?[^?]+\?=)/$1$2/g);
 
-  $header =~
-    s/=\?([\w_-]+)\?([bqBQ])\?([^?]+)\?=/__decode_header($1, uc($2), $3)/ge;
+  unless ($header =~
+	  s/=\?([\w_-]+)\?([bqBQ])\?([^?]+)\?=/__decode_header($1, uc($2), $3)/ge) {
+    $header = _normalize($header);
+  }
 
   return $header;
 }
diff -uNr SpamAssassin.orig/Message.pm SpamAssassin/Message.pm
--- SpamAssassin.orig/Message.pm	2005-09-14 11:07:31.000000000 +0900
+++ SpamAssassin/Message.pm	2006-01-11 21:07:15.045589574 +0900
@@ -760,6 +760,7 @@
   # 0: content-type, 1: boundary, 2: charset, 3: filename
   my @ct = Mail::SpamAssassin::Util::parse_content_type($part_msg->header('content-type'));
   $part_msg->{'type'} = $ct[0];
+  $part_msg->{'charset'} = $ct[2];
 
   # multipart sections are required to have a boundary set ...  If this
   # one doesn't, assume it's malformed and revert to text/plain
@@ -871,12 +872,17 @@
 
   # whitespace handling (warning: small changes have large effects!)
   $text =~ s/\n+\s*\n+/\f/gs;		# double newlines => form feed
-  $text =~ tr/ \t\n\r\x0b\xa0/ /s;	# whitespace => space
+  if ( $text =~ /[\xc0-\xff][\x80-\xbf][\x80-\xbf]/ ) {
+    $text =~ tr/ \t\n\r\x0b/ /s;	# whitespace => space
+  }
+  else {
+    $text =~ tr/ \t\n\r\x0b\xa0/ /s;	# whitespace => space
+  }
   $text =~ tr/\f/\n/;			# form feeds => newline
   
   # warn "message: $text";
 
-  my @textary = split_into_array_of_short_lines ($text);
+  my @textary = split_into_array_of_short_lines (splitter($text));
   $self->{text_rendered} = \@textary;
 
   return $self->{text_rendered};
@@ -931,10 +937,15 @@
 
   # whitespace handling (warning: small changes have large effects!)
   $text =~ s/\n+\s*\n+/\f/gs;		# double newlines => form feed
-  $text =~ tr/ \t\n\r\x0b\xa0/ /s;	# whitespace => space
+  if ( $text =~ /[\xc0-\xff][\x80-\xbf][\x80-\xbf]/ ) {
+    $text =~ tr/ \t\n\r\x0b/ /s;	# whitespace => space
+  }
+  else {
+    $text =~ tr/ \t\n\r\x0b\xa0/ /s;	# whitespace => space
+  }
   $text =~ tr/\f/\n/;			# form feeds => newline
 
-  my @textary = split_into_array_of_short_lines ($text);
+  my @textary = split_into_array_of_short_lines (splitter($text));
   $self->{text_visible_rendered} = \@textary;
 
   return $self->{text_visible_rendered};
@@ -982,7 +993,13 @@
 
   # whitespace handling (warning: small changes have large effects!)
   $text =~ s/\n+\s*\n+/\f/gs;		# double newlines => form feed
-  $text =~ tr/ \t\n\r\x0b\xa0/ /s;	# whitespace => space
+  if ( $text =~ /[\xc0-\xff][\x80-\xbf][\x80-\xbf]/ ) {
+    $text =~ tr/ \t\n\r\x0b/ /s;	# whitespace => space
+  }
+  else {
+    $text =~ tr/ \t\n\r\x0b\xa0/ /s;	# whitespace => space
+  }
+  $text =~ tr/ \t\n\r\x0b/ /s;	# whitespace => space
   $text =~ tr/\f/\n/;			# form feeds => newline
 
   my @textary = split_into_array_of_short_lines ($text);
@@ -1028,6 +1045,25 @@
 
 # ---------------------------------------------------------------------------
 
+sub splitter {
+  my ($text) = @_;
+
+  if ( $text !~ /[\xc0-\xff][\x80-\xbf]{2,}/ ) { return $text; }
+
+  $text =~ s/([\xc0-\xff][\x80-\xbf]{2,})[ \n]+([\xc0-\xff][\x80-\xbf]{2,})/$1$2/gs;
+
+  use Text::Kakasi;
+  Text::Kakasi::getopt_argv('kakasi','-ieuc','-w');
+
+  my $res = Encode::encode("euc-jp",Encode::decode("utf8",$text));
+  my $str = Text::Kakasi::do_kakasi($res);
+  my $utf8= Encode::decode("euc-jp",$str);
+
+  return $utf8;
+}
+
+# ---------------------------------------------------------------------------
+
 1;
 
 =back
diff -uNr SpamAssassin.orig/Util/DependencyInfo.pm SpamAssassin/Util/DependencyInfo.pm
--- SpamAssassin.orig/Util/DependencyInfo.pm	2005-09-14 11:07:31.000000000 +0900
+++ SpamAssassin/Util/DependencyInfo.pm	2006-01-10 22:45:26.000000000 +0900
@@ -168,6 +168,12 @@
   desc => 'The "sa-update" script requires this module to access compressed
   update archive files.',
 },
+{
+  module => 'Encode::Detect',
+  version => '0.00',
+  desc => 'If this module is installed, SpamAssassin will detect charsets
+  and convert them into Unicode.',
+},
 );
 
 ###########################################################################
