https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6229

           Summary: TextCat is too case sensitive
           Product: Spamassassin
           Version: SVN Trunk (Latest Devel Version)
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P5
         Component: Plugins
        AssignedTo: [email protected]
        ReportedBy: [email protected]


Created an attachment (id=4562)
 --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4562)
TextCat problem sample

It seems the language database is case sensitive. For example, English spam
written entirely in uppercase gets badly wrong results.

I have no idea what the best way to fix this would be. For now I'm using a
quick fix like this to get better results:

--- TextCat.pm.orig   2009-10-29 09:23:46.985152046 +0200
+++ TextCat.pm  2009-10-29 09:24:38.339651987 +0200
@@ -440,6 +440,7 @@
   # my $non_word_characters = qr/[0-9\s]/;
   for my $word (split(/[0-9\s]+/, ${$_[0]}))
   {
+    $word =~ tr/A-ZÖÄÅ/a-zöäå/ if $word =~ /[a-zA-ZöäåÖÄÅ]{4}/;
     $word = "\000" . $word . "\000";
     my $len = length($word);
     my $flen = $len;

Attached is a sample message. Running it with textcat_max_languages 20 gives
us:

ja.iso-2022-jp de zh.big5 sk.windows-1250 id sk.us-ascii cs.iso-8859-2 ca da vi
sw ms tl ne pl

Running it with my fix gives the expected single "en".
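
To show the effect outside SpamAssassin, here is a rough standalone sketch. It
is not the actual TextCat.pm code: fold_word, ngram_set and overlap are
hypothetical helpers, and the 1 to 5 character n-gram window is an assumption.
Only the split pattern, the NUL padding and the tr/// line are taken from the
patch above. It compares the n-grams of an uppercase sample against a
lowercase reference, with and without the folding step.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper (not in TextCat.pm): lowercase a word only when it
# contains a run of at least four letters, same condition as the patch.
sub fold_word {
    my ($word) = @_;
    $word =~ tr/A-ZÖÄÅ/a-zöäå/ if $word =~ /[a-zA-ZöäåÖÄÅ]{4}/;
    return $word;
}

# Hypothetical helper: collect the set of n-grams for a text. Words are
# split and NUL-padded as in the patch context; the 1..5 n-gram length
# range is an assumption made for this sketch.
sub ngram_set {
    my ($text, $fold) = @_;
    my %seen;
    for my $word (split /[0-9\s]+/, $text) {
        next if $word eq '';
        $word = fold_word($word) if $fold;
        $word = "\000" . $word . "\000";
        my $len = length $word;
        for my $i (0 .. $len - 1) {
            for my $n (1 .. 5) {
                last if $i + $n > $len;
                $seen{ substr($word, $i, $n) } = 1;
            }
        }
    }
    return \%seen;
}

# Fraction of the sample's n-grams that also occur in the reference set.
sub overlap {
    my ($sample, $reference) = @_;
    my @keys = keys %$sample;
    return 0 unless @keys;
    my $hits = grep { exists $reference->{$_} } @keys;
    return $hits / @keys;
}

# Stand-in for a lowercase language profile; made-up sample text.
my $reference = ngram_set("free money offer today", 0);
my $raw       = ngram_set("FREE MONEY OFFER TODAY", 0);
my $folded    = ngram_set("FREE MONEY OFFER TODAY", 1);

printf "overlap without folding: %.2f\n", overlap($raw, $reference);
printf "overlap with folding:    %.2f\n", overlap($folded, $reference);
```

Without folding, almost none of the uppercase n-grams appear in the lowercase
reference. Note that the four-letter condition means short all-caps words
(e.g. "BUY") are left untouched by this quick fix.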

-- 
Configure bugmail: 
https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.
