[Bug 6229] [review] TextCat is too case sensitive

bugzilla-daemon Fri, 06 May 2011 00:01:05 -0700

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6229


--- Comment #9 from Henrik Krohns <[email protected]> 2011-05-06 07:00:21 UTC ---
(In reply to comment #8)
> > Too much technical debate for 3.3.2 consideration.  Retargeting to 3.4.0.
> 
> How about just doing a plain lc for now, which will at least
> handle all-ascii text such as English:
> 
> - $word = "\000" . $word . "\000";
> + $word = "\000" . lc($word) . "\000";
> 
> and leave the bug open for a better solution in 3.4 ?

I'm currently trying a "proper" set of characters. IMO lc is too vague and
locale-dependent.

$word =~ tr/A-Z\xc0-\xd6\xd8-\xde/a-z\xe0-\xf6\xf8-\xfe/
    if $word =~ /[A-Z]/
    && $word =~ /[a-zA-Z\xc0-\xd6\xd8-\xde\xe0-\xf6\xf8-\xfe]{4}/;

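This table covers all the Latin-1 accented capitals (0xc0-0xd6 and 0xd8-0xde;
0xd7 is the multiplication sign, so it is skipped). The loop below prints the
mapping: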

foreach (192..214, 216..222) {
    printf "%s %x %s %s %x %s\n",
        $_, $_, chr($_), $_ + 32, $_ + 32, chr($_ + 32);
}
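
To make the behaviour of that one-liner easier to check, here is a minimal
sketch that wraps it in a helper. The name lower_latin1 is hypothetical (it is
not part of the TextCat plugin), and the sketch assumes words arrive as
Latin-1 byte strings.

```perl
use strict;
use warnings;

# Hypothetical helper, not from the TextCat plugin: lowercase ASCII and
# Latin-1 capitals with a fixed table, independent of the current locale.
sub lower_latin1 {
    my ($word) = @_;
    # Same guard as the one-liner: only touch words that contain an ASCII
    # capital and a run of at least four Latin letters.
    $word =~ tr/A-Z\xc0-\xd6\xd8-\xde/a-z\xe0-\xf6\xf8-\xfe/
        if $word =~ /[A-Z]/
        && $word =~ /[a-zA-Z\xc0-\xd6\xd8-\xde\xe0-\xf6\xf8-\xfe]{4}/;
    return $word;
}

# Print each sample word and its result as dot-separated byte values,
# so the output does not depend on the terminal encoding.
for my $w ("HELLO", "\x{c9}COLE", "\x{c9}T\x{c9}", "\x{c9}\x{c9}\x{c9}\x{c9}") {
    printf "%vd -> %vd\n", $w, lower_latin1($w);
}
```

Note that the guard leaves short words and words whose only capitals are
accented untouched, since /[A-Z]/ matches ASCII capitals only.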

I'm also quite certain that lowering textcat_acceptable_score to 1.02 is the
right thing to do. I'm currently building a small corpus of different
languages, including a separate FP corpus. I'll have some results soon.
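
For anyone who wants to try the lower threshold locally, a sketch of the
setting in a SpamAssassin configuration file follows. It assumes the TextCat
plugin is already loaded; 1.02 is the value proposed in this comment, not a
tested recommendation.

```
# Assumption: placed in a site config file such as local.cf,
# with the TextCat plugin enabled elsewhere.
textcat_acceptable_score 1.02
```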

