[Bug 6229] [review] TextCat is too case sensitive

bugzilla-daemon Mon, 09 May 2011 08:42:39 -0700

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6229


--- Comment #21 from Mark Martinec <[email protected]> 2011-05-09 15:41:58 UTC ---
> Btw we have a (somewhat forgotten) normalize_charset feature. :-) It converts
> rendered() body to latin1, using Encode::Detect and utf8::downgrade.

normalize_charset suffers from two problems:
- it tries to *guess* a character set from a text sample,
  instead of taking the encoding information from a MIME subheader;
- Bug 5691 (Slow rules due to charset normalization) is still
  applicable. The test case attached there still takes 19 times as
  long as a non-UTF8 case using perl 5.12 (it used to be 30 times
  slower with older perl).
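
To illustrate the first point, here is a minimal sketch of reading the
declared charset of each MIME part instead of guessing it from the text.
It uses Python's standard email package purely as a stand-in (SpamAssassin
is Perl, and this is not its code); the sample message and the helper
name declared_charsets are hypothetical.

```python
import email

# Hypothetical two-part message; each part declares its own charset.
RAW = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XX"\n'
    "\n"
    "--XX\n"
    "Content-Type: text/plain; charset=iso-8859-2\n"
    "\n"
    "first part\n"
    "--XX\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "second part\n"
    "--XX--\n"
)


def declared_charsets(raw):
    """Return the charset declared by each text/* part (None if absent)."""
    msg = email.message_from_string(raw)
    return [
        part.get_content_charset()
        for part in msg.walk()
        if part.get_content_maintype() == "text"
    ]


print(declared_charsets(RAW))
```

A declared charset can be missing or wrong, so a real implementation
would presumably still need some fallback for those cases.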

> I think we could discuss about it in some related or new bug. Maybe even 3.4
> could have it on by default.

I used to have normalize_charset enabled, but after being bitten by
extreme slowdowns on some mail messages we can no longer afford
to use this feature on a production mailer. Too bad. It is not something
that could be enabled by default in 3.4, if you ask me.

> In any case we probably need to keep the "lc-code" forever, since it could be
> hard to create textcat database with all case variations.. but we need to make
> sure we know the locale for body and handle accordingly.

True.
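
A small sketch of why the encoding has to be known before lowercasing.
This is Python used only for illustration, not SpamAssassin code, and
Perl's lc behaves differently in its details; the sample word is made up.
Lowercasing raw bytes only touches ASCII letters, while lowercasing the
decoded text also handles accented letters.

```python
# "ÄRGER" encoded as ISO-8859-1; byte 0xC4 is capital A with diaeresis.
raw = "ÄRGER".encode("iso-8859-1")

# Byte-level lowercasing only maps ASCII A-Z, so 0xC4 is left untouched.
bytewise = raw.lower()

# Decoding with the known charset first lets the whole word be lowercased.
decoded = raw.decode("iso-8859-1").lower()

print(bytewise)
print(decoded)
```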

Let's just keep things simple for 3.3.2 and apply this simple patch,
then we can open a new problem report to discuss introducing fancier
(but also riskier) changes, such as proper handling of the encoding
of each MIME part of a message.

