[Bug 4636] Charset normalization plugin support

bugzilla-daemon Tue, 18 Oct 2005 10:32:19 -0700

http://bugzilla.spamassassin.org/show_bug.cgi?id=4636






------- Additional Comments From [EMAIL PROTECTED]  2005-10-18 10:31 -------
Pasting from my original message from the dev list.  Since subsequent discussion
indicated that rawbody rules should not charset-normalize the text, I've since
found the body normalization best plugs into Node::rendered(). I suspect most
Western installations would not want to pay the cost of charset normalization,
so would want it disabled.


The following is a preliminary proposal for adding charset
normalization to SpamAssassin, converting message text into Perl's
internal Unicode representation.  The primary reason I want to do this
work is to improve the ability of SpamAssassin to discriminate between
Japanese ham and Japanese spam.

SpamAssassin currently ignores charset information, effectively
assuming all mail is in iso-8859-1.  This works for users whose ham is
encoded in iso-8859-1 and mostly works for users whose ham is encoded
in other single-byte charsets.  For East Asian languages, this is
insufficient for doing text analysis.

Since a large number of SpamAssassin users are likely to be
uninterested in East Asian ham and thus unlikely to want to pay the
cost of charset normalization, the normalization support needs to be
optional, defaulting to off.

Some messages contain unlabeled charsets, others use MIME charset
labels.  Some MIME charset labels are not useful
(e.g. "unknown-8bit").  To handle unlabeled or uselessly labeled
data, it is necessary to run a charset detector over the text in order
to determine what to convert it from.  Encode::Guess effectively
requires the caller to specify the language of the text, so I consider
it too simplistic.  Mozilla's universal charset detector would be
better, but I would have to wrap it up as a CPAN module.
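
As a rough illustration of the Encode::Guess limitation described above,
the sketch below shows that the caller has to supply the list of suspect
encodings, which in practice means knowing the language in advance.  The
helper name guess_charset and the sample bytes are assumptions made for
this example, not part of SpamAssassin.

```perl
use strict;
use warnings;
use Encode;
use Encode::Guess;

# Hypothetical helper (not a SpamAssassin function): guess the charset of
# unlabeled octets from a caller-supplied list of suspect encodings.
sub guess_charset {
    my ($octets, @suspects) = @_;
    my $enc = guess_encoding($octets, @suspects);
    # guess_encoding returns an encoding object on success and a plain
    # error string when nothing fits or the result is ambiguous.
    return ref($enc) ? $enc->name : undef;
}

# Three kanji ("nihongo") encoded as EUC-JP, used as sample input.
my $euc_jp = "\xC6\xFC\xCB\xDC\xB8\xEC";

# The suspects here are all Japanese encodings: the caller has already
# decided what language the text is in.
my $guess = guess_charset($euc_jp, qw(euc-jp shiftjis 7bit-jis));
print defined $guess ? "guessed: $guess\n" : "no unambiguous guess\n";
```

With a different suspect list (say, Korean or Chinese encodings) the same
bytes could be accepted as something else entirely, which is why a
language-independent detector is attractive.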

It is common for Korean messages to have an incorrect MIME label of
"iso-8859-1", so it may be necessary to run a charset detector even
over MIME-labeled charsets.

After the charset has been determined, either from the MIME label or
the charset detector, the data needs to be converted from that charset
to Perl's internal utf8 form.  Encode::decode() is the obvious choice
for this, though I can see reasons why an installation might want to
be able to replace the charset converters with some other
implementation.
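
A minimal sketch of the conversion step, assuming the charset is already
known.  The $converter coderef is a hypothetical hook standing in for
"some other implementation"; only Encode::find_encoding and
Encode::decode are real Encode calls.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical replaceable converter: an installation could assign its
# own coderef here instead of this Encode-based default.
my $converter = sub {
    my ($charset, $octets) = @_;
    # Unknown charset names make Encode::decode croak, so check first.
    return unless Encode::find_encoding($charset);
    # With no CHECK argument, malformed input is replaced by U+FFFD.
    return Encode::decode($charset, $octets);
};

# Three kanji ("nihongo") encoded as EUC-JP, used as sample input.
my $text = $converter->('euc-jp', "\xC6\xFC\xCB\xDC\xB8\xEC");
printf "%d characters, utf8 flag %s\n",
    length($text), utf8::is_utf8($text) ? 'on' : 'off';
```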

The following functions, immediately after they call
Mail::SpamAssassin::Message::Node::decode, need to call a
function that does charset normalization.

* Mail::SpamAssassin::Message::get_rendered_body_text_array
* Mail::SpamAssassin::Message::get_visible_rendered_body_text_array
* Mail::SpamAssassin::Message::get_decoded_body_text_array

Furthermore:

* Mail::SpamAssassin::Message::Node::_decode_header
* Mail::SpamAssassin::Message::Node::__decode_header

also need to call a function to do charset normalization:
_decode_header for unlabeled charset data, and __decode_header for
MIME encoded-words.
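
For the encoded-word case the charset label travels with the data.  The
sketch below uses Encode's own MIME-Header encoding purely to illustrate
that; it is not how SpamAssassin's __decode_header works, and the sample
header value is made up.

```perl
use strict;
use warnings;
use Encode ();

# A made-up encoded-word: three kanji ("nihongo") in UTF-8, base64-encoded.
my $raw_subject = '=?UTF-8?B?5pel5pys6Kqe?=';

# The MIME-Header encoding reads the charset label out of the
# encoded-word and returns text in Perl's internal form.
my $subject = Encode::decode('MIME-Header', $raw_subject);
printf "%d characters after decoding\n", length($subject);
```

An unlabeled 8-bit header has no such label, so it would have to go
through the charset detector instead.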

This new charset normalization function will take as arguments the
text and any MIME charset label.  The function calls the charset
detector and converter as necessary and returns the normalized text in
Perl's internal form.  The returned text will only have the utf8 flag
set if the input charset was not us-ascii or iso-8859-1.

This new charset normalization function should most likely use a
plugin callback to do all the work, though it only makes sense for one
loaded plugin to implement the callback.  If no plugin implements the
callback, then it should simply return the input text, preserving the
current behavior. 
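
The shape of that function might look like the sketch below.  All names
(normalize_charset, $normalize_cb, label_only_plugin) are hypothetical,
and the sample plugin simply trusts the MIME label rather than running a
detector; the real plugin API is still to be designed.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical callback slot: at most one loaded plugin would fill it.
my $normalize_cb;

# Hypothetical normalization entry point taking the text and any MIME
# charset label.
sub normalize_charset {
    my ($text, $label) = @_;
    # No plugin loaded: return the input unchanged (current behavior).
    return $text unless $normalize_cb;
    return $normalize_cb->($text, $label);
}

# Hypothetical plugin callback that only trusts the MIME label.
sub label_only_plugin {
    my ($text, $label) = @_;
    # us-ascii and iso-8859-1 are passed through without the utf8 flag.
    return $text
        if !defined $label || $label =~ /^(?:us-ascii|iso-8859-1)$/i;
    # Unknown labels are also passed through untouched.
    my $enc = Encode::find_encoding($label) or return $text;
    return Encode::decode($enc, $text);
}

# Three kanji ("nihongo") encoded as EUC-JP, used as sample input.
my $bytes = "\xC6\xFC\xCB\xDC\xB8\xEC";

my $before = normalize_charset($bytes, 'euc-jp');   # no plugin yet
$normalize_cb = \&label_only_plugin;
my $after  = normalize_charset($bytes, 'euc-jp');   # plugin active

printf "before: %d units, after: %d characters\n",
    length($before), length($after);
```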



------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
