http://bugzilla.spamassassin.org/show_bug.cgi?id=4636
------- Additional Comments From [EMAIL PROTECTED] 2005-10-18 10:31 -------

Pasting from my original message from the dev list. Since subsequent discussion indicated that rawbody rules should not charset-normalize the text, I've found that the body normalization plugs in best at Node::rendered(). I suspect most Western installations would not want to pay the cost of charset normalization, so would want it disabled.

The following is a preliminary proposal for how to add support for normalizing charsets into Perl's internal Unicode form. The primary reason I want to do this work is to improve the ability of SpamAssassin to discriminate between Japanese ham and Japanese spam.

SpamAssassin currently ignores charset information, effectively assuming all mail is in iso-8859-1. This works for users whose ham is encoded in iso-8859-1 and mostly works for users whose ham is encoded in other single-byte charsets. For East Asian languages, this is insufficient for doing text analysis.

Since a large number of SpamAssassin users are likely to be uninterested in East Asian ham and thus unlikely to want to pay the cost of charset normalization, the normalization support needs to be optional, defaulting to off.

Some messages contain text in unlabeled charsets; others use MIME charset labels. Some MIME charset labels are not useful (e.g. "unknown-8bit"). To handle such unlabeled data, it is necessary to run a charset detector over the text in order to determine what to convert it from. Encode::Guess effectively requires the caller to specify the language of the text, so I consider it too simplistic. A better choice would be Mozilla's universal charset detector, which I would have to wrap up as a CPAN module. It is common for Korean messages to have an incorrect MIME label of "iso-8859-1", so it may be necessary to run a charset detector even over MIME-labeled charsets.

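The label-versus-detector decision described above could be sketched roughly as follows. This is Python rather than Perl, purely to illustrate the order of the checks; it is not SpamAssassin code. The names `choose_charset`, `canonical`, `USELESS_LABELS` and `DISTRUSTED` are hypothetical, and the detector is passed in as a callback because no wrapped detector exists yet.

```python
import codecs
from typing import Callable, Optional

# Labels that carry no information. "unknown-8bit" is the only one named
# above; a real list would presumably be longer (assumption).
USELESS_LABELS = {"unknown-8bit"}

# Labels that are valid but often wrong, so the detector gets a say.
# Stored under the codec's canonical name so that aliases compare equal.
DISTRUSTED = {codecs.lookup("iso-8859-1").name}


def canonical(label: Optional[str]) -> Optional[str]:
    """Return the canonical codec name for a label, or None if unusable."""
    if not label or label.strip().lower() in USELESS_LABELS:
        return None
    try:
        return codecs.lookup(label.strip()).name
    except LookupError:
        return None


def choose_charset(
    data: bytes,
    label: Optional[str],
    detect: Callable[[bytes], Optional[str]],
) -> Optional[str]:
    """Pick the charset to convert from; None means no charset was found."""
    labeled = canonical(label)
    if labeled is not None and labeled not in DISTRUSTED:
        return labeled                   # trust the MIME label as given
    guessed = canonical(detect(data))    # unlabeled, useless, or suspect label
    return guessed or labeled            # fall back to the label if no guess
```

With this arrangement the detector only runs when the label is missing, useless, or on the distrusted list, which keeps the cost down for correctly labeled mail.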
After the charset has been determined, either from the MIME label or the charset detector, the data needs to be converted from that charset to Perl's internal utf8 form. Encode::decode() is the obvious choice for this, though I can see reasons why an installation might want to be able to replace the charset converters with some other implementation.

The following functions, immediately after they call Mail::SpamAssassin::Message::Node::decode, need to call a function that does charset normalization:

* Mail::SpamAssassin::Message::get_rendered_body_text_array
* Mail::SpamAssassin::Message::get_visible_rendered_body_text_array
* Mail::SpamAssassin::Message::get_decoded_body_text_array

Furthermore,

* Mail::SpamAssassin::Message::Node::_decode_header
* Mail::SpamAssassin::Message::Node::__decode_header

also need to call a function to do charset normalization: _decode_header for unlabeled charset data, __decode_header for MIME encoded-words.

This new charset normalization function will take as arguments the text and any MIME charset label. The function calls the charset detector and converter as necessary and returns the normalized text in Perl's internal form. The returned text will only have the utf8 flag set if the input charset was not us-ascii or iso-8859-1.

This new charset normalization function should most likely use a plugin callback to do all the work, though it only makes sense for one loaded plugin to implement the callback. If no plugin implements the callback, then it should simply return the input text, preserving the current behavior.
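The callback arrangement in the proposal's final paragraph might look something like the sketch below. It is written in Python for illustration only and is not SpamAssassin's plugin API. `register_normalizer` and `normalize_charset` are hypothetical names, and rejecting a second registration is just one assumed way to enforce the "one plugin" constraint.

```python
from typing import Callable, Optional, Union

Normalizer = Callable[[bytes, Optional[str]], str]

# Hypothetical hook: at most one plugin supplies the normalizer.
_normalizer: Optional[Normalizer] = None


def register_normalizer(callback: Normalizer) -> None:
    """Install the single normalization callback; refuse a second one."""
    global _normalizer
    if _normalizer is not None:
        raise RuntimeError("a charset normalization plugin is already loaded")
    _normalizer = callback


def normalize_charset(data: bytes, label: Optional[str] = None) -> Union[bytes, str]:
    """Hand the text and its MIME label to the plugin, if one is loaded."""
    if _normalizer is None:
        return data          # no plugin: input returned untouched
    return _normalizer(data, label)
```

A plugin would call `register_normalizer` once at load time. The decoding functions then call `normalize_charset` unconditionally, so installations without the plugin keep the current behavior and pay nothing beyond one function call.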
