http://bugzilla.spamassassin.org/show_bug.cgi?id=4636
------- Additional Comments From [EMAIL PROTECTED] 2005-10-18 13:13 -------
Subject: Re: Charset normalization plugin support

On Tue, Oct 18, 2005 at 10:31:55AM -0700, [EMAIL PROTECTED] wrote:
> indicated that rawbody rules should not charset-normalize the text, I've since
> found the body normalization best plugs into Node::rendered(). I suspect most
> Western installations would not want to pay the cost of charset normalization,
> so would want it disabled.

Earlier in the ticket you were talking about header normalization. Body
normalization is a different beast (but it's easier to deal with imo).

> The following is a preliminary proposal for how to add support for
> normalization of charsets into Perl's Unicode support. The primary
> reason I want to do this work is to improve the ability of
> SpamAssassin to discriminate between Japanese ham and Japanese spam.

It's worth noting that this is actually going to be a much larger issue than
just having a plugin, btw. The main problem is that SpamAssassin very
specifically disables Unicode in every module via "use bytes" (according to
the svn log it looks like it was added in at r3997 back in December 2002).

> Since a large number of SpamAssassin users are likely to be
> uninterested in East Asian ham and thus unlikely to want to pay the
> cost of charset normalization, the normalization support needs to be
> optional, defaulting to off.
>
> The following functions, immediately after they call
> Mail::SpamAssassin::Message::Node::decode, need to call a
> function that does charset normalization.
>
> * Mail::SpamAssassin::Message::get_rendered_body_text_array
> * Mail::SpamAssassin::Message::get_visible_rendered_body_text_array
> * Mail::SpamAssassin::Message::get_decoded_body_text_array

I was thinking that the plugin would be called by check_start, then get an
array of parts via find_parts(), then do any manipulation of the data as
required per-part (either dealing with the decoded or the rendered portions,
or both).
Since find_parts() returns references to the actual parts in the tree, you
can just modify them as necessary without jumping through a lot of hoops.
Then later, when those other functions get called, everything would already
be normalized.

Potentially, there'd be a new function in Message like "clear_rendered_cache"
or something which would delete the cached forms of text_rendered,
text_visible_rendered, text_invisible_rendered, and (if necessary/different
function) text_decoded. That way you would be sure that the normalized data
is what's used after the process occurs, even if those other functions were
called by something else previously.

> * Mail::SpamAssassin::Message::Node::_decode_header
> * Mail::SpamAssassin::Message::Node::__decode_header
>
> also need to call a function to do charset normalization.
> _decode_header for unlabeled charset data, __decode_header for
> MIME encoded-words.

I can't think of an easy way to do this other than to do the work in Node
itself, or to have a plugin do something similar for the headers as suggested
above with the body and manipulate the internal data directly. It's not very
clean from an OO perspective.

Arguably we'd always want to make sure the message is in UTF-8 format
internally, and so the code could just be in Message::Node.
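
For illustration only, the body-side idea above (fetch the parts, normalize
each one in place, then drop any cached rendered text) might look roughly
like the following. This is a standalone sketch, not SpamAssassin code:
make_part, rendered, normalize_parts and the hash layout are invented here,
and clear_rendered_cache is only the name floated above, not an existing
function. The only real library call is Encode::decode.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for a message part. This is NOT the real
# Mail::SpamAssassin::Message::Node layout, just enough to show the idea.
sub make_part {
    my ($charset, $octets) = @_;
    return { charset => $charset, text_decoded => $octets };
}

# Toy "rendering" (collapse whitespace) that caches its result, standing in
# for the cached text_rendered forms mentioned above.
sub rendered {
    my ($part) = @_;
    unless (defined $part->{text_rendered}) {
        (my $text = $part->{text_decoded}) =~ s/\s+/ /g;
        $part->{text_rendered} = $text;
    }
    return $part->{text_rendered};
}

# Hypothetical equivalent of the proposed "clear_rendered_cache".
sub clear_rendered_cache {
    my ($part) = @_;
    delete $part->{text_rendered};
}

# Normalize every part in place. Because the list holds references to the
# parts themselves, nothing has to be copied back into the tree.
sub normalize_parts {
    my (@parts) = @_;
    for my $part (@parts) {
        next unless defined $part->{charset};
        # Encode::decode converts octets in the named charset to Perl characters.
        $part->{text_decoded} = decode($part->{charset}, $part->{text_decoded});
        delete $part->{charset};      # already normalized; do not decode twice
        clear_rendered_cache($part);  # force the next rendered() to recompute
    }
}

my @tree = (
    make_part('Shift_JIS', "\x93\xFA  \x96\x7B"),  # two kanji, double space
    make_part(undef, "plain ASCII  text"),         # no charset label: untouched
);

rendered($_) for @tree;   # something rendered the parts before the plugin ran
normalize_parts(@tree);   # what the plugin would do from check_start

printf "%d characters after normalization\n", length rendered($tree[0]);
```

Without the cache clearing, the last call would return the stale rendering
of the raw Shift_JIS octets rather than the normalized text, which is the
reason for suggesting something like clear_rendered_cache in the first place.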
