[Bug 4636] Charset normalization plugin support

bugzilla-daemon Tue, 18 Oct 2005 13:14:13 -0700

http://bugzilla.spamassassin.org/show_bug.cgi?id=4636

------- Additional Comments From [EMAIL PROTECTED]  2005-10-18 13:13 -------
Subject: Re:  Charset normalization plugin support

On Tue, Oct 18, 2005 at 10:31:55AM -0700, [EMAIL PROTECTED] wrote:
> indicated that rawbody rules should not charset-normalize the text, I've since
> found the body normalization best plugs into Node::rendered(). I suspect most
> Western installations would not want to pay the cost of charset normalization,
> so would want it disabled.

Earlier in the ticket you were talking about header normalization.  Body
normalization is a different beast (but it's easier to deal with imo).

> The following is a preliminary proposal for how to add support for
> normalization of charsets into Perl's Unicode support.  The primary
> reason I want to do this work is to improve the ability of
> SpamAssassin to discriminate between Japanese ham and Japanese spam.

It's worth noting that this is actually going to be a much larger issue
than just having a plugin, btw.  The main problem is that SpamAssassin
very specifically disables Unicode in every module via "use bytes"
(according to the svn log it looks like it was added in at r3997 back
in December 2002).

> Since a large number of SpamAssassin users are likely to be
> uninterested in East Asian ham and thus unlikely to want to pay the
> cost of charset normalization, the normalization support needs to be
> optional, defaulting to off.
> 
> The following functions, immediately after they call
> Mail::SpamAssassin::Message::Node::decode, need to call a
> function that does charset normalization.
> 
> * Mail::SpamAssassin::Message::get_rendered_body_text_array
> * Mail::SpamAssassin::Message::get_visible_rendered_body_text_array
> * Mail::SpamAssassin::Message::get_decoded_body_text_array

I was thinking that the plugin would be called via check_start, then
get an array of parts via find_parts(), then do any manipulation of
the data as required per-part (dealing with the decoded portions, the
rendered portions, or both).  Since find_parts() returns references to
the actual parts in the tree, you can just modify them as necessary
without jumping through a lot of hoops.  Then later when those other
functions get called, everything would already be normalized.
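
As a rough sketch of the per-part step only (not working plugin code):
the parts here are plain hashrefs standing in for what find_parts()
would hand back, and the "charset"/"decoded" keys and normalize_part()
are names I made up for illustration.  The conversion itself uses the
core Encode module.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical stand-ins for the part references find_parts() returns.
# Real parts are Mail::SpamAssassin::Message::Node objects; the keys
# used here are assumptions made for this sketch.
my @parts = (
    { charset => 'iso-8859-1',      decoded => "caf\xE9"    },
    { charset => 'no-such-charset', decoded => "left alone" },
);

# Convert one part's decoded bytes to UTF-8 bytes, in place.
# Returns 1 if the part was converted, 0 if it was left untouched.
sub normalize_part {
    my ($part) = @_;
    my $name = $part->{charset};
    return 0 unless defined $name && length $name;

    # Unknown charset labels are skipped rather than guessed at.
    my $enc = find_encoding($name) or return 0;

    my $bytes = $part->{decoded};        # work on a copy of the octets
    my $chars = $enc->decode($bytes);    # octets -> Perl characters
    $part->{decoded} = Encode::encode('UTF-8', $chars);
    $part->{charset} = 'utf-8';
    return 1;
}

# Because we hold references, modifying each part changes the "tree".
my $converted = 0;
$converted += normalize_part($_) for @parts;
```

Parts with a charset label Encode doesn't recognize are left as they
were, which seems like the safe default for unlabeled or bogus data.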

Potentially, there'd be a new function in Message like
"clear_rendered_cache" or something which would delete the cached forms
of text_rendered, text_visible_rendered, text_invisible_rendered, and
(if necessary, perhaps via a separate function) text_decoded.  That way
you would be sure that the normalized data is what's used after the
process occurs, even if those other functions were called by something
else previously.
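
Something along these lines, purely as a sketch: the function doesn't
exist yet, and I'm assuming the cached forms are stored under hash keys
with the names above.  The $msg hashref is a stand-in, not a real
Message object.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a Message object with cached text forms.
my $msg = {
    text_rendered           => [ "stale rendered text" ],
    text_visible_rendered   => [ "stale visible text" ],
    text_invisible_rendered => [ "stale invisible text" ],
    text_decoded            => [ "stale decoded text" ],
    other_state             => 'untouched',
};

# Proposed helper: drop the cached forms so that the next caller
# regenerates them from the (now normalized) parts.
sub clear_rendered_cache {
    my ($m) = @_;
    delete @{$m}{qw(
        text_rendered
        text_visible_rendered
        text_invisible_rendered
        text_decoded
    )};
    return;
}

clear_rendered_cache($msg);
```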

> * Mail::SpamAssassin::Message::Node::_decode_header
> * Mail::SpamAssassin::Message::Node::__decode_header
> 
> also need to call a function to do charset normalization.
> _decode_header for unlabeled charset data, __decode_header for
> MIME encoded-words.

I can't think of an easy way to do this other than to do the work in Node
itself, or to have a plugin do something similar for the headers as suggested
above with the body and manipulate the internal data directly.

It's not very clean from an OO perspective.  Arguably we'd always want
to make sure the message is in utf-8 format internally, and so the code
could just be in Message::Node.

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
