Re: Charset normalization issue (report, patch, and request)

John Myers Fri, 13 Jan 2006 16:57:00 -0800

Motoharu Kubo wrote:

>>The problem here is the "use bytes" pragma at the top of
>>Bayes.pm--you'll want to remove that. Removing it will have some
>>follow-on consequences--the "use bytes" pragma will probably also have
>>to be removed from BayesStore and the other Bayes-related modules. The
>>BayesStore subclasses probably will also have to be modified to become
>>UTF-8 aware, storing tokens in UTF-8 form.
>
>I did not change because I think speed is another important factor for
>mail filter.

My experience shows that speed only becomes an issue when one ends up
using Perl's UTF-8 regex engine to evaluate rules. In the case of Bayes,
I believe correctness is more important. I would have to see a
significant measured decrease in speed before considering sacrificing
correctness for speed.


The fact that the Bayes code confuses 0xA0 bytes inside UTF-8-encoded
characters with the U+00A0 (NO-BREAK SPACE) character is one example of
an issue that would be solved were the "use bytes" pragma removed. To be
correct, the Bayes database should store all tokens in UTF-8, so that
they match regardless of how the original message was encoded.
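
To make the 0xA0 confusion concrete, here is a minimal sketch. It is
written in Python rather than Perl, and it is not the SpamAssassin
tokenizer: split_bytes(), split_chars() and normalized_token() are
hypothetical helpers made up for this illustration. The only assumption
is a tokenizer that treats space and no-break space as separators.

```python
import re

# Hypothetical helpers for illustration only; not SpamAssassin code.

def split_bytes(raw):
    # Byte-level view (the analogue of working under "use bytes"):
    # every 0xA0 byte is treated as if it were U+00A0, even when it is
    # really the second byte of a multi-byte UTF-8 character.
    return [t for t in re.split(rb"[\x20\xa0]+", raw) if t]

def split_chars(text):
    # Character-level view: only a real U+00A0 (or space) separates tokens.
    return [t for t in re.split("[\u0020\u00a0]+", text) if t]

def normalized_token(raw, charset):
    # Decode from the message's charset and re-encode as UTF-8, so the
    # same word yields the same stored token whatever the source encoding.
    return raw.decode(charset).encode("utf-8")

text = "voil\u00e0 tout"            # U+00E0 is C3 A0 in UTF-8
print(split_bytes(text.encode("utf-8")))
print(split_chars(text))
```

In the byte-level split, the A0 continuation byte of U+00E0 is consumed
as a separator and the first token is left as a truncated byte sequence.
The character-level split keeps the word whole. normalized_token() shows
the second point: the ISO-8859-1 and UTF-8 forms of the same word differ
as raw bytes but are identical once both are stored as UTF-8.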


I'm not yet convinced that tokenization belongs inside
get_rendered_body_text_array() and
get_visible_rendered_body_text_array(). I suspect the content preview,
which uses get_rendered_body_text_array(), would look strange were it to
be tokenized. I am using get_visible_rendered_body_text_array() for
something which I'm not yet convinced needs tokenization. I think this
area needs some field experience.
