Motoharu Kubo wrote:

>>The problem here is the "use bytes" pragma at the top of
>>Bayes.pm--you'll want to remove that. Removing it will have some
>>follow-on consequences--the "use bytes" pragma will probably also have
>>to be removed from BayesStore and the other Bayes-related modules. The
>>BayesStore subclasses probably will also have to be modified to become
>>UTF-8 aware, storing tokens in UTF-8 form.
>
>I did not change because I think speed is another important factor for
>mail filter.

My experience shows that speed only becomes an issue when one ends up
using Perl's UTF-8 regex engine to evaluate rules. In the case of Bayes,
I believe correctness is more important. I would have to see a
significant measured decrease in speed before considering sacrificing
correctness for speed.

The fact that the Bayes code confuses 0xA0 bytes inside UTF-8 encoded
characters with the U+00A0 (NO-BREAK SPACE) character is one example of
an issue that would be solved were the "use bytes" pragma removed. To be
correct, the Bayes database should store all tokens in UTF-8, so that
the same token matches regardless of the charset in which the message
was originally encoded.

I'm not yet convinced that tokenization belongs inside
get_rendered_body_text_array() and
get_visible_rendered_body_text_array(). I suspect the content preview,
which uses get_rendered_body_text_array(), would look strange were its
input to be tokenized. I am also using
get_visible_rendered_body_text_array() for something that I'm not yet
convinced needs tokenization. I think this area needs some field
experience.
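
To make the 0xA0 point concrete, here is a small sketch. It is written
in Python rather than Perl and is not the actual Bayes tokenizer: the
function names are made up for illustration, and I am assuming a
tokenizer that simply treats 0xA0 as a separator. It shows how a
byte-oriented split cuts a UTF-8 encoded character in half, while
decoding first and storing tokens as UTF-8 gives the same token whatever
charset the message arrived in.

```python
# Illustration only: hypothetical helpers, not SpamAssassin code.
NBSP_BYTE = b"\xa0"   # U+00A0 seen as a single raw byte
NBSP_CHAR = "\u00a0"  # U+00A0 seen as a character


def split_bytewise(raw):
    """Byte-oriented split: every 0xA0 byte is treated as a separator."""
    return [piece for piece in raw.split(NBSP_BYTE) if piece]


def split_charwise(raw, charset):
    """Decode first, split on the U+00A0 character, keep tokens as UTF-8."""
    text = raw.decode(charset)
    return [tok.encode("utf-8") for tok in text.split(NBSP_CHAR) if tok]


# U+5834 U+6240 (a two-character Japanese word). In UTF-8 the first
# character is the byte sequence E5 A0 B4, so it contains an 0xA0 byte.
word = "\u5834\u6240"
utf8 = word.encode("utf-8")

# The byte-wise split breaks the first character into two fragments.
print(split_bytewise(utf8))

# The character-wise split leaves the word as one token.
print(split_charwise(utf8, "utf-8"))

# The same word arriving as EUC-JP or Shift_JIS yields the same UTF-8 token.
print(split_charwise(word.encode("euc_jp"), "euc_jp")
      == split_charwise(word.encode("shift_jis"), "shift_jis"))
```

The fragments produced by the byte-wise split are not valid UTF-8 on
their own, so they can never match the token that a correctly decoded
copy of the same word would produce.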
