Re: Charset normalization issue (report, patch, and request)

Loren Wilton Sat, 14 Jan 2006 15:51:50 -0800

As an outsider, I find myself strongly agreeing with Motohraru-san that,
when dealing with at least the oriental multibyte languages, tokenization
belongs early in the stream, before both Bayes and rules.


Of course, this is an overhead penalty that should not be incurred on mail
that isn't likely to be encoded in this manner.  So this should be something
that only happens in the appropriate circumstances.  Whether that is a user
option in the config, or something that can be determined on the fly from
the charset declarations, I do not know.

I would hope that the check to determine whether splitting is needed is
something that will be done relatively infrequently, say no more than once
per body section in the mail or so, and not on a per-token basis.  If that
is the case, I think that splitting at the front would be appropriate for
common code in all distributions.
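
To make the "once per body section" idea concrete, here is a rough sketch
in Python rather than the project's own code.  It assumes the decision can
be made from the declared charset alone; the CJK_CHARSETS set and the
function names needs_splitting and sections_to_split are made up for
illustration and are not part of any existing patch.

```python
from email import message_from_string

# Hypothetical, incomplete list of charsets whose text would need
# splitting before Bayes and rules see it.
CJK_CHARSETS = {"iso-2022-jp", "shift_jis", "euc-jp", "euc-kr", "gb2312", "big5"}


def needs_splitting(part):
    """Decide once for a whole body section, from its charset declaration."""
    charset = part.get_content_charset()  # lowercased, or None if undeclared
    return charset is not None and charset in CJK_CHARSETS


def sections_to_split(raw_message):
    """Return (content type, needs splitting) for each leaf body section."""
    msg = message_from_string(raw_message)
    decisions = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers hold no text of their own
        decisions.append((part.get_content_type(), needs_splitting(part)))
    return decisions
```

The check runs once per section, so mail without such a charset declaration
pays only for a header lookup and a set membership test.  Mail that carries
a wrong or missing declaration would be missed by this approach.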

If the check must be repeated with great frequency, or the check is
inherently painful, then perhaps there should be two versions of the
functions making these splitting decisions, and which to use would be
conditioned by a user option.

        Loren
