Re: Charset normalization issue (report, patch, and request)

Justin Mason Sat, 14 Jan 2006 16:57:47 -0800

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

"Loren Wilton" writes:
> As an outsider, I find myself strongly agreeing with Motoharu-san that,
> when dealing with at least the oriental multibyte languages, tokenization
> belongs early in the stream, before both Bayes and rules.
> 
> Of course this is an overhead penalty that should not occur on mail that
> isn't likely to be encoded in this manner.  So this should be something that
> only happens in the appropriate circumstances.  Whether that is a user
> option in the config, or is something that can be determined on the fly from
> the charset declarations I do not know.


I'm not sure I understand why.

Currently, Bayes is the only code that actually *uses* knowledge of how a
string is tokenized into words; this isn't exposed to the rules at all.

If it should be, that's an entirely separate feature request. ;)

> I would hope that the check to determine whether splitting is needed is
> something that will be done relatively infrequently, say no more than once
> per body section in the mail or so, and not on a per-token basis.  Given
> that that is the case, I think that splitting at the front would be
> appropriate for common code in all distributions.
> 
> If the check must be repeated with great frequency, or the check is
> inherently painful, then perhaps there should be two versions of the
> functions making these splitting decisions, and which to use would be
> conditioned by a user option.

I don't understand these paragraphs :(

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFDyZ3nMJF5cimLx9ARAiIJAJ9csLEZT6mDoCAThhRKOai43nbiuwCgssL3
oYwVIRVrbDdAdfevW+Uzdis=
=yzYu
-----END PGP SIGNATURE-----
