-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


"Loren Wilton" writes:
> As an outsider, I find myself strongly agreeing with Motohraru-san that,
> when dealing with at least the oriental multibyte languages, tokenization
> belongs early in the stream, before both Bayes and rules.
>
> Of course this is an overhead penalty that should not occur on mail that
> isn't likely to be encoded in this manner.  So this should be something that
> only happens in the appropriate circumstances.  Whether that is a user
> option in the config, or is something that can be determined on the fly from
> the charset declarations I do not know.

I'm not sure I understand why.  Currently, Bayes is the only code that
actually *uses* knowledge of how a string is tokenized into words; this
isn't exposed to the rules at all.  If it should be, that's an entirely
separate feature request. ;)

> I would hope that the check to determine whether splitting is something that
> will be done relatively infrequently, say no more than once per body section
> in the mail or so, and not on a per-token basis.  Given that that is the
> case, I think that splitting at the front would be appropriate for common
> code in all distributions.
>
> If the check must be repeated with great frequency, or the check is
> inherently painful, then perhaps there should be two versions of the
> functions making these splitting decisions, and which to use would be
> conditioned by a user option.

I don't understand these paragraphs :(

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFDyZ3nMJF5cimLx9ARAiIJAJ9csLEZT6mDoCAThhRKOai43nbiuwCgssL3
oYwVIRVrbDdAdfevW+Uzdis=
=yzYu
-----END PGP SIGNATURE-----
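
For readers trying to picture the "once per body section" check from the
quoted text, here is a minimal sketch in Python. It is not SpamAssassin
code, and every name in it (CJK_CHARSETS, needs_segmentation,
split_characters, tokenize_message) is hypothetical. It assumes the
decision can be made from the declared charset of each MIME part, and it
uses a one-token-per-character splitter as a stand-in for a real word
segmenter.

```python
from email import message_from_string

# Assumption: charsets that suggest text without whitespace word boundaries.
# The list is illustrative, not exhaustive.
CJK_CHARSETS = {"iso-2022-jp", "euc-jp", "shift_jis", "big5", "gb2312", "euc-kr"}


def needs_segmentation(part):
    """Decide once per MIME part, from its declared charset only."""
    charset = part.get_content_charset()  # lowercased, or None if undeclared
    return charset is not None and charset in CJK_CHARSETS


def split_whitespace(text):
    """Default path: whitespace-delimited tokens."""
    return text.split()


def split_characters(text):
    """Placeholder segmenter: one token per non-space character."""
    return [ch for ch in text if not ch.isspace()]


def tokenize_message(msg):
    """Tokenize every text part, choosing the splitter once per part."""
    tokens = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "ascii"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset name: fall back to a lossless byte mapping.
            text = payload.decode("latin-1")
        # The check runs here, per part, not inside the per-token loop.
        splitter = split_characters if needs_segmentation(part) else split_whitespace
        tokens.extend(splitter(text))
    return tokens
```

Because the charset test runs once per part, mail that declares no
multibyte charset pays only for a set lookup, which is the overhead concern
raised in the quoted text. Whether tokens produced this way should be
visible to rules as well as to Bayes is the separate question raised in the
reply.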
