Hello, developers.
This is my first mail to this ML.

From: "Loren Wilton" <[EMAIL PROTECTED]>
Subject: Re: Charset normalization issue (report, patch, and request)
Date: Sat, 14 Jan 2006 18:48:14 -0800

> > > As an outsider, I find myself strongly agreeing with Motohraru-san that,
> > > when dealing with at least the oriental multibyte languages, tokinization
> > > belongs early in the stream, before both bayes and rules.
> >
> > I'm not sure I understand why.
>
> It amounts to a form of obfuscation.  He mentioned someplace that "words" do
> not come with natural wordbreaks normally, so a rule to catch a given word
> is /stringofletters/.  But because text gets linewrapped, you might end up
> with /stri\nngoflette\nrs/, which will fail to match the desired rule.

Loren-san, I think you are mixing two things up.
A text matching rule should match the text 'as is'.

Splitting a word with LF is not the only word obfuscation technique
spammers use. They also replace characters:
'o' -> '0', 'i' -> '1', 'l' -> '|', 'a' -> '@', and many more.
Tokenization does not help against these techniques.

If you want to write matching rules against tokenized text, the interface
should be separate from the other rule types: 'body', 'rawbody', 'full', etc.

Ex.
  tokenizedbody STRINGOFLETTERS /stringofletters/

However, the bayes engine handles tokenized text.
I don't need a 'tokenizedbody' interface.

-- 
MATSUDA Yoh-ichi(yoh)
mailto:[EMAIL PROTECTED]
http://www.flcl.org/~yoh/diary/ (only Japanese)
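
The difference between line-wrap splitting and character substitution can be illustrated with a minimal Python sketch. This is an illustration only, not SpamAssassin code: the names RULE, LEET, join_wrapped and undo_substitutions are hypothetical, and the substitution table contains only the four examples from the message above. Removing line breaks (a very rough stand-in for what early normalization would do to wrapped text) repairs the wrapped word, but it does nothing for the substituted characters.

```python
import re

# Hypothetical rule pattern, mirroring /stringofletters/ from the message.
RULE = re.compile(r"stringofletters")

# Hypothetical substitution table, limited to the examples in the message:
# '0' for 'o', '1' for 'i', '|' for 'l', '@' for 'a'.
LEET = str.maketrans({"0": "o", "1": "i", "|": "l", "@": "a"})


def join_wrapped(text):
    """Remove line breaks so a word split by line wrapping is contiguous."""
    return text.replace("\r", "").replace("\n", "")


def undo_substitutions(text):
    """Map the look-alike characters back to the letters they imitate."""
    return text.translate(LEET)


wrapped = "stri\nngoflette\nrs"     # word split by line wrapping
obfuscated = "str1ng0f|etters"      # word with substituted characters

print(bool(RULE.search(wrapped)))                         # False
print(bool(RULE.search(join_wrapped(wrapped))))           # True
print(bool(RULE.search(join_wrapped(obfuscated))))        # False
print(bool(RULE.search(undo_substitutions(obfuscated))))  # True
```

In this sketch the two problems need two different treatments, which is why a rule that matches text 'as is' and a tokenizing step are kept apart here.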
