> > As an outsider, I find myself strongly agreeing with Motoharu-san that,
> > when dealing with at least the oriental multibyte languages,
> > tokenization
> > belongs early in the stream, before both bayes and rules.
> >
> I'm not sure I understand why.

It amounts to a form of obfuscation.  He mentioned someplace that "words" do
not normally come with natural wordbreaks, so a rule to catch a given word
is /stringofletters/.  But because text gets linewrapped, you might end up
with stri\nngoflette\nrs in the body, which the desired rule will fail to
match.
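To make the failure mode concrete, a tiny Perl sketch (the pattern and
sample strings are made up):

    my $rule = qr/stringofletters/;
    my $unwrapped = "xxstringoflettersxx";
    my $wrapped   = "xxstri\nngoflette\nrsxx";  # same text after linewrapping
    print $unwrapped =~ $rule ? "match\n" : "no match\n";  # match
    print $wrapped   =~ $rule ? "match\n" : "no match\n";  # no match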

I have fought precisely this sort of obfuscation trick from spammers in
English, where the only way to catch something is with a 'full' rule
(because the message is html and they deliberately line-break to make
'rawbody' rules useless).  All the spammers have to do is break every email
at a different letter position and it becomes virtually impossible to write
a rule that can catch the pattern. Spammers do this.
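For anyone unfamiliar with the rule types, a sketch of what this looks like
in a SpamAssassin .cf file (rule names are invented):

    # rawbody sees the undecoded body roughly line by line, so a
    # deliberate mid-word linebreak defeats a plain pattern:
    rawbody  LOCAL_CHEAP_MEDS_RAW   /cheap meds/i
    # full sees the entire raw message as one string, so the rule
    # itself has to allow for a break at every position:
    full     LOCAL_CHEAP_MEDS_FULL  /ch\s*ea\s*p\s*me\s*ds/i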

The solution in English might be a way to match against the text with all
spaces eliminated.  Michael is of course violently opposed to such an idea
and believes that obfuscated rules are the correct solution.  I disagree
with his position as an absolute; but that is beside the point here.

In oriental languages where you normally *do not have* wordbreaks, having
them inserted arbitrarily simply due to line wrapping is itself obfuscation,
and preserving those breaks merely makes it harder or impossible to write
rules that will catch spam.  At a minimum, it would seem that the parsing
code for oriental languages needs to do the equivalent of tr/\n//d to
eliminate the linebreaks.
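In Perl that could be as simple as this sketch (the variable and sample
text are illustrative, not SpamAssassin's actual internals):

    use utf8;
    my $text = "未承諾\n広告です";   # wrapped mid-"word" by the sender's MUA
    $text =~ tr/\n//d;               # now "未承諾広告です"; rules can match again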

However, this merely ends up with stringofletters.  Since this really does
represent words, and there are techniques to decompose it into a string of
words (seemingly relatively well), it seems to me a rational person might
want to actually write rules based on words rather than stringofletters.
Without a tokenizer this would not be possible.
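Here is a toy sketch of the idea (not a real segmenter, and not
Motoharu-san's code): greedy longest-match against a tiny hypothetical
lexicon, inserting spaces so word-based rules have something to anchor on:

    use utf8;
    my %lexicon = map { $_ => 1 } ('未承諾', '広告');  # hypothetical entries
    sub tokenize {
        my ($s) = @_;
        my @words;
        while (length $s) {
            my $hit;
            for my $len (reverse 1 .. length $s) {       # longest match first
                my $w = substr($s, 0, $len);
                if ($lexicon{$w}) { $hit = $w; last }
            }
            $hit = substr($s, 0, 1) unless defined $hit; # unknown char passes through
            push @words, $hit;
            $s = substr($s, length $hit);
        }
        return join ' ', @words;
    }
    # tokenize('未承諾広告') => '未承諾 広告', so a rule like /\b広告\b/ can fire

A real Japanese tokenizer would presumably use a morphological analyzer
such as ChaSen or MeCab rather than a lexicon hack, but the principle is
the same.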

A simplistic way of looking at it might be this: ask yourself if the \b
token has any value in spam rules.  If the answer is 'yes', ask yourself how
it will work with no breaks between words, or worse, with arbitrary breaks
between AND WITHIN words.
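Concretely:

    "spam mail"   =~ /\bmail\b/   # matches: the space provides the boundary
    "spammail"    =~ /\bmail\b/   # fails: no break between the words
    "spam ma\nil" =~ /\bmail\b/   # fails: an arbitrary break *within* the word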


> Currently, Bayes is the only code that actually *uses* knowledge of how a
> string is tokenized into words; this isn't exposed to the rules at all.

This isn't even slightly true!  Virtually every rule written against English
spam is in some way concerned with word breaks.  In some cases in
obfuscation rules the rule may be concerned with ignoring word breaks.  In
many cases like /you have already won!/i there are implicit word breaks in
the rule.  Other rules use \b to require word breaks and prevent erroneous
matches.  If breaks were completely arbitrary, the language would be nigh
unto unreadable, and virtually all existing rules would fail!
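Both kinds are easy to see in ordinary rules (names invented):

    # implicit breaks: the literal spaces in the pattern are wordbreaks
    body LOCAL_ALREADY_WON  /you have already won!/i
    # explicit breaks: \b keeps this from firing on "wonder" or "wont"
    body LOCAL_WON          /\bwon\b/i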


> I don't understand these paragraphs :(

I was saying that the tokenizer ought to be early in the path, but should be
avoided when not needed or appropriate.

        Loren
