https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7135
Bug ID: 7135
Summary: Bayes tokenizer 'arbitrarily' breaks multibyte CJK
utf-8 characters into digrams instead of breaking on
UTF-8 character boundaries
Product: Spamassassin
Version: 3.4.0
Hardware: All
OS: All
Status: NEW
Severity: enhancement
Priority: P2
Component: Plugins
Assignee: [email protected]
Reporter: [email protected]
In the 'bayes: token' debug logging for mail messages written in
Far Eastern character sets, the log often shows many entries like:
bayes: token '8:i�' => 0.00795...
The code section in Bayes.pm that does this is:
  if (TOKENIZE_LONG_8BIT_SEQS_AS_TUPLES && $token =~ /[\xa0-\xff]{2}/) {
    # Matt sez: "Could be asian? Autrijus suggested doing character ngrams,
    # but I'm doing tuples to keep the dbs small(er)." Sounds like a plan
    # to me! (jm)
    while ($token =~ s/^(..?)//) {
      push (@rettokens, "8:$1");
    }
    next;
  }
So it seems that 3- or 4-byte UTF-8 sequences representing CJK
characters or special punctuation are just 'arbitrarily' chopped
into pairs of octets, regardless of the boundaries between
characters. For example, the last octet of one character can form
a pair with the first octet of the next character, or an arbitrary
pair of adjacent octets (a substring) from the 3- or 4-byte UTF-8
encoding of a single character is treated as a token.
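To make the effect concrete, here is a minimal sketch that runs the same
s/^(..?)// loop over the raw UTF-8 octets of the three-character string
"日本語". The helper names chop_into_pairs and hex_token are made up for
this illustration and are not part of Bayes.pm; the tokens are printed as
hex so the octet boundaries are visible.

```perl
use strict;
use warnings;

# Hypothetical helper (not in Bayes.pm): applies the same s/^(..?)// loop
# as the excerpt above and returns the resulting "8:" tokens.
sub chop_into_pairs {
  my ($token) = @_;
  my @tokens;
  while ($token =~ s/^(..?)//) {
    push @tokens, "8:$1";
  }
  return @tokens;
}

# Hypothetical helper: renders the octets after the prefix as hex.
sub hex_token {
  my ($prefix, $octets) = split /:/, $_[0], 2;
  return "$prefix:" . unpack('H*', $octets);
}

# U+65E5 U+672C U+8A9E as raw UTF-8 octets: three characters,
# three octets each (E6 97 A5, E6 9C AC, E8 AA 9E).
my $token = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

print join(' ', map { hex_token($_) } chop_into_pairs($token)), "\n";
# prints: 8:e697 8:a5e6 8:9cac 8:e8aa 8:9e
```

The second token (a5 e6) joins the last octet of the first character to
the first octet of the second, and none of the five tokens is a complete
character.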
This seems far from ideal. It's like taking pairs of bytes from
Base64 encoding and hoping to get a good representation of the
original encoded message.
So I suggest adding the following code section just before the
code section mentioned above:
  if (TOKENIZE_LONG_8BIT_SEQS_AS_UTF8_CHARS && $token =~ /[\x80-\xBF]{2}/) {
    # only collect 3- and 4-byte UTF-8 sequences, ignore 2-byte sequences
    my(@t) = $token =~ /( (?: [\xE0-\xEF] | [\xF0-\xF4][\x80-\xBF] )
                          [\x80-\xBF]{2} )/xsg;
    if (@t) {
      push (@rettokens, map('u8:'.$_, @t));
      next;
    }
  }
It only collects well-formed 3- or 4-octet UTF-8 sequences (judged by
the ranges of their lead and continuation octets) from long tokens
containing 8-bit characters, much as the original code section
does, but it observes character boundaries.
This covers characters from CJK character sets, punctuation
characters, the Euro symbol, etc., but does not trigger on Western
character sets, which are mostly represented as 2-byte UTF-8
sequences.
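As a rough check of which sequences the proposed regex picks up, the
sketch below applies it to a token that mixes a 2-byte, a 3-byte and a
4-byte UTF-8 sequence. The helper name long_utf8_seqs is hypothetical and
exists only for this example; it returns the captured sequences as hex.

```perl
use strict;
use warnings;

# Hypothetical helper (not in Bayes.pm): returns, as hex strings, the
# 3- and 4-octet sequences captured by the regex from the proposal.
sub long_utf8_seqs {
  my ($token) = @_;
  my @t = $token =~ /( (?: [\xE0-\xEF] | [\xF0-\xF4][\x80-\xBF] )
                       [\x80-\xBF]{2} )/xsg;
  return map { unpack('H*', $_) } @t;
}

# e-acute (C3 A9), Euro sign (E2 82 AC), U+1F600 (F0 9F 98 80)
my $token = "\xC3\xA9\xE2\x82\xAC\xF0\x9F\x98\x80";

print join(' ', long_utf8_seqs($token)), "\n";
# prints: e282ac f09f9880
```

The 2-byte sequence C3 A9 is skipped, while the Euro sign and the 4-byte
character each come out as one whole token. A token made up only of
2-byte sequences yields an empty list.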
If no valid long UTF-8 byte sequences are found, it falls back to
the existing code, which just chops the string into byte pairs.
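The combined behaviour, proposal first and byte pairs as the fallback,
could be sketched as follows. This is only an illustration under
assumptions: tokenize_8bit and hex_token are hypothetical names, the two
compile-time constants are treated as enabled, and the surrounding
Bayes.pm loop (with its "next" statements and @rettokens) is replaced by
plain return values.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the two code sections in sequence; assumes
# both TOKENIZE_LONG_8BIT_SEQS_AS_* constants are true.
sub tokenize_8bit {
  my ($token) = @_;
  # Proposed section: whole 3- and 4-octet UTF-8 sequences.
  if ($token =~ /[\x80-\xBF]{2}/) {
    my @t = $token =~ /( (?: [\xE0-\xEF] | [\xF0-\xF4][\x80-\xBF] )
                         [\x80-\xBF]{2} )/xsg;
    return map { "u8:$_" } @t if @t;
  }
  # Existing section: chop into pairs of octets.
  if ($token =~ /[\xa0-\xff]{2}/) {
    my @pairs;
    while ($token =~ s/^(..?)//) {
      push @pairs, "8:$1";
    }
    return @pairs;
  }
  # Neither section applies; hand the token back unchanged.
  return ($token);
}

# Hypothetical helper: renders the octets after the prefix as hex.
sub hex_token {
  my ($prefix, $octets) = split /:/, $_[0], 2;
  return "$prefix:" . unpack('H*', $octets);
}

# U+65E5 U+672C U+8A9E in UTF-8: handled by the proposed section.
my $cjk = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";
print join(' ', map { hex_token($_) } tokenize_8bit($cjk)), "\n";
# prints: u8:e697a5 u8:e69cac u8:e8aa9e

# Four ISO-8859-1 accented letters: no continuation-octet pair, so the
# existing byte-pair section handles them.
my $latin1 = "\xE9\xE8\xE0\xF9";
print join(' ', map { hex_token($_) } tokenize_8bit($latin1)), "\n";
# prints: 8:e9e8 8:e0f9
```

Using a separate 'u8:' prefix keeps the new whole-character tokens
distinct from the '8:' byte-pair tokens produced by the fallback.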