Kent Tong wrote, on 30. jan 2007 15:55:
> After:
> 1) disabling the "chain" feature.
> 2) scrapping the existing DB.
> 3) training it using hundreds of ham and hundreds of spam
>    simultaneously.
> It becomes quite accurate (99%). However, it's too early to tell as
> spamassassin is pre-classifying the mails. So dspam may be just
> relying on the SPAM: tag inserted into the subject by spamassassin.

Quite, as one would expect. (For goodness' sake, leave SpamAssassin out
of the equation.) I was on the point of commenting earlier, but trashed
all my answers because so many people "knew better". There were a couple
of rubbish answers, such as the claim that all Asian language scripts
use conjoined words, which is patently not the case and has nothing to
do with Chinese, Japanese or your findings.
I can neither speak nor read Chinese (2 scripts) or Japanese. But as I
understand it (I've been told), each "character" is a (beautifully
scribed) word. Now Merriam-Webster, the standard American English
dictionary, claims 249,000 *words* in its online dictionary.
Given that a word may consist of anywhere from 1 to umpteen (let's say
12) characters, that would give a possible 249000^12 =
56805732602641806617660372001000000000000000000000000000000000000
tokens, roughly 5.7 x 10^64 (maths experts, correct me), whilst, on the
same basis, either Chinese or Japanese would give only 249000 tokens.
So what I was going to write was that training dspam for Chinese (or
Japanese) would need
228135472299766291637190249000000000000000000000000000000000 trainings
(that is 249000^11, roughly 2.3 x 10^59) to get into the same league as
Western characters. And even that presumes that we're talking about the
same Western character set.
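
For anyone who wants to check the arithmetic, here is a minimal sketch in
Python. It takes the premise above at face value (a vocabulary of 249000
raised to a maximum length of 12); the names VOCABULARY, MAX_LENGTH and
token_space are made up for illustration and have nothing to do with
dspam itself.

```python
# Figures quoted in the post; both are the post's assumptions, not dspam settings.
VOCABULARY = 249_000   # word count attributed to Merriam-Webster
MAX_LENGTH = 12        # "umpteen (let's say 12)"


def token_space(vocabulary: int, length: int) -> int:
    """Number of possible tokens under the post's vocabulary**length premise."""
    return vocabulary ** length


# Python integers have arbitrary precision, so these values are exact.
western = token_space(VOCABULARY, MAX_LENGTH)   # 249000^12
ideographic = token_space(VOCABULARY, 1)        # one "character" per word
ratio = western // ideographic                  # 249000^11

print("Western-style tokens:", western)
print("Chinese/Japanese tokens:", ideographic)
print("Ratio between the two:", ratio)
print("Digits in each:", len(str(western)), len(str(ratio)))
```

The ratio is simply 249000^11, which is where the second long number
comes from.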
That's not all of it, since no idiot spammer, whether Chinese, Japanese
or Western, is going to use a vocabulary of 249000 words: more likely
3000 at most, including all occluded shit. But even so ...
--Tonni
--
Tony Earnshaw
Email: tonni at hetnet dot nl