Chinese (and other Asian languages written without spaces between words, such as Japanese) will never be caught well with DSPAM. This is because these languages do not use spaces to mark word boundaries (and don't really have any easily identifiable word boundaries). When DSPAM tries to train on a message in one of these languages, it basically treats the entire message as a single token and will therefore only catch that specific message again.
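To make the tokenization problem concrete, here is a rough Python sketch. It is not DSPAM's actual tokenizer, just a simplified whitespace splitter that I'm assuming behaves similarly for this purpose, next to a character-bigram splitter of the kind often used for unsegmented text. The function names and sample strings are made up for illustration.

```python
def whitespace_tokens(text):
    """Split on whitespace, as a simplified stand-in for a word tokenizer."""
    return text.split()


def char_bigrams(text):
    """Return overlapping two-character tokens, ignoring whitespace."""
    chars = "".join(text.split())
    return [chars[i:i + 2] for i in range(len(chars) - 1)]


# Hypothetical sample messages (the Chinese one means roughly
# "get free coupons").
english = "get free coupons now"
chinese = "免费获取优惠券"

print(whitespace_tokens(english))  # four separate word tokens
print(whitespace_tokens(chinese))  # the whole string comes back as one token
print(char_bigrams(chinese))       # six overlapping two-character tokens
```

With whitespace splitting, the Chinese string only ever matches an identical string. With bigrams, two different messages that share a phrase will share some tokens, which gives a statistical filter something to learn from.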
Now, it's not all bad: a lot of the techniques that make DSPAM effective against image spam still apply, since in both cases you're really just working off the mail headers. So the moral of the story is that you'll get some accuracy using DSPAM, but nowhere near what it gets on space-delimited languages (English, Russian). If you're interested in using a Bayesian engine for Asian languages, I would suggest checking out SpamBayes. It's not as mature and won't give you the performance or accuracy on space-delimited languages that DSPAM does, but it does use bigrams, which should theoretically give you much better accuracy on Asian languages.

On 1/26/07, Odhiambo Washington <[EMAIL PROTECTED]> wrote:
* On 26/01/07 19:38 +0800, Kent Tong wrote:
| Odhiambo Washington wrote:
| >* On 26/01/07 17:41 +0800, Kent Tong wrote:
| >| Hi,
| >|
| >| I'm pilot testing dspam and is training it. It detects spam in English
| >| or Russian quite well, but it almost always fails to detects spam in
| >| Chinese. I've fed it with about 2,000 ham in my mail box and corrected
| >| about 200 missed spams (false positive), but it doesn't seem to be
| >| improving.
| >|
| >| Does anyone have good experience with Chinese spam?
| >
| >How much spam (esp chinese) have you trained it with?
|
| Below is the stats:
|
| dspam_stats -H [EMAIL PROTECTED]
| [EMAIL PROTECTED]:
|     TP True Positives:        648
|     TN True Negatives:        290
|     FP False Positives:         0
|     FN False Negatives:       202
|     SC Spam Corpusfed:         89
|     NC Nonspam Corpusfed:    1359
|     TL Training Left:         851
|     SHR Spam Hit Rate      76.24%
|     HSR Ham Strike Rate:    0.00%
|     OCA Overall Accuracy:  82.28%
|
| Among those false negatives, at least 50% are Chinese. So, at least 100
| Chinese spam have been fed to dspam as errors.

Train DSPAM! Train!!! It's not a human being. That is why training is
required.

-Wash
