>> The source distribution includes sb_dbexpimp.py, which can convert
>> the database to/from a CSV file.
>
> it sounds like i should change the code and compile it and so one.

No, I didn't say anything about changing the code. You simply install
Python, download the SpamBayes source, and run a command-line tool.

> But i haveno such bis skiles in programming. Is there another
> way to it?

Not one that doesn't involve downloading Python & using the command
line. It seems pretty unlikely that there's anyone who would need to
see the raw token counts but is unable to use command-line tools.

> Here is the Spam Clue of one of this email.
> This Email had only the subject,Test Mail and noting else in the body
> (but the email provider add automaticly a advertistment to the end
> of the
> email)

What are you trying to achieve here? Unless you've trained on a lot of
spam like this (i.e. "Test" or "Mail" in the subject and an ad-only
body), SpamBayes isn't going to classify it as spam. (And, in fact,
you have trained a message with the subject "Test Mail" as *ham*!)

There are a few spammy clues in the message - particularly the URLs -
but there are many more ham clues, particularly:

> token                    spamprob    #ham  #spam
> 'auf'                    0.0238095   9     0
> 'bei'                    0.0348837   6     0
> 'noch'                   0.0348837   6     0
> 'ein'                    0.0505618   4     0
> 'gruss'                  0.0918367   2     0

These have never been seen in spam, and were in the body of the
message, so were presumably in the advertisement. If the advertisement
doesn't change with each message, that means that you've never trained
a message with the advertisement as spam - so SpamBayes is absolutely
correct in classifying the message as ham.

(As an aside: SpamBayes was created, for the most part, by English
speakers. The process should still work in other whitespace-delimited
languages, but there may be a few issues. For example, SpamBayes
ignores any tokens that are fewer than 3 characters long - which
includes 'worthless' English words like "a", "be", "to", "my", and so
on. However, many of these words are longer in German, so perhaps
performance would be better with a lower limit of 4 (or maybe too much
useful information would be lost then).
It would need experimentation to know for sure).

> 'allein?'                0.155172    1     0
> 'aufnehmen'              0.155172    1     0
> 'beliebteste'            0.155172    1     0
> 'bye'                    0.155172    1     0
> 'date!'                  0.155172    1     0
> 'kontakt'                0.155172    1     0
> 'messagegruss'           0.155172    1     0
> 'schnell'                0.155172    1     0
> 'singles'                0.155172    1     0
> 'subject:Test Mail'      0.155172    1     0
> 'url:114845986687261'    0.155172    1     0
> 'url:11512'              0.155172    1     0
> 'url:singles'            0.155172    1     0
> 'warten'                 0.155172    1     0

These have all been seen in a single ham message and no spam. They are
enough (with the others) to counter the few URL spam clues.

Does this make things any more clear? (I'm still not really sure what
you are trying to do.)

=Tony.Meyer

-- 
Please always include the list (spambayes at python.org) in your
replies (reply-all), and please don't send me personal mail about
SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.
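
P.S. In case it helps to see where the spamprob numbers in the clue
list come from: they are consistent with a Robinson-style adjusted
probability in which a token's observed spam ratio is blended with a
neutral prior. The sketch below is an illustration, not the SpamBayes
source. It assumes a prior strength of 0.45 and a prior probability of
0.5 (values chosen because they reproduce the figures quoted above;
nothing in this thread confirms which options are actually in use),
and the function name and the training totals of 100 ham and 100 spam
are made-up placeholders. For a token never seen in spam the totals
cancel out anyway.

```python
# Assumed prior settings; chosen because they reproduce the quoted table.
UNKNOWN_WORD_STRENGTH = 0.45  # "s": how strongly the prior is weighted
UNKNOWN_WORD_PROB = 0.5       # "x": probability given to an unseen token


def spamprob(hamcount, spamcount, nham, nspam,
             s=UNKNOWN_WORD_STRENGTH, x=UNKNOWN_WORD_PROB):
    """Adjusted spam probability for one token (hypothetical helper).

    hamcount/spamcount: trained ham/spam messages containing the token.
    nham/nspam: total number of trained ham/spam messages.
    """
    n = hamcount + spamcount
    if n == 0:
        return x  # never seen: fall back to the prior
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    prob = spamratio / (hamratio + spamratio)
    # Blend the observed ratio with the prior, weighted by evidence count.
    return (s * x + n * prob) / (s + n)


if __name__ == "__main__":
    # Ham counts taken from the clue list; every token there has 0 spam.
    for hamcount in (9, 6, 4, 2, 1):
        print(hamcount, format(spamprob(hamcount, 0, 100, 100), ".6g"))
```

The point of the blend is that a token seen in one ham message and no
spam gets 0.155172 rather than 0: a single message is weak evidence,
so the score moves only part of the way from 0.5. Training the
advertisement text as spam would raise spamcount for these tokens and
pull their scores back up.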
