Re: Bayes scoring
Daniel Lemke wrote:
>
> andrij wrote:
>>
>> I ran the Bayes classifier on more than 4500 e-mails. All except about
>> 100 of them contained test=BAYES_*. Does anybody have any idea why these
>> 100 e-mails were not scored by the Bayes classifier?
>>
>
> Do you have any shortcircuit enabled?
>

No. I am experimenting with the Bayes and RelayCountry plugins. I have
enabled only the Bayes, RelayCountry, and Check plugins and the Bayes rules.

Daniel Lemke wrote:
>
> Could you post a raw example of one of those mails, not scored by bayes?
>

I cannot; I would have to ask the owner of the e-mails first. I tried with
separate databases of spam and ham e-mails. Interestingly, it happened only
with the database of ham e-mails.

--
View this message in context: http://old.nabble.com/Bayes-scoring-tp29324885p29325278.html
Sent from the SpamAssassin - Users mailing list archive at Nabble.com.
Bayes scoring
Hi all,

I ran the Bayes classifier on more than 4500 e-mails. All except about 100
of them contained test=BAYES_*. Does anybody have any idea why these 100
e-mails were not scored by the Bayes classifier?

At http://www.paulgraham.com/spam.html, it is written that "When new mail
arrives, it is scanned into tokens, the most interesting fifteen tokens,
..., are used to calculate the probability that the mail is spam". How many
tokens does SA's Bayes classifier use to calculate the probability that a
mail is spam or ham?

Thanks a lot.

--
View this message in context: http://old.nabble.com/Bayes-scoring-tp29324885p29324885.html
DB tokens expiration
Hi all,

after I trained the Bayes classifier with several thousand e-mails, I ran
"sa-learn --dump magic" and got the following:

  0.000          0          3          0  non-token data: bayes db version
  0.000          0       5367          0  non-token data: nspam
  0.000          0       3792          0  non-token data: nham
  0.000          0     344519          0  non-token data: ntokens
  0.000          0  847133240          0  non-token data: oldest atime
  0.000          0 1274448689          0  non-token data: newest atime
  0.000          0          0          0  non-token data: last journal sync atime
  0.000          0 1280569532          0  non-token data: last expiry atime
  0.000          0    2764800          0  non-token data: last expire atime delta
  0.000          0     196817          0  non-token data: last expire reduction count

I have the default settings, i.e., "bayes_expiry_max_db_size 15" and
"bayes_auto_expire 1".

Why was the number of ntokens not reduced to 15?

"last expiry atime" is greater than "newest atime". Does that mean that a
reduction is about to occur?

Does "last expire reduction count" mean that at time 1280569532 the number
of tokens will be reduced by 196817?

If I do not add any new tokens (so that "newest atime" does not change),
will the reduction never occur?

Thank you.

--
View this message in context: http://old.nabble.com/DB-tokens-expiration-tp29324703p29324703.html
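The atime values in the dump are easier to reason about as dates. The sketch below parses the dump text from the message above and converts the atime fields to UTC. It assumes that the third column holds the value, that the atimes are Unix epoch seconds, and that the run-together "02764800" in the archived message was originally the two columns "0" and "2764800". The helper names are made up for the example.

```python
from datetime import datetime, timezone

# Dump copied from the message above (columns separated by whitespace).
DUMP = """\
0.000 0 3 0 non-token data: bayes db version
0.000 0 5367 0 non-token data: nspam
0.000 0 3792 0 non-token data: nham
0.000 0 344519 0 non-token data: ntokens
0.000 0 847133240 0 non-token data: oldest atime
0.000 0 1274448689 0 non-token data: newest atime
0.000 0 0 0 non-token data: last journal sync atime
0.000 0 1280569532 0 non-token data: last expiry atime
0.000 0 2764800 0 non-token data: last expire atime delta
0.000 0 196817 0 non-token data: last expire reduction count
"""

MARKER = "non-token data:"


def parse_magic(dump):
    """Return {field name: integer value} for each 'non-token data' line."""
    values = {}
    for line in dump.splitlines():
        if MARKER not in line:
            continue
        columns, _, name = line.partition(MARKER)
        # Assumption: the third whitespace-separated column is the value.
        values[name.strip()] = int(columns.split()[2])
    return values


def as_utc(epoch_seconds):
    """Interpret a value as Unix epoch seconds and return a UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


if __name__ == "__main__":
    magic = parse_magic(DUMP)
    for field in ("oldest atime", "newest atime", "last expiry atime"):
        print(f"{field}: {as_utc(magic[field]).isoformat()}")
    # The delta is a duration in seconds, so report it in days.
    print("last expire atime delta (days):",
          magic["last expire atime delta"] / 86400)
```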
Re: RelayCountry plugin
RW-15 wrote:
>
>> Does Bayes learn the tokens from the X-Spam-Relay-Country header?
>
> Contrary to popular belief, the country codes are not used by Bayes.
>
>> I think that it does not, because all headers "X-Spam-" are removed
>> before learning, right?
>
> That's not the reason. The plugin does make the data available to Bayes
> as the metadata from which X-Spam-Relay-Country is created,
>

That will work fine for the scoring phase: SpamAssassin processes an e-mail
with the RelayCountry plugin, the plugin stores the country information
internally as metadata, and the Bayes classifier then uses this metadata to
score the e-mail. Right?

Is this metadata stored permanently within a processed e-mail? I am asking
because of the following scenario. Let's say I have databases of spam and
ham e-mails that were not processed with the RelayCountry plugin. If I run
sa-learn, will these e-mails be processed with the RelayCountry plugin
before being tokenized? I assume not (am I right?). Hence, I first need to
process the databases with the RelayCountry plugin and then use sa-learn to
train the Bayes classifier. However, if the metadata from the first step is
not stored permanently within the e-mails, the Bayes classifier will not
learn it, right?

RW-15 wrote:
>
> but it consists entirely of two letter country-codes, and Bayes doesn't
> tokenize anything under 3 characters.
>

Thank you for this very useful information and the patch! Would it not be
enough just to prepend something to the country code in RelayCountry.pm to
make it longer, like "$cc = "Code" . $cc;"?

--
View this message in context: http://old.nabble.com/RelayCountry-plugin-tp29284940p29295643.html
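The length issue and the proposed prefix workaround can be illustrated with a toy filter. This is only a sketch of the idea; it is not SpamAssassin's tokenizer. The 3-character minimum is taken from RW-15's statement above, and the function names and sample codes are made up for the example.

```python
MIN_TOKEN_LEN = 3  # threshold stated in the thread: nothing under 3 characters


def keep_tokens(words, min_len=MIN_TOKEN_LEN):
    """Drop candidate tokens shorter than min_len (toy model of the filter)."""
    return [w for w in words if len(w) >= min_len]


def prefix_country_codes(codes, prefix="Code"):
    """Lengthen two-letter codes, mirroring the '$cc = "Code" . $cc;' idea."""
    return [prefix + cc for cc in codes]


if __name__ == "__main__":
    relay_countries = ["US", "CN"]  # hypothetical relay countries
    print(keep_tokens(relay_countries))                        # codes are dropped
    print(keep_tokens(prefix_country_codes(relay_countries)))  # prefixed codes survive
```

Whether a prefix alone is sufficient in the real plugin depends on how the metadata reaches the tokenizer, which this sketch does not model.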
RelayCountry plugin
Hi all,

I am experimenting with the RelayCountry plugin. I have a small database of
e-mails. I processed these e-mails with the RelayCountry plugin, so every
e-mail contains an X-Spam-Relay-Country header (with the corresponding
countries). Now I want to train Bayes with these e-mails.

Does Bayes learn the tokens from the X-Spam-Relay-Country header? I think
that it does not, because all "X-Spam-" headers are removed before
learning, right?

How can I tell SA to add X-Relay-Country (which I assume will not be
removed during the training phase) instead of X-Spam-Relay-Country?

Thank you.

--
View this message in context: http://old.nabble.com/RelayCountry-plugin-tp29284940p29284940.html
Re: Bayes classifier
Bowie Bailey wrote:
>>>> 3) Evaluating whether an email is spam or not. Does the bayes classifier
>>>> analyze headers if I have, for example, the following rule: "body
>>>> BAYES_05 eval:check_bayes('0.00', '0.05')". According to the
>>>> http://wiki.apache.org/spamassassin/WritingRules : "Body rules also
>>>> include the Subject as the first line of the body content". So, any
>>>> headers that precede subject header are not considered by the bayes
>>>> classifier?
>>> I don't have an answer for you here, but just another question. Why do
>>> you want to mess with the bayes rules?
>> Maybe I am mistaken, but what is the sense to train the bayes classifier
>> on headers if headers (at least those that precede a subject header) are
>> not considered during the spam detection phase?
>
> Bayes learns based on the entire message -- headers and all.
> (Otherwise, what would be the point of the bayes_ignore_header option?)
>
> I can see where you might get that impression by looking at the rule,
> but if I understand it correctly, Bayes has already been run and the
> rule is just checking the result.
>

Thank you for the clarification. The word "body" at the beginning of the
rule confused me. So, in general, it does not matter which word ("body" or
"header") is put there: the Bayes classifier analyzes both the headers
(except those excluded by bayes_ignore_header) and the body during both the
learning and the scoring phases. Right?

--
View this message in context: http://old.nabble.com/Bayes-classifier-tp29264841p29269574.html
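Bowie's point, that the rule only checks a result computed earlier, can be sketched as a simple range test. This is an illustration of the idea, not SpamAssassin's code: the Python function merely borrows the name check_bayes from the rule quoted above, and the boundary handling (inclusive lower bound, exclusive upper bound) is an assumption.

```python
def check_bayes(probability, low, high):
    """Return True if an already-computed Bayes probability lies in [low, high).

    The probability is assumed to have been computed beforehand from the
    whole message; this check never looks at headers or body itself.
    """
    return low <= probability < high


# Range taken from the rule quoted in the thread:
#   body BAYES_05 eval:check_bayes('0.00', '0.05')
BAYES_05_RANGE = (0.00, 0.05)

if __name__ == "__main__":
    for prob in (0.03, 0.50):  # hypothetical classifier outputs
        print(prob, check_bayes(prob, *BAYES_05_RANGE))
```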
Re: Bayes classifier
>> 2) Evaluating whether an email is spam or not. Again, if I set
>> "bayes_ignore_header Some-header", will the bayes classifier ignore the
>> header while evaluating an e-mail?
>
> Yes. That's what it's for.
>

So, the Bayes classifier will ignore "Some-header" in both the learning and
the spam detection phases. Did I understand that correctly?

>> 3) Evaluating whether an email is spam or not. Does the bayes classifier
>> analyze headers if I have, for example, the following rule: "body
>> BAYES_05 eval:check_bayes('0.00', '0.05')". According to the
>> http://wiki.apache.org/spamassassin/WritingRules : "Body rules also
>> include the Subject as the first line of the body content". So, any
>> headers that precede subject header are not considered by the bayes
>> classifier?
>
> I don't have an answer for you here, but just another question. Why do
> you want to mess with the bayes rules?
>

Maybe I am mistaken, but what is the point of training the Bayes classifier
on headers if headers (at least those that precede the Subject header) are
not considered during the spam detection phase?

Thank you.

--
View this message in context: http://old.nabble.com/Bayes-classifier-tp29264841p29266978.html
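The effect of bayes_ignore_header discussed above can be pictured as one filter applied before tokenization in both phases. This is a toy model, not SpamAssassin's code; the function name, the sample headers, and the case-insensitive name matching are assumptions made for the example.

```python
def strip_ignored_headers(headers, ignored):
    """Return the (name, value) header pairs whose name is not ignored.

    headers: list of (name, value) tuples from a parsed message.
    ignored: header names configured to be ignored (hypothetical list).
    Name matching is case-insensitive here, which is an assumption.
    """
    ignored_lower = {name.lower() for name in ignored}
    return [(name, value) for name, value in headers
            if name.lower() not in ignored_lower]


if __name__ == "__main__":
    message_headers = [
        ("Subject", "Hello"),
        ("Some-header", "value added by a local filter"),
    ]
    # The same filtered view would feed both learning and scoring.
    print(strip_ignored_headers(message_headers, ["Some-header"]))
```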
Bayes classifier
Hi all,

I am new to SpamAssassin and the Bayes classifier. I have several questions
and would greatly appreciate your help with them.

1) Training the Bayes classifier with _multipart_ e-mails (e.g., an e-mail
that contains other e-mails within its body). If I set "bayes_ignore_header
Some-header", will the Bayes classifier ignore (while learning) the header
"Some-header" in the nested messages as well?

2) Evaluating whether an e-mail is spam or not. Again, if I set
"bayes_ignore_header Some-header", will the Bayes classifier ignore the
header while evaluating an e-mail?

3) Evaluating whether an e-mail is spam or not. Does the Bayes classifier
analyze headers if I have, for example, the following rule: "body BAYES_05
eval:check_bayes('0.00', '0.05')"? According to
http://wiki.apache.org/spamassassin/WritingRules : "Body rules also include
the Subject as the first line of the body content". So, are any headers
that precede the Subject header not considered by the Bayes classifier?

Thanks for the help.

--
View this message in context: http://old.nabble.com/Bayes-classifier-tp29264841p29264841.html