Re: Bayes scoring

2010-08-02 Thread andrij


Daniel Lemke wrote:
> 
> 
> andrij wrote:
>> 
>> I ran the Bayes classifier on more than 4500 e-mails. All but about 100
>> of them contained test=BAYES_*. Does anybody have any idea why these
>> 100 e-mails were not scored by the Bayes classifier?
>> 
> 
> Do you have any shortcircuit enabled?
> 

No. I am experimenting with the Bayes and RelayCountry plugins. I have enabled
only the Bayes, RelayCountry, and Check plugins, plus the Bayes rules.


Daniel Lemke wrote:
> 
> Could you post a raw example of one of those mails, not scored by bayes?
> 

I cannot; I would have to ask the owner of the e-mails first. I tested with
separate databases of spam and ham e-mails. Interestingly, it happened only
with the database of ham e-mails.

-- 
View this message in context: 
http://old.nabble.com/Bayes-scoring-tp29324885p29325278.html
Sent from the SpamAssassin - Users mailing list archive at Nabble.com.



Bayes scoring

2010-08-02 Thread andrij

Hi all,

I ran the Bayes classifier on more than 4500 e-mails. All but about 100 of
them contained test=BAYES_*. Does anybody have any idea why these 100
e-mails were not scored by the Bayes classifier?

At http://www.paulgraham.com/spam.html, it is written that "When new mail
arrives, it is scanned into tokens, the most interesting fifteen tokens,
..., are used to calculate the probability that the mail is spam". How many
tokens does SA's Bayes classifier use to calculate the probability
that the mail is spam/ham?
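
For reference, the scheme described in the quoted essay can be sketched in a
few lines of Python. This illustrates only Graham's "fifteen most interesting
tokens" rule; it makes no claim about how SA's Bayes classifier selects or
combines tokens, and the function name and sample probabilities are made up
for the example.

```python
from math import prod


def graham_combine(token_probs, n_interesting=15):
    """Combine per-token spam probabilities as described in Graham's essay.

    token_probs: iterable of floats strictly between 0 and 1, one per token.
    n_interesting: how many tokens to keep (15 in the essay).
    """
    # "Interesting" means farthest from the neutral value 0.5.
    ranked = sorted(token_probs, key=lambda p: abs(p - 0.5), reverse=True)
    top = ranked[:n_interesting]
    if not top:
        return 0.5  # assumption: no evidence means a neutral score

    spam_side = prod(top)
    ham_side = prod(1.0 - p for p in top)
    return spam_side / (spam_side + ham_side)


# Hypothetical per-token probabilities, not taken from any real database.
sample = [0.99, 0.95, 0.50, 0.48, 0.10, 0.80]
print(round(graham_combine(sample, n_interesting=3), 4))
```

Tokens close to 0.5 are never selected while more extreme ones are available,
so a message's score is driven by a small number of strongly spammy or
strongly hammy tokens.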

Thanks a lot. 



DB tokens expiration

2010-08-02 Thread andrij

Hi all,

After I trained the Bayes classifier with several thousand e-mails, I ran
"sa-learn --dump magic" and got the following:

0.000          0          3          0  non-token data: bayes db version
0.000          0       5367          0  non-token data: nspam
0.000          0       3792          0  non-token data: nham
0.000          0     344519          0  non-token data: ntokens
0.000          0  847133240          0  non-token data: oldest atime
0.000          0 1274448689          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0 1280569532          0  non-token data: last expiry atime
0.000          0    2764800          0  non-token data: last expire atime delta
0.000          0     196817          0  non-token data: last expire reduction count

I have the default settings set, i.e., "bayes_expiry_max_db_size 15" and
"bayes_auto_expire 1". 

Why was the number of ntokens not reduced to 15?

"last expiry atime" is greater than "newest atime". Does it mean that
a reduction is about to occur?

Does "last expire reduction count" mean that at time 1280569532 the number
of tokens will be reduced by 196817?

If I do not add any new token (so the "newest atime" will not change) the
reduction will never occur?
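
The atime values in the dump are Unix timestamps in seconds, so they are
easier to reason about once converted to dates. A small Python sketch (the
variable and function names are mine; the numbers are copied from the
"sa-learn --dump magic" output):

```python
from datetime import datetime, timezone

# Values copied from the "sa-learn --dump magic" output.
magic = {
    "oldest atime": 847133240,
    "newest atime": 1274448689,
    "last expiry atime": 1280569532,
}
expire_delta_seconds = 2764800  # "last expire atime delta"


def to_utc(epoch_seconds):
    """Interpret an atime as seconds since the Unix epoch (UTC)."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


for name, value in magic.items():
    print(f"{name:>18}: {to_utc(value):%Y-%m-%d %H:%M:%S} UTC")

print(f"expire atime delta: {expire_delta_seconds / 86400:g} days")
```

Read this way, the oldest token atime falls in 1996, the newest in May 2010,
and the "last expiry atime" at the end of July 2010, about ten weeks after the
newest token. The delta is exactly 32 days. The word "last" suggests these
fields describe an expiry run that has already happened rather than one that
is scheduled.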

Thank you.



Re: RelayCountry plugin

2010-07-29 Thread andrij


RW-15 wrote:
> 
>> Does Bayes learn the tokens from the X-Spam-Relay-Country header? 
> 
> Contrary to popular belief, the country codes are not used by Bayes.
> 
>> I think that it does not, because all headers "X-Spam-" are removed
>> before learning, right?
> 
> That's not the reason. The plugin does make the data available to Bayes
> as the metadata from which X-Spam-Relay-Country is created,
> 

That will work fine for the scoring phase - spamassassin processes an e-mail
with the RelayCountry plugin, the RelayCountry plugin stores country
information (internally) in metadata, then the bayes classifier uses these
metadata to score the e-mail. Right?

Are these metadata stored (permanently) within a processed e-mail? I am
asking about that because of the following scenario.

Let's say I have databases of spam and ham e-mails which were not processed
with the RelayCountry plugin. If I run sa-learn, will these e-mails be
processed with the RelayCountry plugin before being tokenized? I assume
not (am I right?). Hence, I first need to process the databases with the
RelayCountry plugin, and then use sa-learn to train the Bayes classifier.
However, if the metadata (from the first step) are not stored permanently
within the e-mails, the Bayes classifier will not learn these metadata,
right?


RW-15 wrote:
> 
> but it consists entirely of two letter country-codes, and Bayes doesn't
> tokenize anything under 3 characters.
> 

Thank you for this very useful information and the patch! Would it not be
enough to add something to the country code in RelayCountry.pm to make it
longer, like "$cc = "Code" . $cc;"?
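
To illustrate why two-letter codes vanish and why a prefix could help, here is
a toy Python tokenizer with a minimum token length of three characters. It is
only a sketch of the idea: the function names, the whitespace splitting, and
the sample header value are assumptions for the example and do not reproduce
SpamAssassin's actual tokenizer or RelayCountry.pm.

```python
MIN_TOKEN_LEN = 3  # per the thread: nothing under 3 characters is tokenized


def toy_tokenize(header_value, min_len=MIN_TOKEN_LEN):
    """Split on whitespace and drop tokens shorter than min_len."""
    return [tok for tok in header_value.split() if len(tok) >= min_len]


def prefix_codes(header_value, prefix="Code"):
    """Lengthen each country code, mirroring the '$cc = "Code" . $cc' idea."""
    return " ".join(prefix + cc for cc in header_value.split())


relay_countries = "US DE CN"  # hypothetical X-Spam-Relay-Country value

print(toy_tokenize(relay_countries))                # every code is dropped
print(toy_tokenize(prefix_codes(relay_countries)))  # prefixed codes survive
```

Whether a prefix alone is enough in the real plugin depends on where the
length check is applied, which this toy example does not model.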



RelayCountry plugin

2010-07-28 Thread andrij

Hi all,

I am experimenting with the RelayCountry plugin. I have a small database of
e-mails. I processed these e-mails with the RelayCountry plugin, so every
e-mail contains an X-Spam-Relay-Country header (with the corresponding
countries).

Now I want to train Bayes with these emails.

Does Bayes learn the tokens from the X-Spam-Relay-Country header? I think
that it does not, because all headers "X-Spam-" are removed before learning,
right?

How can I tell SA to add X-Relay-Country (which I assume will not be removed
during the training phase) instead of X-Spam-Relay-Country?

Thank you.



Re: Bayes classifier

2010-07-26 Thread andrij



Bowie Bailey wrote:
> 
>>>> 3) Evaluating whether an email is spam or not. Does the bayes
>>>> classifier analyze headers if I have, for example, the following rule:
>>>> "body BAYES_05 eval:check_bayes('0.00', '0.05')". According to the
>>>> http://wiki.apache.org/spamassassin/WritingRules : "Body rules also
>>>> include the Subject as the first line of the body content". So, any
>>>> headers that precede subject header are not considered by the bayes
>>>> classifier?
>>> I don't have an answer for you here, but just another question.  Why do
>>> you want to mess with the bayes rules?
>> Maybe I am mistaken, but what is the point of training the Bayes
>> classifier on headers if headers (at least those that precede the Subject
>> header) are not considered during the spam detection phase?
> 
> Bayes learns based on the entire message -- headers and all. 
> (Otherwise, what would be the point of the bayes_ignore_header option?)
> 
> I can see where you might get that impression by looking at the rule,
> but if I understand it correctly, Bayes has already been run and the
> rule is just checking the result.
> 

Thank you for the clarification. The word "body" at the beginning of the rule
confused me. So, in general it does not matter which word ("body" or
"header") is put there -- the Bayes classifier analyzes both headers (except
those named in bayes_ignore_header) and body during both learning and
scoring phases. Right?
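
Bowie's point can be pictured with a small Python sketch: the classifier
produces one probability for the whole message, and a rule such as BAYES_05
only tests whether that number falls in its range. The function names, the
second rule, and the boundary handling (lower bound inclusive, upper bound
exclusive) are assumptions for illustration, not SpamAssassin's code.

```python
def check_bayes(probability, low, high):
    """Return True if an already-computed probability lies in [low, high).

    The interval boundaries are an assumption made for this sketch.
    """
    return low <= probability < high


def rule_hits(probability, rules):
    """Evaluate every range rule against the same precomputed probability."""
    return [name for name, (low, high) in rules.items()
            if check_bayes(probability, low, high)]


# Only BAYES_05 and its range come from the thread; the second rule is a
# hypothetical name added to show a non-matching range.
rules = {
    "BAYES_05": (0.00, 0.05),
    "EXAMPLE_HIGH": (0.95, 1.01),
}

# The probability would come from the classifier, which has already seen
# headers and body; the rule type ("body") plays no part in computing it.
print(rule_hits(0.02, rules))
```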




Re: Bayes classifier

2010-07-26 Thread andrij



>> 2) Evaluating whether an email is spam or not. Again, if I set
>> "bayes_ignore_header Some-header", will the bayes classifier ignore the
>> header while evaluating an e-mail?
> 
> Yes.  That's what it's for.
> 

So, the Bayes classifier will ignore "Some-header" in both the learning and spam
detection phases. Did I understand it correctly?
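
If the header is ignored in both phases, as the reply suggests, the behaviour
amounts to filtering it out before tokenization on whichever path the message
takes. A rough Python sketch of that idea, using the standard email package;
the helper name and the sample message are hypothetical, and this does not
reproduce SpamAssassin's internals:

```python
from email import message_from_string

# Stands in for the config line "bayes_ignore_header Some-header".
IGNORED_HEADERS = {"some-header"}


def headers_for_bayes(raw_message, ignored=IGNORED_HEADERS):
    """Return (name, value) pairs for the headers a classifier would still see."""
    msg = message_from_string(raw_message)
    return [(name, value) for name, value in msg.items()
            if name.lower() not in ignored]


raw = (
    "Subject: hello\n"
    "Some-header: tracking-id-123\n"
    "From: someone@example.com\n"
    "\n"
    "body text\n"
)

# The same filter would be applied whether the message is learned or scored.
print(headers_for_bayes(raw))
```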



>> 3) Evaluating whether an email is spam or not. Does the bayes classifier
>> analyze headers if I have, for example, the following rule: "body
>> BAYES_05 eval:check_bayes('0.00', '0.05')". According to the
>> http://wiki.apache.org/spamassassin/WritingRules : "Body rules also
>> include the Subject as the first line of the body content". So, any
>> headers that precede subject header are not considered by the bayes
>> classifier?
> 
> I don't have an answer for you here, but just another question.  Why do
> you want to mess with the bayes rules?
> 

Maybe I am mistaken, but what is the point of training the Bayes classifier on
headers if headers (at least those that precede the Subject header) are not
considered during the spam detection phase?

Thank you.



Bayes classifier

2010-07-26 Thread andrij

Hi all,

I am new to SpamAssassin and its Bayes classifier. I have several questions and
would greatly appreciate your help with them.

1) Training the Bayes classifier with _multipart_ e-mails (e.g., an e-mail
that contains other e-mails within its body). If I set
"bayes_ignore_header Some-header", will the Bayes classifier ignore (while
learning) the header "Some-header" in the nested messages as well?

2) Evaluating whether an email is spam or not. Again, if I set
"bayes_ignore_header Some-header", will the bayes classifier ignore the
header while evaluating an e-mail?

3) Evaluating whether an email is spam or not. Does the bayes classifier
analyze headers if I have, for example, the following rule: "body BAYES_05
eval:check_bayes('0.00', '0.05')". According to the
http://wiki.apache.org/spamassassin/WritingRules : "Body rules also include
the Subject as the first line of the body content". So, any headers that
precede subject header are not considered by the bayes classifier?

Thanks for the help.