I looked through the source, and it looks like the NaN value in the
header is calculated as
   (p / (p + np))
where p  ==> [p *= (token rating)]
      np ==> [np *= 1.0 - (token rating)]

which, to me, indicates a token rating outside the 0.0 - 1.0 range
arising during the training period. That sounds odd; you may want to
enable debug, rebuild James, and re-process some of your spam emails.
For debugging, I would capture the value put into the map in
addTokenOccurances() around line 400:

                target.put(token, value);
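
For what it's worth, here is a minimal Java sketch of how that
combining step can end up as NaN. It is not the James code: the names
NanDemo, combine and isValidRating are made up for illustration, and I
am assuming np accumulates (1.0 - rating) the same way p accumulates
the rating. It shows two possible routes to NaN: a rating that is
itself NaN, and p and np both underflowing to 0.0 when many extreme
ratings are multiplied together.

```java
import java.util.Arrays;

class NanDemo {
    // Assumed combining step: p multiplies the ratings together,
    // np multiplies their complements together.
    static double combine(double[] ratings) {
        double p = 1.0;
        double np = 1.0;
        for (double r : ratings) {
            p *= r;
            np *= (1.0 - r);
        }
        // 0.0 / 0.0 or anything involving NaN yields NaN here.
        return p / (p + np);
    }

    // Hypothetical debug check: true only for a real number in [0.0, 1.0].
    static boolean isValidRating(double r) {
        return !Double.isNaN(r) && r >= 0.0 && r <= 1.0;
    }

    public static void main(String[] args) {
        // A handful of sane ratings gives a sane probability.
        System.out.println(combine(new double[] {0.99, 0.99, 0.01}));

        // A single NaN rating poisons the whole result.
        System.out.println(combine(new double[] {0.5, Double.NaN}));

        // Many extreme ratings: p and np both underflow to 0.0,
        // so the final division is 0.0 / 0.0.
        double[] extreme = new double[400];
        Arrays.fill(extreme, 0, 200, 0.01);
        Arrays.fill(extreme, 200, 400, 0.99);
        System.out.println(combine(extreme));

        // An out-of-range rating is caught by the debug check.
        System.out.println(isValidRating(1.5));
    }
}
```

A check along the lines of isValidRating(value), logged just before
target.put(token, value), would show whether a bad rating ever reaches
the map.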

It is also interesting that you said all of your messages contain the
same header values:

> X-Spam-Score: -2.6
> X-Spam-Report: -2.6 BAYES_00               BODY: Bayesian spam probability is 0 to 1%
>         [score: 0.0000]

If these headers were present during the spam training, i.e. 100% of
your example spam emails contain these tokens, perhaps they are
tainting the token ratings?
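
One way to test that guess would be to strip those headers from the
sample messages before feeding them to training. A rough sketch,
assuming the raw message is available as a String; SpamHeaderStripper
and stripSpamHeaders are hypothetical names, not part of James:

```java
class SpamHeaderStripper {
    // Hypothetical helper: drops X-Spam-* header lines, plus their folded
    // continuation lines, from the header block of a raw message.
    // Line endings in the result are normalised to "\n".
    static String stripSpamHeaders(String rawMessage) {
        String[] lines = rawMessage.split("\\r?\\n", -1);
        StringBuilder out = new StringBuilder();
        boolean inHeaders = true;   // until the first blank line
        boolean skipping = false;   // inside a header being dropped
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i];
            if (inHeaders) {
                if (line.isEmpty()) {
                    // A blank line ends the header block.
                    inHeaders = false;
                    skipping = false;
                } else if (line.startsWith(" ") || line.startsWith("\t")) {
                    // Folded continuation: shares the fate of its header.
                    if (skipping) {
                        continue;
                    }
                } else {
                    // New header line: drop it if it is an X-Spam-* header.
                    skipping = line.regionMatches(true, 0, "X-Spam-", 0, 7);
                    if (skipping) {
                        continue;
                    }
                }
            }
            out.append(line);
            if (i < lines.length - 1) {
                out.append("\n");
            }
        }
        return out.toString();
    }
}
```

Retraining on a few messages cleaned this way and comparing the
resulting token ratings should show whether the headers matter.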

Idea + wild guess = HTH

Kent


-----Original Message-----
From: David Legg [mailto:david.l...@searchevent.co.uk] 
Sent: Tuesday, February 10, 2009 1:57 PM
To: James Users List
Subject: Re: How do I reduce SPAM


> Here is what I see as well (on ALL messages):
>
> X-MessageIsSpamProbability: NaN
> X-MessageIsSpam: true
>   

Mmm... OK.  Well, as you may know, 'NaN' is short for 'Not a Number' in
floating-point speak.  So something has caused the spam probability
calculation to produce a result that Java can't represent as a number,
such as 0.0 divided by 0.0.

I've seen one or two of my own messages with this value... but not all
of them.

Tell me... do your emails contain lots of images?  I've noticed in the
past that the Bayesian filter will quite happily chomp its way through
all the image data and treat it as if it were text.  If you had lots of
this type of email I could believe it might effectively poison the
corpus.

I'm beginning to clutch at straws now as I don't know what else to
suggest...  Anybody else got any ideas?

Regards,
David Legg


---------------------------------------------------------------------
To unsubscribe, e-mail: server-user-unsubscr...@james.apache.org
For additional commands, e-mail: server-user-h...@james.apache.org

