Hello Christopher,

Friday, August 27, 2004, 6:32:48 PM, you wrote:

CB> Hi List,

CB> I installed bayesit 0.5.5 about a week ago. I trained it up
CB> on the folder of junk mail I had collected, and on the non-junk.
CB> Then every morning, I do a manual 'mark as junk'. To date, my junk
CB> folder contains 1667 messages. The total from my non-junk folders
CB> is about 3000. 

CB> Here is some info from the plugin:

<snips info>

I think you've not trained it well enough on what ISN'T spam.

When I first trained BayesIt, it still had the ability to scan folders
when installed. RitLabs asked for this to be removed, as you can't
scan folders in SecureBat! that way - and they wanted the BayesIt
plugin to work for The Bat! and SecureBat! alike.

So now you have to train it yourself. This, IMHO, is not made clear.

What's made even less clear is that BayesIt needs to know both what's
spam and non-spam. Before you train it, you need to make a good effort
to clean up all your folders of spam - move any spam to a makeshift
junk folder. Then go into each "non-junk" folder, and train
BayesIt on which mail ISN'T spam. Then train it on what is spam from
your makeshift junk folder by marking all its contents as spam.

I even marked my Sent mails as not-junk when training, on the
egotistical presumption that nobody writes the kind of email I'd like
to receive better than myself! *grins*

OK, so that's not quite true - my emails can be bad. But they do
contain the kind of keywords that I wouldn't mind receiving, so they
do make good stuff for marking as non-junk.

Remember that BayesIt works statistically. It compares the contents of
messages against both your definition of "normal" mail AND your
definition of spam. Without both definitions, you may well get poorer
results. I didn't realise this at first - I just marked
mail as junk. Only when I marked mail as non-junk did I get anywhere -
filtering then worked just fine!
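
To make the statistical point concrete, here is a minimal sketch in
Python of a generic naive Bayes word filter. It is an illustration
only, not BayesIt's actual algorithm: the function names, the tiny
sample messages, the add-one smoothing and the equal spam/non-spam
priors are all assumptions made for the example. It scores the same
incoming message twice, once with only spam trained and once with
both kinds of mail trained.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def train(messages):
    """Count, for each word, how many of the messages contain it."""
    counts = Counter()
    for message in messages:
        counts.update(set(tokenize(message)))
    return counts


def spam_score(message, spam_counts, n_spam, ham_counts, n_ham):
    """Return P(spam | words) under naive Bayes, assuming equal priors.

    Add-one smoothing keeps unseen words from giving zero probabilities.
    Only words present in the message are considered.
    """
    log_odds = 0.0
    for word in set(tokenize(message)):
        p_spam = (spam_counts[word] + 1) / (n_spam + 2)
        p_ham = (ham_counts[word] + 1) / (n_ham + 2)
        log_odds += math.log(p_spam / p_ham)
    return 1 / (1 + math.exp(-log_odds))


# Hypothetical training mail, invented for this example.
spam = ["buy cheap pills now", "cheap pills for you now", "you win money now"]
ham = ["are you coming to the meeting now",
       "the bat filter question for you",
       "meeting notes for the list"]

spam_counts, ham_counts = train(spam), train(ham)
incoming = "cheap pills now"

# Spam-only training: the non-spam side is an empty dictionary.
spam_only = spam_score(incoming, spam_counts, len(spam), Counter(), 0)
# Both sides trained.
both = spam_score(incoming, spam_counts, len(spam), ham_counts, len(ham))

print(f"spam-only training:       {spam_only:.2f}")
print(f"spam + non-spam training: {both:.2f}")
```

Run as written, the incoming message scores about 0.70 with spam-only
training and about 0.95 once non-spam is trained too. With nothing to
compare against, every word looks only mildly suspicious, so a filter
with a strict cut-off could plausibly let such a message through.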


Here are my statistics for the BayesIt plugin:

  BayesIt! filter information
  Antispam filtering data:

  Spam frequency dictionary:
    * C:\...\spamdict.bye
    * Size: 15376 letters.
    * Capacity: 390982 words.
  Non-spam frequency dictionary:
    * C:\...\nspamdict.bye
    * Size: 24251 letters.
    * Capacity: 451531 words.
  Current active base:
    * Active current base contains 137896 words.
    * Status: OK

Note that my figures for non-spam are the larger ones (size 24251
against 15376, capacity 451531 against 390982). I now get very
little spam sneaking past BayesIt - just the newish ASCII art spams,
and the classic "empty HTML message with a picture of the text". Both
are understandably hard to filter, so I have no problems with this. I
get one a day at most anyway. ;-)

One caveat with this explanation, though - I receive much more
"legitimate" email than spam, because I'm on mailing lists like this
one. If I were on no lists and got very little legitimate email, I
suppose the stats could conceivably be the other way around. You want
the stats to at least reflect the general direction of the
spam/non-spam ratio in your mail flow, I suppose...

CB> Bayesit has yet to find a junk message on its own. The
CB> 'general numbers' are obviously wrong...Is that the problem? Do
CB> the dictionary sizes look right?

I can't possibly say for certain, but they don't look like mine. And
my BayesIt plugin is working satisfactorily. Therefore, I can only
humbly suggest that you aspire to match my figures, using the methods
I have outlined above. ;-)


-- 
Best regards,
 Philip                            mailto:[EMAIL PROTECTED]

Using The Bat! v2.12.00 on Windows XP 5.1 Build  2600
Service Pack 1


________________________________________________
Current version is 2.12.00 | 'Using TBUDL' information:
http://www.silverstones.com/thebat/TBUDLInfo.html
