Re: Public SA Corpus

2004-10-12 Thread Thomas Bolioli
Gerry Doris wrote:
> I managed to destroy my bayes database...don't ask.
>
> Since I only run a home system and don't receive a heavy flow of spam, I'd
> really like to skip the wait for bayes to get up to speed.  Is it
> recommended to use the public corpus on the SA website or is it too old
> for proper training?  Is there a better source of ham/spam to be used for
> training?
>
> Gerry

The public spam db should be broad enough for you in the interim,
although I just checked and it is a little long in the tooth (circa
2/2003). Spam is in large part generic these days, so a public/generic
corpus could get you up and going quickly. As time goes by, the older
spam will be retired and replaced with what is coming in. Don't bother
with public ham, though. Feeding it ham should be up to you. If you get
that little spam, then you should have no problem training it on that side.
On a side note, I have a 55K-message spam database from email addresses
used in the music industry, environmental and educational markets (not
to mention /. ;-}), so it should have a broad reach. It has been culled of
all viruses and mailing list mail. It could make a decent analysis corpus
for those who want it. Also, Gerry, if you want, I can forward along or
post the most recent spam, about 2-5K worth, for you to train on. That
should be all you need.
Tom
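
The split suggested above (spam from a public corpus, ham from your own mail) can be scripted. What follows is a minimal Python sketch, not a tested recipe: the helper names and the example paths are hypothetical, and it assumes `sa-learn` is on the PATH and accepts its usual `--spam`, `--ham` and `--mbox` options. By default it only builds the command so you can inspect it before running anything.

```python
import subprocess


def build_sa_learn_command(kind, path, mbox=False):
    """Build the argument list for one sa-learn training run.

    kind: "spam" or "ham".
    path: message file, directory, or mbox file (hypothetical paths below).
    mbox: pass True when `path` is an mbox file rather than a directory.
    """
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    cmd = ["sa-learn", "--" + kind]
    if mbox:
        cmd.append("--mbox")
    cmd.append(path)
    return cmd


def train(kind, path, mbox=False, dry_run=True):
    """Return the command; execute it only when dry_run is False."""
    cmd = build_sa_learn_command(kind, path, mbox)
    if not dry_run:
        # Raises CalledProcessError if sa-learn exits with an error.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Hypothetical locations: public spam as an mbox, your own ham in a Maildir.
    print(train("spam", "public-corpus/spam.mbox", mbox=True))
    print(train("ham", "Maildir/cur"))
```

Passing `dry_run=False` actually runs the training; keeping ham limited to your own mail preserves the personalization that makes Bayes useful.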



RE: Public SA Corpus

2004-10-11 Thread Matthew.van.Eerde
Gerry Doris wrote:
> I managed to destroy my bayes database...don't ask.
> 
> Since I only run a home system and don't receive a heavy flow of spam, I'd
> really like to skip the wait for bayes to get up to speed.  Is it
> recommended to use the public corpus on the SA website or is it too old
> for proper training?  Is there a better source of ham/spam to
> be used for training?

Bayes' effectiveness lies in its personalization.  There's little value in
training it on a public corpus.

I assume you didn't back up your Bayes database... and that you've deleted all
your spam so you can't retrain it...

Best thing to do is just wait for it to rebuild itself.  By default, Bayes is
not used for scoring until it has learned at least 200 spam and 200 ham
messages.  I believe you can tweak these thresholds if you REALLY want to start
auto-tagging sooner... but you void the warranty if you do...
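
To make the threshold rule concrete, here is a small Python sketch of the gate. It is an illustration only, not SpamAssassin's own code: the function names are hypothetical, and the defaults of 200 simply mirror the values mentioned above (in SpamAssassin's configuration they correspond to the `bayes_min_spam_num` and `bayes_min_ham_num` settings).

```python
# Illustrative sketch only; not SpamAssassin's actual implementation.
DEFAULT_MIN_SPAM = 200
DEFAULT_MIN_HAM = 200


def bayes_is_active(nspam, nham,
                    min_spam=DEFAULT_MIN_SPAM, min_ham=DEFAULT_MIN_HAM):
    """Return True once enough spam AND enough ham have been learned."""
    return nspam >= min_spam and nham >= min_ham


def messages_still_needed(nspam, nham,
                          min_spam=DEFAULT_MIN_SPAM, min_ham=DEFAULT_MIN_HAM):
    """Return (spam_needed, ham_needed) before the gate opens."""
    return max(0, min_spam - nspam), max(0, min_ham - nham)


if __name__ == "__main__":
    # Plenty of spam learned, but ham is one short: the gate stays closed.
    print(bayes_is_active(500, 199))
    print(messages_still_needed(150, 20))
```

Both counts have to clear their minimum, so a pile of public spam alone does not switch Bayes on; the ham count still has to catch up.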


Public SA Corpus

2004-10-11 Thread Gerry Doris
I managed to destroy my bayes database...don't ask.

Since I only run a home system and don't receive a heavy flow of spam, I'd
really like to skip the wait for bayes to get up to speed.  Is it
recommended to use the public corpus on the SA website or is it too old
for proper training?  Is there a better source of ham/spam to be used for
training?


Gerry