on Sun Jul 29 2007, skip-AT-pobox.com wrote:
> Amedee> I have the same experience:
>
> Amedee> [EMAIL PROTECTED] { ~ }$ ./spamstats
> Amedee> Spam: 2415 Ham: 651
>
> Amedee> That's 3.7:1, and it's increasing.
>
> One of the reasons I can keep a nearly 1:1 ratio is that when it gets a bit
> out of whack I simply delete some old spam. In my experience the nature of
> spam changes over time while the nature of ham rarely does. I also use
> train-to-exhaustion which only trains in fixed ratios.
No longer. There's the --unbalanced option. Also, I've been using
this very simple patch, which, instead of insanely barreling ahead
with the ratio specified even if the corpora are closer to 1:1,
reverts to using the ratio in the corpora. Thus the ratio parameter
becomes a ratio /limit/ and, along with using --reverse, the oldest
spam that falls outside the limit tends to be ignored.
Index: tte.py
===================================================================
--- tte.py (revision 3156)
+++ tte.py (working copy)
@@ -114,10 +114,11 @@
         hambone_ = list(reversed(hambone_))
         spamcan_ = list(reversed(spamcan_))
 
+    nspam,nham = len(spamcan_),len(hambone_)
     if ratio:
         rspam,rham = ratio
-    else:
-        rspam,rham = len(spamcan_),len(hambone_)
+        if (rspam > rham) == (rspam * nham > rham * nspam):
+            rspam,rham = nspam,nham
 
     # define some indexing constants
     ham = 0
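
To spell out what the new condition does, here is a standalone sketch of it. The function name effective_ratio is made up for illustration and is not part of tte.py. It only covers the case where a ratio was actually given, and it takes the corpus sizes as plain integers.

```python
def effective_ratio(ratio, nspam, nham):
    """Return the (spam, ham) ratio to train with, mirroring the patched test.

    ratio is the requested (rspam, rham) pair; nspam and nham are the
    corpus sizes. Hypothetical helper, not a function in tte.py.
    """
    rspam, rham = ratio
    # rspam * nham > rham * nspam is rspam/rham > nspam/nham without
    # division: the requested ratio is more spam-heavy than the corpora.
    if (rspam > rham) == (rspam * nham > rham * nspam):
        # The requested ratio leans further than the corpora do,
        # so fall back to the corpus sizes themselves.
        return nspam, nham
    # Otherwise the requested ratio acts as the limit.
    return rspam, rham


# Corpus sizes from the quoted spamstats output (roughly 3.7:1).
print(effective_ratio((3, 1), 2415, 651))  # (3, 1)
print(effective_ratio((4, 1), 2415, 651))  # (2415, 651)
```

One thing to be aware of: the comparison rspam > rham is strict, so a requested 1:1 does not count as spam-heavy. Against a spam-heavy corpus like the one above, the expression as posted then returns the corpus sizes rather than 1:1.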
--
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com
The Astoria Seminar ==> http://www.astoriaseminar.com
_______________________________________________
http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html