On 04/27/2013 10:59 AM, Jari Fredriksson wrote:
27.04.2013 04:54, Karsten Bräckelmann wrote:
And it is good advice to keep the initial training corpora to a
ratio roughly resembling your ham/spam ratio, or maybe 1/1. (At this
point, we're approaching voodoo. Learning 10 times more ham than spam is
most likely to be a bad choice, though.)
I don't see any problem with having a corpus like this:

0.000          0      28252          0  non-token data: nspam
0.000          0     187579          0  non-token data: nham

I have no problems with Bayes whatsoever.

How many users? How many domains?
That can hardly be a heavily spammed setup, or it would look more like:

0.000          0    7762525          0  non-token data: nspam
0.000          0    4171794          0  non-token data: nham
(a week's worth of tokens)
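
For anyone who wants to compare their own numbers against the two dumps quoted above, here is a rough Python sketch that pulls nspam and nham out of lines in that format and computes the ham-per-spam ratio. The column layout is assumed from the quoted lines only (they look like `sa-learn --dump magic` output), and the function names are made up for illustration.

```python
import re

# Matches lines such as:
#   0.000          0      28252          0  non-token data: nspam
# Assumption: the third column is the message count, as in the dumps quoted above.
LINE_RE = re.compile(
    r"^\s*\S+\s+\S+\s+(\d+)\s+\S+\s+non-token data: (nspam|nham)\s*$"
)


def parse_counts(dump_text):
    """Return a dict like {'nspam': 28252, 'nham': 187579} (hypothetical helper)."""
    counts = {}
    for line in dump_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


def ham_per_spam(counts):
    """Number of learned ham messages per learned spam message."""
    return counts["nham"] / counts["nspam"]


# The two dumps quoted in this thread.
first_dump = """\
0.000          0      28252          0  non-token data: nspam
0.000          0     187579          0  non-token data: nham
"""
second_dump = """\
0.000          0    7762525          0  non-token data: nspam
0.000          0    4171794          0  non-token data: nham
"""

first_ratio = ham_per_spam(parse_counts(first_dump))
second_ratio = ham_per_spam(parse_counts(second_dump))

print(f"first dump:  {first_ratio:.2f} ham per spam")
print(f"second dump: {second_ratio:.2f} ham per spam")
```

The first dump works out to a bit more than 6.6 ham per spam, the second to roughly 0.54 ham per spam (about 1.9 spam per ham), so the two setups sit on opposite sides of the 1/1 ratio mentioned earlier.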



