On 04/27/2013 10:59 AM, Jari Fredriksson wrote:
On 27.04.2013 04:54, Karsten Bräckelmann wrote:
And it is good advice to keep the initial training corpora to a
ratio roughly resembling your ham/spam ratio, or maybe 1:1. (At this
point, we're approaching voodoo. Learning 10 times more ham than spam is
most likely a bad choice, though.)
I don't see any problem with having a corpus like this:
0.000 0 28252 0 non-token data: nspam
0.000 0 187579 0 non-token data: nham
I have no problems with Bayes whatsoever.
How many users? How many domains?
That can hardly be a heavily spammed setup, or it would look more like:
0.000 0 7762525 0 non-token data: nspam
0.000 0 4171794 0 non-token data: nham
(a week's worth of tokens)
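For anyone comparing their own database, the nspam/nham counts and their ratio can be pulled straight out of `sa-learn --dump magic` output. A minimal sketch, assuming the whitespace-separated line layout quoted above (the real dump may pad columns differently, but the regex tolerates that):

```python
import re

def bayes_ratio(dump_text):
    """Extract nspam/nham counts from `sa-learn --dump magic` output
    and return (nspam, nham, ham_per_spam)."""
    counts = {}
    for line in dump_text.splitlines():
        # Match: <count> <whitespace> <atime> 'non-token data: nspam|nham'
        m = re.search(r'(\d+)\s+\d+\s+non-token data: (nspam|nham)', line)
        if m:
            counts[m.group(2)] = int(m.group(1))
    nspam, nham = counts['nspam'], counts['nham']
    return nspam, nham, nham / nspam

# Sample lines from the smaller database quoted in this thread.
dump = """\
0.000 0 28252 0 non-token data: nspam
0.000 0 187579 0 non-token data: nham
"""
nspam, nham, ratio = bayes_ratio(dump)
print(nspam, nham, round(ratio, 1))  # roughly 6.6 ham per spam
```

By the 1:1-ish rule of thumb discussed above, a ratio of ~6.6 ham per spam is the kind of imbalance being warned about, though as noted it evidently works fine for some setups.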