27.04.2013 12:03, Axb wrote:
> On 04/27/2013 10:59 AM, Jari Fredriksson wrote:
>> 27.04.2013 04:54, Karsten Bräckelmann wrote:
>>> And it is good advice to keep the initial training corpora to a
>>> ratio roughly resembling your ham/spam ratio, or maybe 1/1. (At this
>>> point, we're approaching voodoo. Learning 10 times more ham than
>>> spam is most likely to be a bad choice, though.)
>> I don't see any problem with having a corpus like this:
>>
>> 0.000 0 28252 0 non-token data: nspam
>> 0.000 0 187579 0 non-token data: nham
>>
>> I have no problems with Bayes whatsoever.
>
> how many users? domains?
> Can hardly be a heavily spammed setup or it would look more like:
>
> 0.000 0 7762525 0 non-token data: nspam
> 0.000 0 4171794 0 non-token data: nham
> (a week's worth of tokens)
>

Only me for SPAM & HAM and my colleagues for spam. While I try to
collect spam wherever I can, the amount of spam has dropped big time
over the last couple of years. My boss seems to draw most of the spam
among my sources ;)

The ham "corpus" also contains many List-Id messages (mailing lists).
That means they are included in my Bayes training, but not in my ruleqa.
And I do skim through them, and move possible spam from them to my spam
corpus (not to ruleqa though).

-- 
For a light heart lives long.
		-- Shakespeare, "Love's Labour's Lost"