27.04.2013 12:03, Axb wrote:
> On 04/27/2013 10:59 AM, Jari Fredriksson wrote:
>> 27.04.2013 04:54, Karsten Bräckelmann wrote:
>>> And it is good advice to keep the initial training corpora to a
>>> ratio roughly resembling your ham/spam ratio, or maybe 1/1. (At this
>>> point, we're approaching voodoo. Learning 10 times more ham than
>>> spam is most likely to be a bad choice, though.)
>> I don't see any problem with having a corpus like this:
>>
>> 0.000          0      28252          0  non-token data: nspam
>> 0.000          0     187579          0  non-token data: nham
>>
>> I have no problems with Bayes whatsoever.
>
> how many users? domains?
> Can hardly be a heavily spammed setup or it would look more like:
>
> 0.000          0    7762525          0  non-token data: nspam
> 0.000          0    4171794          0  non-token data: nham
> (a week's worth of tokens)

Only me for spam and ham, plus my colleagues for spam. While I try to
collect spam wherever I can, the amount of spam has dropped big time
over the past couple of years. My boss seems to draw most of the spam
among my sources ;)

The ham "corpus" also contains a lot of mailing-list traffic (messages
with a List-Id header). That means they are included in my Bayes
training, but not in my ruleqa. I do skim through them and move any
spam I find to my spam corpus (though not to ruleqa).
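
For anyone who wants to compare their own numbers with the two dumps
quoted above, here is a minimal Python sketch that pulls the nspam and
nham counts out of lines in that format (the layout printed by
`sa-learn --dump magic`) and computes the ham/spam ratio. The function
names are my own, purely illustrative, and not part of SpamAssassin; the
sketch assumes the count is the third whitespace-separated field, as in
the lines shown.

```python
import re

# Matches lines such as:
#   0.000          0      28252          0  non-token data: nspam
# Assumption: the message count is the third numeric field.
LINE_RE = re.compile(
    r"^\s*\S+\s+\d+\s+(\d+)\s+\d+\s+non-token data: (nspam|nham)\s*$"
)


def parse_counts(dump_text):
    """Return {'nspam': int, 'nham': int} for the matching lines found."""
    counts = {}
    for line in dump_text.splitlines():
        # Drop e-mail quoting prefixes ("> ", ">> ") if present.
        line = line.lstrip("> ")
        match = LINE_RE.match(line)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


def ham_spam_ratio(counts):
    """Ham messages learned per spam message learned."""
    return counts["nham"] / counts["nspam"]


# The two dumps quoted earlier in this thread.
MY_DUMP = """\
0.000          0      28252          0  non-token data: nspam
0.000          0     187579          0  non-token data: nham
"""

AXB_DUMP = """\
> 0.000          0    7762525          0  non-token data: nspam
> 0.000          0    4171794          0  non-token data: nham
"""

if __name__ == "__main__":
    for label, dump in (("mine", MY_DUMP), ("Axb", AXB_DUMP)):
        counts = parse_counts(dump)
        print(label, counts, round(ham_spam_ratio(counts), 2))
```

With the figures above this gives roughly 6.64 ham per spam for my
database and roughly 0.54 for Axb's.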



-- 

For a light heart lives long.
                -- Shakespeare, "Love's Labour's Lost"

