Reindl Harald [mailto:h.rei...@thelounge.net] wrote:
> > However, that doesn't happen.
> >
> > 0.000 0 338770 0 non-token data: nspam
> > 0.000 0 1460807 0 non-token data: nham
>
> what do you expect when you train 4 times more ham than spam?
> frankly you "flooded" your bayes with 1.4 million ham samples, and I
> thought our 140k total corpus was large - don't forget that ham
> messages are typically larger than junk, which often just points you
> to a URL with a few words
>
> 108897 SPAM
> 31492 HAM

This is a production mail gateway that has been in service since 2015.
A number of messages (both ham and spam) have been learned
automatically by amavisd/spamassassin. Today's statistics:

 3616 autolearn=ham
10076 autolearn=no
 2817 autolearn=spam
  134 autolearn=unavailable

I think I have no control over what is learnt automatically.

Let's assume for a moment that the 1.4M ham samples are valid. Is there
a ham:spam ratio I should stick to? I presume that with a 1:1 ratio,
legitimate future messages would not be classified as spam either.

Regards,
Szabolcs
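
P.S. Correcting myself a bit: if I read the SpamAssassin docs right,
auto-learning can be tuned (or switched off) in local.cf via the
AutoLearnThreshold plugin options, something along these lines (the
threshold values below are only illustrative, not a recommendation):

  # local.cf
  bayes_auto_learn 1
  # learn a message as ham only if it scores below this
  bayes_auto_learn_threshold_nonspam 0.1
  # learn a message as spam only if it scores above this
  bayes_auto_learn_threshold_spam 12.0
  # or disable auto-learning entirely:
  # bayes_auto_learn 0

(Since the gateway runs amavisd, these presumably need to go into
whatever SpamAssassin configuration amavisd actually loads.)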
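
P.P.S. Worst case, I suppose I could wipe the Bayes database and
retrain it from a balanced corpus; a rough sketch (the corpus paths
here are made up):

  # drop all learned tokens and counts
  sa-learn --clear
  # retrain from curated maildirs
  sa-learn --spam /srv/corpus/spam
  sa-learn --ham /srv/corpus/ham
  # verify the new nspam/nham counts
  sa-learn --dump magic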