On Thu, 19 Mar 2015 01:12:15 +0100
Reindl Harald wrote:

> On 19.03.2015 at 00:54, RW wrote:
> > This has nothing to do with auto-learning. There is a difference
> > between mistraining and training with spam that contains
> > so-called "Bayes poison". Bayes is best trained on what is in
> > real-world spam, not what we would prefer that spammers put in spam
>
> it's the same - it is exactly the same, and it is not a matter of
> "what we would prefer that spammers put in spam" but of what they put
> in *in addition* to ruin bayes and filter results

They don't put it there to ruin Bayes, and they don't care about FP
rates. They put it there so their spam can take advantage of what they
guess has been trained as ham.

I was just looking at my recent spam, and Bayes-poison seems less
common than it used to be, but these things come in cycles.

> if you train only manually reviewed messages and don't recognize
> hidden poison (often three times more than the visible part, up to
> additional mime-parts dedicated to poison, with different crap at the
> end of the plain alternative as well as in hidden layers, span-tags
> and div-tags), *exactly* the same happens as with auto-learning

It actually ignores most of that.

> the point is "Bayes is best trained on what is in real-world spam",
> but not if the spam content is only a small part of the message,
> because at the same time you train the innocent parts as spam

But at delivery, all of the text will be scanned, not just the spam
content.

> the effect is visible:
>
> * there are more BAYES_00 hits than before
> * there are fewer BAYES_50 hits for ham than before
> * ALL of the cleaned messages still hit BAYES_99, and most BAYES_999
>
> the last point is easy to prove by keeping the old, unmodified corpus
> and running spamc against the cleaned bayes database, and the final
> result is that you stop training in circles because you need a ton of
> classified ham messages to reduce the poison impact

But you're testing mail that's already been trained into the database.
Even though you stripped the "Bayes-poison" when training, you'll have
left enough rare tokens from the headers and elsewhere to effectively
"fingerprint" that spam. It's pretty much inevitable that it hits
BAYES_99[9].

> if you have users from all over the world speaking different
> languages, the effect of bayes poisoning gets much more visible,
> because it contains random words in all sorts of languages and you
> don't have enough ham to reduce that damage

It sounds like you haven't learned enough ham. FWIW I do learn
"Bayes-poison" and still have >99% of ham hitting BAYES_00. The figure
has been rising over the years.

> believe it or not - my goal is to train a bayes database once and
> have a sane system over many many years - what i read often is "spam
> samples become outdated and so you need to restart" - no they don't,

You seem to be relying on most ham hitting BAYES_00, so the rest of the
mail can be treated very aggressively. This probably does make you less
reliant on an up-to-date spam corpus.
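
To illustrate the point about re-testing mail that has already been
trained, here is a rough sketch. It is a toy token classifier, not
SpamAssassin's actual Bayes implementation, and every token, header
value and address in it is made up. It trains one spam with the poison
stripped, then scores the original unstripped message and a fresh spam
that carries the same poison but different headers.

```python
import math
from collections import Counter


class ToyBayes:
    """Toy token classifier for illustration; NOT SpamAssassin's algorithm."""

    def __init__(self):
        self.spam, self.ham = Counter(), Counter()
        self.nspam = self.nham = 0

    def learn(self, tokens, is_spam):
        # Count each token once per message.
        (self.spam if is_spam else self.ham).update(set(tokens))
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

    def score(self, tokens):
        # Sum smoothed log-odds of known tokens; unseen tokens are skipped.
        logit = 0.0
        for tok in set(tokens):
            if tok not in self.spam and tok not in self.ham:
                continue
            s = (self.spam[tok] + 0.5) / (self.nspam + 1)
            h = (self.ham[tok] + 0.5) / (self.nham + 1)
            logit += math.log(s / h)
        return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical tokens: hammy "poison" words, rare header tokens, spam words.
poison = ["meeting", "project", "invoice", "lunch"]
headers = ["rcvd:203.0.113.7", "mailer:bulk9", "msgid:abc123"]
spam_words = ["pills", "cheap"]

db = ToyBayes()
for i in range(10):
    db.learn(poison + [f"name{i}"], is_spam=False)  # ham uses the poison words
db.learn(headers + spam_words, is_spam=True)        # spam trained with poison stripped

# Re-test the original, unstripped spam: its rare header tokens are already
# in the database, so they dominate the poison.
retest_score = db.score(headers + spam_words + poison)

# A fresh spam with the same poison but new headers has no such fingerprint.
fresh_score = db.score(["rcvd:198.51.100.9", "cheap"] + poison)

print(f"retest of trained spam: {retest_score:.4f}")
print(f"fresh spam, same poison: {fresh_score:.4f}")
```

In this toy setup the re-tested message scores close to 1 while the
fresh one scores low, which is why re-scanning the training corpus says
little about how the database handles new mail.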