On 19.03.2015 at 20:35, RW wrote:
> On Thu, 19 Mar 2015 01:12:15 +0100 Reindl Harald wrote:
>> On 19.03.2015 at 00:54, RW wrote:
>>> This has nothing to do with auto-learning. There is a difference between mistraining and training with spam that contains so-called "Bayes poison". Bayes is best trained on what is in real-world spam, not what we would prefer that spammers put in spam
>> it's the same - it is exactly the same, and it is not a matter of "what we would prefer that spammers put in spam" but of what they put in *additionally* to ruin bayes and filter results
> They don't put it there to ruin Bayes, they don't care about FP rates, they put it there so their spam can take advantage of what they guess has been trained as ham.
no, both

tests over 15000 spam examples prove it: after removing the poison, rebuilding bayes from the cleaned corpus and verifying the original messages, all of them still hit BAYES_99

but it affects your ham and so your FP rates over time
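
To make the two claims above concrete, here is a toy sketch in Python. It is not SpamAssassin's Bayes implementation: the smoothing, the naive log-odds combining, and all sample messages are assumptions made up for illustration. It only shows the mechanism under discussion: training the filler words as spam raises the score of ham that happens to use the same words, while the original poisoned spam still scores high against a database trained without the filler.

```python
import math
from collections import Counter

def train(messages):
    """Count, per token, the number of messages that contain it."""
    counts = Counter()
    for msg in messages:
        counts.update(set(msg.lower().split()))
    return counts

def spam_score(message, spam_counts, ham_counts, n_spam, n_ham):
    """Toy combiner: sum smoothed per-token log-odds, then squash to 0..1."""
    log_odds = 0.0
    for tok in set(message.lower().split()):
        p_spam = (spam_counts[tok] + 1) / (n_spam + 2)  # add-one smoothing
        p_ham = (ham_counts[tok] + 1) / (n_ham + 2)
        log_odds += math.log(p_spam / p_ham)
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical corpora; "garden weather holiday" stands in for the poison.
ham = ["meeting agenda for monday", "lunch on monday with the project team"]
spam_poisoned = ["cheap pills buy now garden weather holiday",
                 "buy cheap watches now garden holiday weather"]
spam_cleaned = ["cheap pills buy now", "buy cheap watches now"]

ham_counts = train(ham)
db_poisoned = train(spam_poisoned)
db_cleaned = train(spam_cleaned)

new_ham = "holiday weather was great in the garden"
ham_vs_poisoned = spam_score(new_ham, db_poisoned, ham_counts, 2, 2)
ham_vs_cleaned = spam_score(new_ham, db_cleaned, ham_counts, 2, 2)

# The original (still poisoned) spam, scored against the cleaned database.
orig_vs_cleaned = spam_score(spam_poisoned[0], db_cleaned, ham_counts, 2, 2)

print(ham_vs_poisoned > ham_vs_cleaned, orig_vs_cleaned > 0.9)
```

With so few messages the numbers themselves mean nothing; a real corpus with enough ham could dilute the effect, which is exactly the point in dispute.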
> I was just looking at my recent spam and Bayes-poison seems less common than it used to be, but these things come in cycles.
as most spam comes in cycles, hence auto expire is wrong

>> analyzing 15000 spam samples shows that *identical* messages sometimes contain poison and sometimes don't
>>
>> the effect is visible:
>> * BAYES_00 hits are more frequent than before
>> * BAYES_50 hits for ham are less frequent than before
>> * ALL of the cleaned messages still hit BAYES_99, and most hit BAYES_999
>>
>> the last point is easy to prove by keeping the old, unmodified corpus and running spamc against the cleaned bayes database. the final result is that you stop training in circles, where you need a ton of classified ham messages to reduce the poison impact
> But you're testing mail that's already been trained into the database. Even though you stripped the "Bayes-poison" when training, you'll have left enough rare tokens from the headers and elsewhere to effectively "fingerprint" that spam. It's pretty much inevitable that it hits BAYES_99[9].
you didn't get what i wrote
* i removed the poison and rebuilt bayes
* i verified the *original* junk, still containing the poison, against the new bayes
because i am not an idiot who verifies cleaned samples against a bayes built from the same contents
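
The check described above comes down to counting which BAYES_* rules fire when the original messages are scanned against the rebuilt database. A minimal sketch of that counting step in Python follows, assuming the scanned output carries the usual `X-Spam-Status` header with a `tests=` list. The helper names and the two sample messages are hypothetical, not taken from the 15000-message corpus.

```python
import re
from collections import Counter
from email import message_from_string

# Matches rule names such as BAYES_00, BAYES_50, BAYES_99, BAYES_999.
BAYES_RULE = re.compile(r"\bBAYES_\d+\b")

def bayes_rules(scanned_message):
    """Return the set of BAYES_* rules listed in the X-Spam-Status header."""
    msg = message_from_string(scanned_message)
    status = msg.get("X-Spam-Status", "")
    return set(BAYES_RULE.findall(status))

def tally(scanned_messages):
    """Count in how many messages each BAYES_* rule fired."""
    counts = Counter()
    for text in scanned_messages:
        counts.update(bayes_rules(text))
    return counts

# Hypothetical scanned output, for illustration only.
samples = [
    "X-Spam-Status: Yes, score=9.1 required=5.0 tests=BAYES_99,BAYES_999\n\nbody one\n",
    "X-Spam-Status: Yes, score=7.4 required=5.0 tests=BAYES_99,HTML_MESSAGE\n\nbody two\n",
]
counts = tally(samples)
print(counts)
```

Run over the scanned originals, a tally like this would show directly how many of them still hit BAYES_99 or BAYES_999.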
>> if you have users from all over the world speaking different languages, the effect of bayes poisoning gets much more visible because it contains random words in all sorts of languages and you don't have enough ham to reduce that damage
> It sounds like you haven't learned enough. FWIW I do learn "Bayes-poison" and still have >99% of ham hitting BAYES_00. The figure has been rising over the years.
may depend on your mailflow and some luck
>> believe it or not - my goal is to train a bayes database once and have a sane system over many many years - what i read often is "spam samples become outdated and so you need to restart" - no they don't,
> You seem to be relying on most ham hitting BAYES_00, so the rest of the mail can be treated very aggressively. This probably does make you less reliant on an up-to-date spam corpus
which is the goal: not training day after day in circles because you need more and more ham samples to balance out parts that never should have been trained as spam at all