On 19.03.2015 at 20:35, RW wrote:
On Thu, 19 Mar 2015 01:12:15 +0100
Reindl Harald wrote:

On 19.03.2015 at 00:54, RW wrote:

This has nothing to do with auto-learning. There is a difference
between mis-training and training with spam that contains
so-called "Bayes poison".  Bayes is best trained on what is in
real-world spam, not what we would prefer that spammers put in spam.

it's the same - it is exactly the same and it is not a matter of "what
we would prefer that spammers put in spam" but of what they put in
*additionally* to ruin bayes and filter results

They don't put it there to ruin Bayes, they don't care about FP rates,
they put it there so their spam can take advantage of what they guess
has been trained as ham.

no, both

tests over 15000 spam examples prove it: after removing the poison, rebuilding bayes from the cleaned corpus and verifying the original messages against it, all of them still hit BAYES_99

but poison affects your ham and so your FP rates over time

I was just looking at my recent spam and Bayes-poison seems less
common than it used to be, but these things come in cycles.

as most spam comes in cycles, hence auto expire is wrong

analyzing 15000 spam samples shows that *identical* messages sometimes contain poison and sometimes don't

the effect is visible:

* there are more BAYES_00 hits than before
* there are fewer BAYES_50 hits for ham than before
* ALL of the cleaned messages still hit BAYES_99 and most hit BAYES_999

the last point is easy to prove by keeping the old, unmodified corpus
and running spamc against the cleaned bayes database, and the final
result is that you stop training in circles because you no longer need
a ton of classified ham messages to reduce the poison impact


But you're testing mail that's already been trained into the database.
Even though you stripped the "Bayes-poison" when training, you'll have
left enough rare tokens from the headers and elsewhere to effectively
"fingerprint" that spam. It's pretty much inevitable that it hits
BAYES_99[9].
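
The "fingerprint" argument can be illustrated with a toy calculation. The sketch below uses plain naive-Bayes combining of per-token spam probabilities; this is a simplification chosen for illustration and is not SpamAssassin's actual combining code. The function name `combine` and all probabilities are made up.

```python
from math import prod

def combine(token_probs):
    """Naive-Bayes combination of per-token spam probabilities.

    Simplified illustration only; SpamAssassin's real combiner differs.
    """
    spam = prod(token_probs)                   # evidence for spam
    ham = prod(1.0 - p for p in token_probs)   # evidence for ham
    return spam / (spam + ham)

# Hypothetical message: three rare, strongly spammy "fingerprint" tokens
# (0.99 each) plus three ham-looking poison words (0.2 each).
fingerprinted = combine([0.99] * 3 + [0.2] * 3)

# The same three fingerprint tokens surrounded by neutral tokens (0.5).
neutral_padding = combine([0.99] * 3 + [0.5] * 20)
```

Under these assumed numbers both results stay well above 0.99, which is the point being made: a handful of rare tokens that were seen only in spam can dominate the score, whether or not the poison words were trained.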

you didn't get what i wrote

* i removed the poison and rebuilt bayes
* i verified the *original* junk, still containing the poison, against
  the new bayes, because i am not idiot enough to verify cleaned samples
  against a bayes built from the same contents
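
A check like the one described in the two points above ends with a count of how many messages landed in each BAYES_* bucket. A minimal sketch of that counting step follows, assuming the scanned messages carry the usual X-Spam-Status header with a `tests=` list; the helper names `bayes_buckets` and `tally` and the sample header values are hypothetical.

```python
import re
from collections import Counter

# Matches Bayes rule names such as BAYES_00, BAYES_50, BAYES_99, BAYES_999.
BAYES_RULE = re.compile(r"\bBAYES_\d+\b")

def bayes_buckets(status_header):
    """Return the BAYES_* rule names found in one X-Spam-Status value."""
    # Header values may be folded across lines; collapse whitespace first.
    unfolded = re.sub(r"\s+", " ", status_header)
    return BAYES_RULE.findall(unfolded)

def tally(status_headers):
    """Count BAYES_* hits over an iterable of X-Spam-Status values."""
    counts = Counter()
    for header in status_headers:
        counts.update(bayes_buckets(header))
    return counts

# Hypothetical header values, as they might look after scanning.
sample = [
    "Yes, score=9.1 required=5.0 tests=BAYES_99,BAYES_999,HTML_MESSAGE autolearn=no",
    "No, score=-1.9 required=5.0 tests=BAYES_00,\n\tSPF_PASS autolearn=no",
]
result = tally(sample)
```

Running the same tally over the original corpus before and after rebuilding the database gives the before/after comparison without eyeballing individual messages.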

if you have users from all over the world speaking different
languages the effect of bayes poisoning gets much more visible
because the poison contains random words in all sorts of languages
and you don't have enough ham to reduce that damage
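
The "not enough ham" effect can be sketched with a simple per-token estimate: the share of spam containing a word, relative to the share of spam plus the share of ham containing it. This is a textbook simplification, not SpamAssassin's exact formula, and the function name and all counts below are invented for illustration.

```python
def token_spam_prob(spam_hits, ham_hits, n_spam, n_ham):
    """Simplified per-token spam probability from training counts."""
    spam_ratio = spam_hits / n_spam
    ham_ratio = ham_hits / n_ham
    if spam_ratio + ham_ratio == 0:
        return 0.5  # token never seen: treat as neutral
    return spam_ratio / (spam_ratio + ham_ratio)

# Hypothetical everyday word from a language few users write in:
# trained as spam via poison in 300 of 15000 spam, seen in 2 of 1000 ham.
little_ham = token_spam_prob(300, 2, 15000, 1000)

# Same word once 40 of 1000 trained ham messages contain it.
more_ham = token_spam_prob(300, 40, 15000, 1000)
```

With the made-up counts, the word looks strongly spammy while only a couple of ham messages contain it and drops below neutral only after far more ham in that language has been trained, which is the balancing work described above.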

It sounds like you haven't learned enough. FWIW I do learn
"Bayes-poison" and still have >99% of ham hitting BAYES_00. The figure
has been rising over the years.

may depend on your mailflow and some luck

believe it or not - my goal is to train a bayes database once and
have a sane system over many many years - what i read often is "spam
samples become outdated and so you need to restart" - no they don't,

You seem to be relying on most ham hitting BAYES_00, so the rest of the
mail can be treated very aggressively. This probably does make you less
reliant on an up-to-date spam corpus

which is the goal: not training in circles day after day, needing more and more ham samples to balance out parts that never should have been trained as spam at all
