On 13.05.2016 at 23:08, Tom Hendrikx wrote:
On 13-05-16 18:29, Reindl Harald wrote:
especially you would not get much from the bayes-samples because they
would trigger all sorts of wrong rules after stripping most headers and adding a
generic received header (which seems to be needed by the bayes-engine
for whatever reason since it otherwise scores samples completely differently)

This is an assumption: you can't know what your data would contribute to
the masscheck process

this is *not* an assumption - the setup is maintained in a way that i don't have to make many assumptions at all

i run tools for corpus-files and downloads that pass them through SA, and i regularly see all sorts of rules hit on stripped samples which would not hit on the untouched email

guess what remains of a message, compared to the original, after applying a 2292-line "bayes_ignore_header" list which is also used to strip messages with formail
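a minimal sketch of that kind of stripping, to show why a stripped sample looks so different to the rules. this is not the formail-based setup described here: it is a Python stand-in that uses a short hypothetical allow-list (KEEP) instead of the 2292-line ignore list, and the generic Received value is made up for illustration.

```python
from email import message_from_string

# hypothetical allow-list of headers to keep; the setup described above
# works the other way round, with a long list of headers to drop
KEEP = {"from", "to", "subject", "date", "message-id",
        "mime-version", "content-type", "content-transfer-encoding"}

# made-up generic Received header, for illustration only
GENERIC_RECEIVED = "from localhost by localhost; Fri, 13 May 2016 00:00:00 +0000"


def strip_sample(raw: str) -> str:
    """Drop every header not in KEEP, then add one generic Received header."""
    msg = message_from_string(raw)
    # collect names first, deletion removes all occurrences of a header
    for name in {key.lower() for key in msg.keys()}:
        if name not in KEEP:
            del msg[name]
    # note: this appends the header at the end of the header block
    msg["Received"] = GENERIC_RECEIVED
    return msg.as_string()
```

rules that look at relays, mailer headers or authentication results have nothing real left to evaluate after such a pass, which is why stripped samples are a poor fit for anything beyond bayes training.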

the reason is that we maintain a really huge bayes which is intended to contain only the body and a few headers, otherwise 90000 samples would not take only 800 MB of storage and result in "only" 2818486 tokens

why?

because we keep samples and bayes forever, and train every spam message below BAYES_99 and every ham message >= BAYES_50, to keep the option to rebuild from scratch at any point in time (tokenizer changes in future versions, maybe multi-word tokens in future versions, or if needed a switch to a different solution without having to start collecting from scratch)
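the training rule above can be sketched as a small filter. this is only an illustration under assumptions: the function names are hypothetical, the input is taken to be the list of rule names SA reported for a message, and training a message that hit no BAYES_* rule at all is an assumption, not something stated here.

```python
import re

# matches SpamAssassin bayes rule names such as BAYES_00, BAYES_50, BAYES_99
BAYES_RULE = re.compile(r"^BAYES_(\d+)$")


def bayes_bucket(rules):
    """Return the highest BAYES_NN number among the hit rules, or None."""
    buckets = []
    for rule in rules:
        match = BAYES_RULE.match(rule)
        if match:
            buckets.append(int(match.group(1)))
    return max(buckets) if buckets else None


def should_train(is_spam, rules):
    """Spam is trained when below BAYES_99, ham when at BAYES_50 or above."""
    bucket = bayes_bucket(rules)
    if bucket is None:
        # assumption: no bayes verdict at all means the sample is worth training
        return True
    return bucket < 99 if is_spam else bucket >= 50
```

for example, should_train(True, ["BAYES_80", "HTML_MESSAGE"]) selects the message for training, while a spam that already hits BAYES_99 is skipped.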
