Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-15 Thread RW
On Thu, 15 Feb 2018 14:32:36 -0600 (CST) sha...@shanew.net wrote: > I haven't checked the math in the Bayes plugin, but it explicitly > mentions using the "chi-square probability combiner" which is > described at http://www.linuxjournal.com/print.php?sid=6467 > > Maybe I'm misunderstanding what
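For readers who want to see what a "chi-square probability combiner" looks like, here is a minimal sketch in the style of Robinson's method as described in the linked article. It has not been checked against the SpamAssassin Bayes plugin, and the function names `chi2q` and `combine` are made up for this illustration.

```python
import math


def chi2q(x2: float, dof: int) -> float:
    """Upper-tail probability of a chi-square statistic (even dof only)."""
    if dof <= 0 or dof % 2:
        raise ValueError("dof must be a positive even integer")
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    # Closed-form series that is valid when dof is even.
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combine(token_probs):
    """Combine per-token spam probabilities (each strictly between 0 and 1)."""
    n = len(token_probs)
    if n == 0:
        return 0.5  # no evidence either way
    spam_stat = -2.0 * sum(math.log(1.0 - p) for p in token_probs)
    ham_stat = -2.0 * sum(math.log(p) for p in token_probs)
    s = 1.0 - chi2q(spam_stat, 2 * n)  # evidence for spam
    h = 1.0 - chi2q(ham_stat, 2 * n)   # evidence for ham
    return (1.0 + s - h) / 2.0


if __name__ == "__main__":
    print(combine([0.99] * 10))               # strongly spammy tokens
    print(combine([0.01] * 10))               # strongly hammy tokens
    print(combine([0.99] * 5 + [0.01] * 5))   # conflicting evidence
```

The number of tokens enters through the degrees of freedom (`2 * n`), which is one place where the amount of evidence affects the combined score even though each per-token probability is a plain frequency ratio.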

Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-15 Thread shanew
On Thu, 15 Feb 2018, RW wrote: On Thu, 15 Feb 2018 11:56:55 -0600 (CST) sha...@shanew.net wrote: So, the sample size doesn't matter when calculating the probability of a message being spam based on individual tokens, but it can matter when we bring them all together to make a final

Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-15 Thread RW
On Thu, 15 Feb 2018 20:16:24 +0100 Reindl Harald wrote: > On 15.02.2018 at 20:10, RW wrote: > > I'm not saying that it doesn't matter how much you train, I'm saying > > that if you have enough spam and enough ham Bayes is insensitive to > > the ratio > > but not when the ratio differs in

Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-15 Thread RW
On Thu, 15 Feb 2018 11:56:55 -0600 (CST) sha...@shanew.net wrote: > On Thu, 15 Feb 2018, RW wrote: > > > As I said, Bayes is based on frequencies. > > > > If a token occurs in 10% of ham and 0.5% of spam based on 10,000 > > hams and 10,000 spams, what do you think is likely to happen to > >

Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-15 Thread RW
On Thu, 15 Feb 2018 19:24:14 +0100 Reindl Harald wrote: > On 15.02.2018 at 19:20, RW wrote: > > On Thu, 15 Feb 2018 17:15:47 +0100 > > You are talking about ultra-rare tokens here, the chances of these > > dominating a classification is negligible > it is not - in 2015 i had to purge "in

Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-15 Thread RW
On Thu, 15 Feb 2018 17:15:47 +0100 Reindl Harald wrote: > On 15.02.2018 at 17:01, RW wrote: > > On Thu, 15 Feb 2018 00:01:18 +0100 > > Reindl Harald wrote: > > > >> On 14.02.2018 at 23:07, RW wrote: > > > >>> My point is that an imbalance doesn't create a bias > > > >> wrong -

Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-15 Thread shanew
On Thu, 15 Feb 2018, RW wrote: On Thu, 15 Feb 2018 00:01:18 +0100 Reindl Harald wrote: On 14.02.2018 at 23:07, RW wrote: My point is that an imbalance doesn't create a bias wrong - what you tried to say was "doesn't necessarily create a bias" - but in fact when the imbalance is too big

Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-15 Thread RW
On Thu, 15 Feb 2018 00:01:18 +0100 Reindl Harald wrote: > On 14.02.2018 at 23:07, RW wrote: > > My point is that an imbalance doesn't create a bias > wrong - what you tried to say was "doesn't necessarily create a bias" > - but in fact when the imbalance is too big *it does* > > simply

Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-14 Thread RW
On Wed, 14 Feb 2018 16:20:30 +0100 Matus UHLAR - fantomas wrote: > >On Tue, 13 Feb 2018 21:02:46 + > >Horváth Szabolcs wrote: > >> One more question: is there a recommended ham to spam ratio? 1:1? > > On 14.02.18 15:09, RW wrote: > >No, this is a myth. Bayes computes token probabilities

Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-14 Thread David Jones
On 02/14/2018 09:20 AM, Matus UHLAR - fantomas wrote: On Tue, 13 Feb 2018 21:02:46 + Horváth Szabolcs wrote: One more question: is there a recommended ham to spam ratio? 1:1? On 14.02.18 15:09, RW wrote: No, this is a myth.  Bayes computes token probabilities from a token's frequencies

Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-14 Thread Matus UHLAR - fantomas
On Tue, 13 Feb 2018 21:02:46 + Horváth Szabolcs wrote: One more question: is there a recommended ham to spam ratio? 1:1? On 14.02.18 15:09, RW wrote: No, this is a myth. Bayes computes token probabilities from a token's frequencies in spam and ham, so it all scales through. If you have

Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-14 Thread RW
On Tue, 13 Feb 2018 21:02:46 + Horváth Szabolcs wrote: > One more question: is there a recommended ham to spam ratio? 1:1? No, this is a myth. Bayes computes token probabilities from a token's frequencies in spam and ham, so it all scales through. If you have 2000 ham and 200 spam the
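The "it all scales through" point can be illustrated with a small sketch. It assumes the simple frequency-ratio form of a token probability; the real plugin applies further adjustments that are not shown here. The corpus sizes (2000 ham, 200 spam) come from the message above, while the per-token hit counts and the function names are invented for illustration.

```python
def token_spam_probability(spam_hits, n_spam, ham_hits, n_ham):
    """Probability estimate from per-corpus frequencies, not raw counts."""
    spam_freq = spam_hits / n_spam  # fraction of spam containing the token
    ham_freq = ham_hits / n_ham     # fraction of ham containing the token
    if spam_freq + ham_freq == 0:
        return 0.5                  # token never seen: neutral
    return spam_freq / (spam_freq + ham_freq)


def raw_count_ratio(spam_hits, ham_hits):
    """Naive version using raw counts, shown only for contrast."""
    return spam_hits / (spam_hits + ham_hits)


# Hypothetical token seen in 10% of spam and 0.5% of ham.
unbalanced = token_spam_probability(20, 200, 10, 2000)    # 200 spam, 2000 ham
balanced = token_spam_probability(200, 2000, 10, 2000)    # 2000 spam, 2000 ham

if __name__ == "__main__":
    print(unbalanced, balanced)                           # same value either way
    print(raw_count_ratio(20, 10), raw_count_ratio(200, 10))  # these differ
```

Because each count is divided by the size of its own corpus, changing the ham-to-spam ratio leaves the estimate unchanged as long as the token's frequency within each corpus stays the same. The raw-count version does not have that property.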

Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-14 Thread Rupert Gallagher
They cannot (do not want, do not have the know-how) study the e-mails, and therefore they cannot build a reliable corpus. All they can do is to trust the ability of their users to study their own e-mails well enough to do the job, hence the mess with ham/spam when feeding the Bayesian filter.

Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread Bill Cole
On 13 Feb 2018, at 9:33, Horváth Szabolcs wrote: This is a production mail gateway serving since 2015. I saw that a few messages (both hams and spams) automatically learned by amavisd/spamassassin. Today's statistics: 3616 autolearn=ham 10076 autolearn=no 2817 autolearn=spam 134

RE: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread John Hardin
On Tue, 13 Feb 2018, Horváth Szabolcs wrote: 3. populate the ham database That's the tricky part. As I mentioned earlier, I don't really want end-users involved in this. You might be able to find a few that are somewhat technically competent and don't mind their ham samples being manually

Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread Benny Pedersen
John Hardin wrote on 2018-02-14 02:28: Properly training your Bayes and increasing the score for BAYES_80, BAYES_95, and BAYES_99 and BAYES_999 score BAYES_999 5000 /me hides, could not resist :=)

Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread John Hardin
On Tue, 13 Feb 2018, David Jones wrote: Properly training your Bayes and increasing the score for BAYES_80, BAYES_95, and BAYES_99 and BAYES_999 is the best bet on this one. -- John Hardin KA7OHZ http://www.impsec.org/~jhardin/ jhar...@impsec.org FALaholic #11174

RE: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread Horváth Szabolcs
Hello, David Jones [mailto:djo...@ena.com] wrote: > With non-English email flow, it's more challenging. If no RBLs hit, then you > really must train your Bayes properly which requires some way to accurately > determine the ham and spam. You must keep a copy of the ham and spam corpora and be

Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread David Jones
On 02/13/2018 11:45 AM, Horváth Szabolcs wrote: Reindl Harald [mailto:h.rei...@thelounge.net] wrote: I think I have no control over what is learnt automatically. surely, don't do autolearning at all This is a mail gateway for multiple companies. I'm not supposed to read e-mails on that, or

Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread David Jones
On 02/13/2018 11:24 AM, Horváth Szabolcs wrote: Hello, David Jones [mailto:djo...@ena.com] wrote: There should be many more rule hits than just these 3. It looks like network tests aren't happening. Can you post the original email to pastebin.com with minimal redacting so the rest of us can

RE: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread Horváth Szabolcs
Reindl Harald [mailto:h.rei...@thelounge.net] wrote: >> This is a mail gateway for multiple companies. I'm not supposed to read >> e-mails on that, or to pick mails that can be used for learning ham > > how did you then manage 1.4 million ham samples in your biased corpus Looks like in this

RE: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread Horváth Szabolcs
Reindl Harald [mailto:h.rei...@thelounge.net] wrote: >> I think I have no control over what is learnt automatically. > surely, don't do autolearning at all This is a mail gateway for multiple companies. I'm not supposed to read e-mails on that, or to pick mails that can be used for learning ham.

Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread John Hardin
On Tue, 13 Feb 2018, Horváth Szabolcs wrote:
After:
pts rule name              description
--- ---------------------- --------------------------------------------------
0.0 HTML_IMAGE_RATIO_08    BODY: HTML has a low ratio of text to image area
0.0 HTML_MESSAGE           BODY: HTML included

RE: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread Horváth Szabolcs
Hello, David Jones [mailto:djo...@ena.com] wrote: > There should be many more rule hits than just these 3. It looks like > network tests aren't happening. > Can you post the original email to pastebin.com with minimal redacting > so the rest of us can run it through our SA to see how it

Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread David Jones
On 02/13/2018 07:55 AM, Horváth Szabolcs wrote: Dear members, User repeatedly sends us spam messages to train SA. Training - at the moment - requires manual intervention: administrator verifies if it's really spam then issues sa-learn. Then the user thinks the process is done, and the next

RE: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread Horváth Szabolcs
Reindl Harald [mailto:h.rei...@thelounge.net] wrote: > > However, that doesn't happen. > > 0.000 0 338770 0 non-token data: nspam > > 0.000 0 1460807 0 non-token data: nham > what do you expect when you train 4 times more ham than spam? > frankly you
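The "4 times more ham than spam" figure can be checked from the two counter lines quoted above (presumably taken from `sa-learn --dump magic` output). The sketch below assumes the count is the numeric field that sits two places before the `non-token data:` label, as in those quoted lines. The helper name `read_counts` is made up for this example.

```python
import re

# The two counter lines as quoted in the message, with columns separated.
DUMP = """\
0.000 0 338770 0 non-token data: nspam
0.000 0 1460807 0 non-token data: nham
"""


def read_counts(text):
    """Extract the nspam and nham counters from dump-style text."""
    pattern = re.compile(r"(\d+)\s+\d+\s+non-token data: (nspam|nham)")
    return {name: int(value) for value, name in pattern.findall(text)}


counts = read_counts(DUMP)
ham_per_spam = counts["nham"] / counts["nspam"]

if __name__ == "__main__":
    print(counts)
    print(f"ham per spam message: {ham_per_spam:.2f}")
```

With the quoted counters, the database holds a little over four ham messages for every spam message.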

Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread Horváth Szabolcs
Dear members, User repeatedly sends us spam messages to train SA. Training - at the moment - requires manual intervention: administrator verifies if it's really spam then issues sa-learn. Then the user thinks the process is done, and the next time when the same email arrives, it will