Re: Can your bayes do this?

2016-01-21 Thread John Hardin

On Thu, 21 Jan 2016, RW wrote:


On Thu, 21 Jan 2016 08:53:10 -0800 (PST)
John Hardin wrote:


There was an improvement in FP and FN from two tokens. The marginal
improvement from three doesn't seem worth it.


The improvement from 2 to 3 is more substantial than from 1 to 2

287/160 = 1.79

160/69  = 2.3


Ugh. I looked at the raw numbers rather than the ratio - sorry.

287/69 looks even better, 4.2
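
The ratios are easy to check. A minimal Python sketch, assuming the FN counts quoted in this thread (287, 160 and 69 for one-, two- and three-word tokens); the names are illustrative only:

```python
# False-negative counts quoted in this thread, keyed by words per token.
fn_counts = {1: 287, 2: 160, 3: 69}


def improvement(before: int, after: int) -> float:
    """Factor by which the false negatives shrink going from `before` to `after`."""
    return before / after


one_to_two = improvement(fn_counts[1], fn_counts[2])
two_to_three = improvement(fn_counts[2], fn_counts[3])
one_to_three = improvement(fn_counts[1], fn_counts[3])

print(f"1 -> 2 words: {one_to_two:.2f}")
print(f"2 -> 3 words: {two_to_three:.2f}")
print(f"1 -> 3 words: {one_to_three:.2f}")
```

The step from two to three words is the larger of the two single steps, which is the point RW makes above.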


Whether any of this is worth it depends on a lot of things. I don't
think it's even obvious whether 3-word tokenization is more resource
intensive than 2-word. Clearly in the limit where ntokens goes to
infinity  3-word will outperform 2-word at the same database size,
which means that it can achieve the same level of performance with a
smaller database. I've no feeling for what value of ntokens that
switches around.


So it should be configurable, and if you change it you monitor token 
database size and scan times and FP/FN rate and adjust token expiry 
to manage, or switch it back to 1 if the improvement costs too much.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Maxim IV: Close air support covereth a multitude of sins.
---
 2 days until John Moses Browning's 161st Birthday


Re: Can your bayes do this?

2016-01-21 Thread Reindl Harald



On 21.01.2016 at 20:38, RW wrote:

On Thu, 21 Jan 2016 08:53:10 -0800 (PST)
John Hardin wrote:



There was an improvement in FP and FN from two tokens. The marginal
improvement from three doesn't seem worth it.


The improvement from 2 to 3 is more substantial than from 1 to 2

  287/160 = 1.79

  160/69  = 2.3

Whether any of this is worth it depends on a lot of things. I don't
think it's even obvious whether 3-word tokenization is more resource
intensive than 2-word. Clearly in the limit where ntokens goes to
infinity  3-word will outperform 2-word at the same database size,
which means that it can achieve the same level of performance with a
smaller database. I've no feeling for what value of ntokens that
switches around


if SA provided an additional param like 
"bayes_multiword_tokens " i could test it against 8 
messages with different params. there is also a 700-entry 
ignore-list for our daily check, which could be tested 
automatically to see whether those samples swap over to BAYES_999 like 
the rest while all ham-samples still have BAYES_00


i run these tests every night against the whole corpus, with a report to 
detect mis-training when samples previously classified as BAYES_999 or 
BAYES_00 change their result
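
A check of that kind could look roughly like the Python sketch below. It is only an illustration, not the actual worker script: `find_flips`, the sample paths and the two result dictionaries are hypothetical, and it assumes each run has already been reduced to one BAYES_* rule name per sample.

```python
def find_flips(previous, current):
    """Return (sample, old_rule, new_rule) for samples that left BAYES_999 or BAYES_00."""
    extremes = {"BAYES_999", "BAYES_00"}
    flips = []
    for sample, old_rule in sorted(previous.items()):
        new_rule = current.get(sample)  # None if the sample is missing from the new run
        if old_rule in extremes and new_rule != old_rule:
            flips.append((sample, old_rule, new_rule))
    return flips


# Hypothetical results from last night's run and tonight's run.
previous = {
    "spam/a.eml": "BAYES_999",
    "ham/b.eml": "BAYES_00",
    "spam/c.eml": "BAYES_80",
}
current = {
    "spam/a.eml": "BAYES_999",
    "ham/b.eml": "BAYES_50",
    "spam/c.eml": "BAYES_999",
}

flips = find_flips(previous, current)
for sample, old_rule, new_rule in flips:
    print(f"{sample}: {old_rule} -> {new_rule}")
```

Only samples that were at one of the two extremes before are reported, so a sample moving up to BAYES_999 (like the third one here) is not flagged as mis-training.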


that's done with a dedicated SA-instance doing only the bayes test and 
nothing else, fed by "spamc" with the outputs parsed; it takes around 1 
hour on the current hardware



the exclude list can be checked in isolation with a param, and anything which 
reached BAYES_999 is automatically removed; it looks like below (no, the 
worker scripts are not running as root)


so the first test would fire that with 2-, 3- and 4-word tokens and look at how 
many samples change to BAYES_999 while no ham-samples from the large 
tests lose their BAYES_00


i can clone that machine and re-build the whole bayes database from 
scratch within 15 minutes from the corpus files



[root@mail-gw:~]$ corpus-stats ignored
NON-BAYES-999: 
/var/lib/spamass-milter/training/spam/2016-01-20-14-54-26-20d340f85ff0e415a34776f2ddac2f98.eml

1 / 639 (SPAM: 2016-01-20-14-54-26-20d340f85ff0e415a34776f2ddac2f98.eml)

NON-BAYES-999: 
/var/lib/spamass-milter/training/spam/2016-01-20-13-39-17-ecd6cd231935b352cd1c184224987b03.eml

2 / 639 (SPAM: 2016-01-20-13-39-17-ecd6cd231935b352cd1c184224987b03.eml)

NON-BAYES-999: 
/var/lib/spamass-milter/training/spam/2016-01-20-13-39-17-41624e1f3a9314bbf56fedfbc3e56e11.eml

3 / 639 (SPAM: 2016-01-20-13-39-17-41624e1f3a9314bbf56fedfbc3e56e11.eml)

NON-BAYES-999: 
/var/lib/spamass-milter/training/spam/2016-01-20-12-17-18-ada25ecf2eb04344e23d853bc59a85b2.eml

4 / 639 (SPAM: 2016-01-20-12-17-18-ada25ecf2eb04344e23d853bc59a85b2.eml)

NON-BAYES-999: 
/var/lib/spamass-milter/training/spam/2016-01-20-12-17-18-234598268235618c8167a1e9c93701c8.eml

5 / 639 (SPAM: 2016-01-20-12-17-18-234598268235618c8167a1e9c93701c8.eml)

NON-BAYES-999: 
/var/lib/spamass-milter/training/spam/2016-01-20-12-17-18-720e314e86bf81550966764d7fd8d802.eml

6 / 639 (SPAM: 2016-01-20-12-17-18-720e314e86bf81550966764d7fd8d802.eml)

NON-BAYES-999: 
/var/lib/spamass-milter/training/spam/2016-01-20-12-17-18-34df375a0ac059678e7d053bad31acdc.eml

7 / 639 (SPAM: 2016-01-20-12-17-18-34df375a0ac059678e7d053bad31acdc.eml)




signature.asc
Description: OpenPGP digital signature


Re: Can your bayes do this?

2016-01-21 Thread RW
On Thu, 21 Jan 2016 08:53:10 -0800 (PST)
John Hardin wrote:


> There was an improvement in FP and FN from two tokens. The marginal 
> improvement from three doesn't seem worth it.

The improvement from 2 to 3 is more substantial than from 1 to 2

 287/160 = 1.79

 160/69  = 2.3

Whether any of this is worth it depends on a lot of things. I don't
think it's even obvious whether 3-word tokenization is more resource
intensive than 2-word. Clearly in the limit where ntokens goes to
infinity  3-word will outperform 2-word at the same database size,
which means that it can achieve the same level of performance with a
smaller database. I've no feeling for what value of ntokens that
switches around.




Re: Can your bayes do this?

2016-01-21 Thread Reindl Harald


On 21.01.2016 at 17:53, John Hardin wrote:

On Thu, 21 Jan 2016, RW wrote:


On Thu, 21 Jan 2016 14:31:09 +0100
Christian Laußat wrote:


On 21.01.2016 at 14:17, RW wrote:

The FNs dropped from 287 to 69, which I'd call a four-fold
improvement.

The FPs rose from 0 to 1, but that mail was ham quoting a full
spam, so arguably it just did a better job in detecting the
embedded spam.


Yes, but is it really worth the resources? I mean, the database got
13 time larger for 3 word token, and with more words per token it
will grow exponentially.


But if you are training on error it only grows by a factor of 3.1
(13*69/287).  You also have to consider what happens if you simply
reduce the retention time by a factor of 3.1 - that corpus had 4 years
retention so it's unlikely that maintaining a constant size database
would have made much difference in this case. When you train from
corpus the database size is dominated by ephemeral tokens which makes
the situation look worse than it is.

It depends what you want. I don't care about an extra 100 MB
of disk space and a few milliseconds if it gives any measurable
improvement.

Personally I wouldn't like to see Bayes go multi-word because it would
likely end-up as a poor compromise. Two-word tokenization is the
default on DSPAM, but I've not seen anyone advocate using it. I think
it's better to score in an external filter that runs in addition to
Bayes.


There was an improvement in FP and FN from two tokens. The marginal
improvement from three doesn't seem worth it.

I'd like to see a SA Bayes config option to select between one-word and
two-word tokens



not only you!

like "bayes_token_sources all" was introduced, a "bayes_multiword_tokens 
" option, disabled by default, would be perfect, so one could easily 
verify the differences with an existing corpus and see what gives the best result


like the mime-tokens, these should be additional to the 1-word tokens, 
which are generated in any case
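
A rough Python sketch of what "additional" multi-word tokens could mean. This is not SpamAssassin's tokenizer, and `multiword_tokens` and its `max_words` parameter are made up for illustration; it assumes plain whitespace splitting and lowercasing.

```python
def multiword_tokens(text, max_words=2):
    """Return the 1-word tokens plus every run of 2..max_words adjacent words."""
    words = text.lower().split()
    tokens = []
    for n in range(1, max_words + 1):
        # Slide a window of n words over the text.
        for i in range(len(words) - n + 1):
            tokens.append(" ".join(words[i:i + n]))
    return tokens


tokens = multiword_tokens("cheap ambulatory care", max_words=2)
print(tokens)
```

With `max_words=1` the result is just the single words, so the 1-word tokens are always part of the output and the longer ones only add to them.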

_

as for "Two-word tokenization is the default on DSPAM, but I've not seen 
anyone advocate using it": that's just because it is a dead project. looking 
only at the bayes implementation, i have read more than once that it's better 
than SA's, and the reason not to consider it was the fact that it's dead and 
full of unfixed bugs






Re: Can your bayes do this?

2016-01-21 Thread John Hardin

On Thu, 21 Jan 2016, RW wrote:


On Thu, 21 Jan 2016 14:31:09 +0100
Christian Laußat wrote:


On 21.01.2016 at 14:17, RW wrote:

The FNs dropped from 287 to 69, which I'd call a four-fold
improvement.

The FPs rose from 0 to 1, but that mail was ham quoting a full
spam, so arguably it just did a better job in detecting the
embedded spam.


Yes, but is it really worth the resources? I mean, the database got
13 time larger for 3 word token, and with more words per token it
will grow exponentially.


But if you are training on error it only grows by a factor of 3.1
(13*69/287).  You also have to consider what happens if you simply
reduce the retention time by a factor of 3.1 - that corpus had 4 years
retention so it's unlikely that maintaining a constant size database
would have made much difference in this case. When you train from
corpus the database size is dominated by ephemeral tokens which makes
the situation look worse than it is.

It depends what you want. I don't care about an extra 100 MB
of disk space and a few milliseconds if it gives any measurable
improvement.

Personally I wouldn't like to see Bayes go multi-word because it would
likely end-up as a poor compromise. Two-word tokenization is the
default on DSPAM, but I've not seen anyone advocate using it. I think
it's better to score in an external filter that runs in addition to
Bayes.


There was an improvement in FP and FN from two tokens. The marginal 
improvement from three doesn't seem worth it.


I'd like to see a SA Bayes config option to select between one-word and 
two-word tokens.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Public Education: the bureaucratic process of replacing
  an empty mind with a closed one.  -- Thorax
---
 2 days until John Moses Browning's 161st Birthday

Re: Rule updates are too old - 2016-01-21

2016-01-21 Thread Axb

On 01/21/2016 05:42 PM, John Hardin wrote:

On Thu, 21 Jan 2016, dar...@chaosreigns.com wrote:


20160120:  Spam or ham is below threshold of 150,000:
http://ruleqa.spamassassin.org/?daterev=20160120
20160120:  Spam: 131777, Ham: 142710


Oooo, so close!


My spam levels are extremely low so I've increased my corpus' retention 
time and it's helping.

(till my masschecks are not delivered in the given time window :-)

With a bit of luck on Sat we'll have enough to push rules.




Re: Rule updates are too old - 2016-01-21

2016-01-21 Thread John Hardin

On Thu, 21 Jan 2016, dar...@chaosreigns.com wrote:


20160120:  Spam or ham is below threshold of 150,000:  
http://ruleqa.spamassassin.org/?daterev=20160120
20160120:  Spam: 131777, Ham: 142710


Oooo, so close!

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Public Education: the bureaucratic process of replacing
  an empty mind with a closed one.  -- Thorax
---
 2 days until John Moses Browning's 161st Birthday


Re: Can your bayes do this?

2016-01-21 Thread RW
On Thu, 21 Jan 2016 14:31:09 +0100
Christian Laußat wrote:

> On 21.01.2016 at 14:17, RW wrote:
> > The FNs dropped from 287 to 69, which I'd call a four-fold
> > improvement.
> > 
> > The FPs rose from 0 to 1, but that mail was ham quoting a full
> > spam, so arguably it just did a better job in detecting the
> > embedded spam.  
> 
> Yes, but is it really worth the resources? I mean, the database got
> 13 time larger for 3 word token, and with more words per token it
> will grow exponentially.

But if you are training on error it only grows by a factor of 3.1
(13*69/287).  You also have to consider what happens if you simply
reduce the retention time by a factor of 3.1 - that corpus had 4 years
retention so it's unlikely that maintaining a constant size database
would have made much difference in this case. When you train from
corpus the database size is dominated by ephemeral tokens which makes
the situation look worse than it is. 
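
As a quick check of the 3.1 figure, a small Python sketch. It assumes, as the estimate above seems to, that with train-on-error the amount of training scales with the number of misclassified mails; the variable names are made up for illustration:

```python
# Figures quoted in this thread.
db_growth_full_corpus = 13  # database size factor reported for 3-word tokens
fn_single_word = 287        # false negatives with 1-word tokens
fn_three_word = 69          # false negatives with 3-word tokens

# Fewer errors means fewer mails get trained, which offsets part of the growth.
growth_train_on_error = db_growth_full_corpus * fn_three_word / fn_single_word

print(f"estimated growth when training on error: {growth_train_on_error:.1f}x")
```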

It depends what you want. I don't care about an extra 100 MB
of disk space and a few milliseconds if it gives any measurable
improvement. 

Personally I wouldn't like to see Bayes go multi-word because it would
likely end-up as a poor compromise. Two-word tokenization is the
default on DSPAM, but I've not seen anyone advocate using it. I think
it's better to score in an external filter that runs in addition to
Bayes.



  


Fixed? Re: Looking for a way to dump spam assassin modified mail

2016-01-21 Thread Robert Chalmers
Just to let anyone else know who may be interested. This appears to have solved 
the last little bit of spam getting through, as well as removing email that had 
the addition to the Subject line of the SPAM** signal.

I haven’t had ANY spam sneak through now since I implemented this. Thanks 
Robert M.

Of course I have spam assassin itself running with the counts quite strict. I 
also have postscreen in postfix and other settings in there all quite rigid. 
and so on.





amavisd.conf

$log_level = 1; # set the log level to one
$sa_tag_level_deflt = -999; # i want to see the headers so change to -999
$sa_tag2_level_deflt = 5.0; # start with 5
$sa_kill_level_deflt = 9; # change to 9
$sa_dsn_cutoff_level = 9; # change to 9
$sa_quarantine_cutoff_level = 50; # remove the starting # and change to 50
$notify_method = 'smtp:[127.0.0.1]:10025'; # uncomment the line
$forward_method = 'smtp:[127.0.0.1]:10025'; # uncomment the line
$final_banned_destiny = D_BOUNCE; # change to D_DISCARD 


> On 21 Jan 2016, at 12:57, Robert Chalmers  wrote:
> 
> That looks to be just what I want. I now have it running, so will see how it 
> goes. Thanks for that. Much appreciated
> 
> There are a few other really good options, but this one is nice and compact, 
> no extra scripts.
> 
> I’m running amavis-new, postfix with postscreen fairly heavily in use, 
> dovecot, spamassassin and not much gets through, but enough to annoy me.
> 
> Thanks to everyone  for the pointers. very useful.
> 
> Robert
> 
> 
>> On 21 Jan 2016, at 12:41, Robert Moskowitz wrote:
>> 
>> I use amavis-new to do this:
>> 
>> amavisd.conf
>> 
>> $log_level = 1; # set the log level to one
>> $sa_tag_level_deflt = -999; # i want to see the headers so change to -99
>> $sa_tag2_level_deflt = 5.0; # start with 5
>> $sa_kill_level_deflt = 9; # change to 9
>> $sa_dsn_cutoff_level = 9; # change to 9
>> $sa_quarantine_cutoff_level = 50; # remove the starting # and change to 
>> 50
>> $notify_method = 'smtp:[127.0.0.1]:10025'; # uncomment the line
>> $forward_method = 'smtp:[127.0.0.1]:10025'; # uncomment the line
>> $final_banned_destiny = D_BOUNCE; # change to D_DISCARD 
>> 
>> 
>> 
>> On 01/21/2016 07:25 AM, Robert Chalmers wrote:
>>> I’m looking for a way to just dump mail that has the header modified with 
>>> the * SPAM * assignment.
>>> 
>>> I mean, not have the Client mail reader do it, just have either spamd, or 
>>> postfix/dovecot  dump it.
>>> 
>>> I’m sure I’ve seen something about doing this, but can’t find it 
>>> now…. lost in al the configurations.
>>> 
>>> thanks
>>> 
>>> 
>>> 
>>> Robert Chalmers
>>> rob...@chalmers.com .au  Quantum Radio: 
>>> http://tinyurl.com/lwwddov 
>>> Mac mini 6.2 - 2012, Intel Core i7,2.3 GHz, Memory:16 GB. El-Capitan 10.11. 
>>> 2TB Storage made up of - 
>>> Drive 0:HGST HTS721010A9E630. Upper bay. Drive 1:ST1000LM024 HN-M101MBB. 
>>> Lower Bay
>>> 
>>> 
>>> 
>> 
> 
> Robert Chalmers
> rob...@chalmers.com .au  Quantum Radio: 
> http://tinyurl.com/lwwddov 
> Mac mini 6.2 - 2012, Intel Core i7,2.3 GHz, Memory:16 GB. El-Capitan 10.11. 
> 2TB Storage made up of - 
> Drive 0:HGST HTS721010A9E630. Upper bay. Drive 1:ST1000LM024 HN-M101MBB. 
> Lower Bay
> 
> 
> 

Robert Chalmers
rob...@chalmers.com .au  Quantum Radio: 
http://tinyurl.com/lwwddov
Mac mini 6.2 - 2012, Intel Core i7,2.3 GHz, Memory:16 GB. El-Capitan 10.11. 2TB 
Storage made up of - 
Drive 0:HGST HTS721010A9E630. Upper bay. Drive 1:ST1000LM024 HN-M101MBB. Lower 
Bay





Re: Can your bayes do this?

2016-01-21 Thread Reindl Harald


On 21.01.2016 at 14:17, RW wrote:

On Thu, 21 Jan 2016 13:45:08 +0100
Christian Laußat wrote:


On 21.01.2016 at 13:19, Reindl Harald wrote:

not entirely when "Currently, SA's bayes tokens are single words" from
https://mail-archives.apache.org/mod_mbox/spamassassin-dev/201211.mbox/%3c509d55a8.30...@gmail.com%3E
is still true

please review that response below and consider 2/4 word tokens
*additionally* in the SA-tokenizer and it will beat out the "new
magic" easily with a well trained bayes in all cases


Bogofilter has an option to specify how many tokens to put into
bayes. Here is an analysis of how effective this was:
http://www.bogofilter.org/pipermail/bogofilter-dev/2006q3/003349.html

In my opinion it's not worth the effort. You'll blow up your database
for little better matching rate.


The FNs dropped from 287 to 69, which I'd call a four-fold improvement.

The FPs rose from 0 to 1, but that mail was ham quoting a full spam, so
arguably it just did a better job in detecting the embedded spam.


also see http://www.paulgraham.com/sofar.html

When the spammers do try to rewrite their messages, they'll probably do 
it by replacing individual spammy tokens with phrases of more neutral 
words. But multi-word filters will learn and catch these phrases too

_

if in doubt, that "blown up database" can have the effect that you need fewer 
training samples for the same outcome






Re: Can your bayes do this?

2016-01-21 Thread RW
On Thu, 21 Jan 2016 13:45:08 +0100
Christian Laußat wrote:

> On 21.01.2016 at 13:19, Reindl Harald wrote:
> > not entirely when "Currently, SA's bayes tokens are single words" from
> > https://mail-archives.apache.org/mod_mbox/spamassassin-dev/201211.mbox/%3c509d55a8.30...@gmail.com%3E
> > is still true
> > 
> > please review that response below and consider 2/4 word tokens
> > *additionally* in the SA-tokenizer and it will beat out the "new
> > magic" easily with a well trained bayes in all cases
> 
> Bogofilter has an option to specify how many tokens to put into
> bayes. Here is an analysis of how effective this was:
> http://www.bogofilter.org/pipermail/bogofilter-dev/2006q3/003349.html
> 
> In my opinion it's not worth the effort. You'll blow up your database 
> for little better matching rate.

The FNs dropped from 287 to 69, which I'd call a four-fold improvement.

The FPs rose from 0 to 1, but that mail was ham quoting a full spam, so
arguably it just did a better job in detecting the embedded spam.


Re: Looking for a way to dump spam assassin modified mail

2016-01-21 Thread Robert Chalmers
That looks to be just what I want. I now have it running, so will see how it 
goes. Thanks for that. Much appreciated

There are a few other really good options, but this one is nice and compact, no 
extra scripts.

I’m running amavis-new, postfix with postscreen fairly heavily in use, dovecot, 
spamassassin and not much gets through, but enough to annoy me.

Thanks to everyone  for the pointers. very useful.

Robert


> On 21 Jan 2016, at 12:41, Robert Moskowitz  wrote:
> 
> I use amavis-new to do this:
> 
> amavisd.conf
> 
> $log_level = 1; # set the log level to one
> $sa_tag_level_deflt = -999; # i want to see the headers so change to -99
> $sa_tag2_level_deflt = 5.0; # start with 5
> $sa_kill_level_deflt = 9; # change to 9
> $sa_dsn_cutoff_level = 9; # change to 9
> $sa_quarantine_cutoff_level = 50; # remove the starting # and change to 50
> $notify_method = 'smtp:[127.0.0.1]:10025'; # uncomment the line
> $forward_method = 'smtp:[127.0.0.1]:10025'; # uncomment the line
> $final_banned_destiny = D_BOUNCE; # change to D_DISCARD 
> 
> 
> 
> On 01/21/2016 07:25 AM, Robert Chalmers wrote:
>> I’m looking for a way to just dump mail that has the header modified with 
>> the * SPAM * assignment.
>> 
>> I mean, not have the Client mail reader do it, just have either spamd, or 
>> postfix/dovecot  dump it.
>> 
>> I’m sure I’ve seen something about doing this, but can’t find it 
>> now…. lost in al the configurations.
>> 
>> thanks
>> 
>> 
>> 
>> Robert Chalmers
>> rob...@chalmers.com .au  Quantum Radio: 
>> http://tinyurl.com/lwwddov 
>> Mac mini 6.2 - 2012, Intel Core i7,2.3 GHz, Memory:16 GB. El-Capitan 10.11. 
>> 2TB Storage made up of - 
>> Drive 0:HGST HTS721010A9E630. Upper bay. Drive 1:ST1000LM024 HN-M101MBB. 
>> Lower Bay
>> 
>> 
>> 
> 

Robert Chalmers
rob...@chalmers.com .au  Quantum Radio: 
http://tinyurl.com/lwwddov
Mac mini 6.2 - 2012, Intel Core i7,2.3 GHz, Memory:16 GB. El-Capitan 10.11. 2TB 
Storage made up of - 
Drive 0:HGST HTS721010A9E630. Upper bay. Drive 1:ST1000LM024 HN-M101MBB. Lower 
Bay





Re: Can your bayes do this?

2016-01-21 Thread Dianne Skoll
On Thu, 21 Jan 2016 12:11:15 +
RW  wrote:

>   "ambulatory care" -> only in ham
...
> is that you have discarded the count information.

And his assertion is not necessarily true, either.  According to our
statistics, we've seen "ambulatory care" in 1400 hams, but also in 22
spams.  While 1400/1422 still makes the token useful for Bayes, his algorithm
would discount it altogether because it's not "pure" ham.
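
To make the contrast concrete, here is a simplified Python sketch using the counts from this message, read as 1400 hams and 22 spams. It is not SpamAssassin's actual Bayes formula: `spam_probability` is a naive ratio that assumes equally sized ham and spam corpora and no smoothing, and `pure_list_verdict` is a made-up stand-in for the "only in ham / only in spam" lists.

```python
def spam_probability(ham_count, spam_count):
    """Naive per-token spam probability from raw counts (no smoothing)."""
    return spam_count / (ham_count + spam_count)


def pure_list_verdict(ham_count, spam_count):
    """Mimic 'pure' lists: a token counts only if it was never seen on the other side."""
    if ham_count > 0 and spam_count == 0:
        return "ham-only"
    if spam_count > 0 and ham_count == 0:
        return "spam-only"
    return None  # mixed token: dropped from both lists


ham_count, spam_count = 1400, 22
probability = spam_probability(ham_count, spam_count)
verdict = pure_list_verdict(ham_count, spam_count)

print(f"count-based spam probability: {probability:.3f}")
print(f"pure-list verdict: {verdict}")
```

The count-based view still treats the token as a strong ham indicator, while the pure-list view throws it away.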

Regards,

Dianne.


Re: Can your bayes do this?

2016-01-21 Thread Dianne Skoll
On Wed, 20 Jan 2016 22:21:49 -0800
Marc Perkel  wrote:

> Here is a list of 5505874 words and phrases used in the subject line
> of HAM and never seen in the subject line of SPAM

> Here is a list of 3494938 words and phrases used in the subject line
> of SPAM and never seen in the subject line of HAM

[snip]

And what, exactly, is your point?  Bayes would handle that just fine.
Tokens in your first list would score 0.00 for spam probability and
tokens in your second list would score 1.00 and Bayes would be great.

Regards,

Dianne.


Re: Can your bayes do this?

2016-01-21 Thread Christian Laußat

On 21.01.2016 at 13:19, Reindl Harald wrote:

not entirely when "Currently, SA's bayes tokens are single words" from
https://mail-archives.apache.org/mod_mbox/spamassassin-dev/201211.mbox/%3c509d55a8.30...@gmail.com%3E
is still true

please review that response below and consider 2/4 word tokens
*additionally* in the SA-tokenizer and it will beat out the "new
magic" easily with a well trained bayes in all cases


Bogofilter has an option to specify how many tokens to put into bayes. 
Here is an analysis of how effective this was:

http://www.bogofilter.org/pipermail/bogofilter-dev/2006q3/003349.html

In my opinion it's not worth the effort. You'll blow up your database 
for little better matching rate.


--
Christian Laußat
https://blog.laussat.de


Re: Can your bayes do this?

2016-01-21 Thread RW
On Thu, 21 Jan 2016 13:19:20 +0100
Reindl Harald wrote:

> On 21.01.2016 at 13:11, RW wrote:
> > On Wed, 20 Jan 2016 22:21:49 -0800
> > Marc Perkel wrote:
> >  
> >> OK - Just to show you this isn't Bayesian - see if you can do this.
> >>
> >> Here is a list of 5505874 words and phrases used in the subject
> >> line of HAM and never seen in the subject line of SPAM
> >>
> >> http://www.junkemailfilter.com/data/subject-ham.txt
> >>
> >> Here is a list of 3494938 words and phrases used in the subject
> >> line of SPAM and never seen in the subject line of HAM
> >>
> >> http://www.junkemailfilter.com/data/subject-spam.txt
> >>
> >> Hope you understand it now. Not Bayesian  
> >
> >
> > the only difference between
> >
> >
> >"ambulatory care" -> only in ham
> >"aall cards"  -> only in spam
> >
> > and
> >
> > "ambulatory care"  occurs 16 times in ham and 0 times in spam
> > "aall cards"   occurs  0 times in ham and 3 times in spam
> >
> > is that you have discarded the count information  
> 
> not entirely when "Currently, SA's bayes tokens are single words" from 


Yes, obviously. The assertion was that it's doing something that a
Bayesian filter can't  -  not specifically Bayes.


Re: Looking for a way to dump spam assassin modified mail

2016-01-21 Thread Robert Moskowitz

I use amavis-new to do this:

amavisd.conf

$log_level = 1; # set the log level to one
$sa_tag_level_deflt = -999; # i want to see the headers so change to -999
$sa_tag2_level_deflt = 5.0; # start with 5
$sa_kill_level_deflt = 9; # change to 9
$sa_dsn_cutoff_level = 9; # change to 9
$sa_quarantine_cutoff_level = 50; # remove the starting # and change to 50
$notify_method = 'smtp:[127.0.0.1]:10025'; # uncomment the line
$forward_method = 'smtp:[127.0.0.1]:10025'; # uncomment the line
$final_banned_destiny = D_BOUNCE; # change to D_DISCARD



On 01/21/2016 07:25 AM, Robert Chalmers wrote:
I’m looking for a way to just dump mail that has the header modified 
with the * SPAM * assignment.


I mean, not have the Client mail reader do it, just have either spamd, 
or postfix/dovecot  dump it.


I’m sure I’ve seen something about doing this, but can’t find it 
now…. lost in al the configurations.


thanks



Robert Chalmers
rob...@chalmers.com .au Quantum Radio: 
http://tinyurl.com/lwwddov
Mac mini 6.2 - 2012, Intel Core i7,2.3 GHz, Memory:16 GB. El-Capitan 
10.11. 2TB Storage made up of -
Drive 0:HGST HTS721010A9E630. Upper bay. Drive 1:ST1000LM024 
HN-M101MBB. Lower Bay








Re: Looking for a way to dump spam assassin modified mail

2016-01-21 Thread Reindl Harald



On 21.01.2016 at 13:34, Antony Stone wrote:

On Thursday 21 January 2016 at 13:31:29, Reindl Harald wrote:


On 01/21/2016 01:25 PM, Robert Chalmers wrote:

I’m looking for a way to just dump mail that has the header modified
with the * SPAM * assignment.

I mean, not have the Client mail reader do it, just have either spamd,
or postfix/dovecot  dump it.

I’m sure I’ve seen something about doing this, but can’t find it now….
lost in al the configurations.


http://linux.die.net/man/1/spamass-milter


Does that work with postfix?  (Surely not with dovecot...)


surely, http://www.postfix.org/MILTER_README.html

just make sure you have proper "postscreen_dnsbl_sites" config and 
"postscreen_dnsbl_action = enforce" in front so that your contentfilter 
only needs to deal with the remaining 10% of delivery attempts


smtpd_milters = unix:/run/spamass-milter/spamass-milter.sock

sa-milt   869  0.0  2.2 311972 91044 ?  Ss   Jan20   0:20 /usr/bin/perl -T -w /usr/bin/spamd --max-children=20 --min-children=5 --min-spare=5 --max-spare=10 --max-conn-per-child=200 --socketpath=/run/spamassassin/spamassassin.sock --socketmode=0666
sa-milt   870  0.0  0.2 453832 10160 ?  SNsl Jan20   0:14 /usr/sbin/spamass-milter -p /run/spamass-milter/spamass-milter.sock -g sa-milt -r 8.0 -- -s 10485760 --socket=/run/spamassassin/spamassassin.sock






Re: Looking for a way to dump spam assassin modified mail

2016-01-21 Thread Jari Fredriksson

On 21.1.2016 at 14:34, Antony Stone wrote:

On Thursday 21 January 2016 at 13:31:29, Reindl Harald wrote:


> On 01/21/2016 01:25 PM, Robert Chalmers wrote:
>> I’m looking for a way to just dump mail that has the header modified
>> with the * SPAM * assignment.
>>
>> I mean, not have the Client mail reader do it, just have either spamd,
>> or postfix/dovecot  dump it.
>>
>> I’m sure I’ve seen something about doing this, but can’t find it now….
>> lost in al the configurations.

http://linux.die.net/man/1/spamass-milter


Does that work with postfix?  (Surely not with dovecot...)


Antony.


Yes. it works with postfix.

--
jarif.bit



Re: Looking for a way to dump spam assassin modified mail

2016-01-21 Thread Antony Stone
On Thursday 21 January 2016 at 13:31:29, Reindl Harald wrote:

> > On 01/21/2016 01:25 PM, Robert Chalmers wrote:
> >> I’m looking for a way to just dump mail that has the header modified
> >> with the * SPAM * assignment.
> >> 
> >> I mean, not have the Client mail reader do it, just have either spamd,
> >> or postfix/dovecot  dump it.
> >> 
> >> I’m sure I’ve seen something about doing this, but can’t find it now….
> >> lost in al the configurations.
>
> http://linux.die.net/man/1/spamass-milter

Does that work with postfix?  (Surely not with dovecot...)


Antony.

-- 
Wanted: telepath.   You know where to apply.

   Please reply to the list;
 please *don't* CC me.


Re: Looking for a way to dump spam assassin modified mail

2016-01-21 Thread Reindl Harald



On 21.01.2016 at 13:29, Axb wrote:

On 01/21/2016 01:25 PM, Robert Chalmers wrote:

I’m looking for a way to just dump mail that has the header modified
with the * SPAM * assignment.

I mean, not have the Client mail reader do it, just have either spamd,
or postfix/dovecot  dump it.

I’m sure I’ve seen something about doing this, but can’t find it now….
lost in al the configurations.


if "dump" means delete - you may want to rethink this - it's safer to
MOVE to a "junk" folder...

Whatever, as you have Dovecot, I'd recommend using a sieve rule to do
whatever you decide is best with tagged spam


it most likely means to properly REJECT 100% clear spam
http://linux.die.net/man/1/spamass-milter

* tag subject above 5.0
* reject above 8.0

normally you don't want to move *all* spam, even the crap with a score 
above 50, into a junk folder; you want just the mails between 5.0 
and 8.0 there
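
The policy above can be sketched in a few lines of Python. This only illustrates the thresholds mentioned here (tag at 5.0, reject at 8.0); `action_for` is a made-up helper, not part of spamass-milter or SpamAssassin, and treating a score exactly on a threshold as belonging to the higher band is an assumption:

```python
TAG_THRESHOLD = 5.0     # tag the subject from this score upwards
REJECT_THRESHOLD = 8.0  # reject at SMTP time from this score upwards


def action_for(score):
    """Map a spam score to what happens to the message."""
    if score >= REJECT_THRESHOLD:
        return "reject"
    if score >= TAG_THRESHOLD:
        return "tag"  # delivered with a tagged subject, e.g. sorted into a junk folder
    return "deliver"


for score in (1.2, 6.3, 8.0, 50.0):
    print(f"{score:5.1f} -> {action_for(score)}")
```

Only the middle band ever reaches the junk folder; everything at 8.0 or above is refused before it is stored anywhere.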






Re: Looking for a way to dump spam assassin modified mail

2016-01-21 Thread Axb

On 01/21/2016 01:25 PM, Robert Chalmers wrote:

I’m looking for a way to just dump mail that has the header modified with the 
* SPAM * assignment.

I mean, not have the Client mail reader do it, just have either spamd, or 
postfix/dovecot  dump it.

I’m sure I’ve seen something about doing this, but can’t find it now…. lost in 
al the configurations.



if "dump" means delete - you may want to rethink this - it's safer to 
MOVE to a "junk" folder...


Whatever, as you have Dovecot, I'd recommend using a sieve rule to do 
whatever you decide is best with tagged spam.





Looking for a way to dump spam assassin modified mail

2016-01-21 Thread Robert Chalmers
I’m looking for a way to just dump mail that has the header modified with the 
* SPAM * assignment.

I mean, not have the Client mail reader do it, just have either spamd, or 
postfix/dovecot  dump it.

I’m sure I’ve seen something about doing this, but can’t find it now…. lost in 
al the configurations.

thanks



Robert Chalmers
rob...@chalmers.com .au  Quantum Radio: 
http://tinyurl.com/lwwddov
Mac mini 6.2 - 2012, Intel Core i7,2.3 GHz, Memory:16 GB. El-Capitan 10.11. 2TB 
Storage made up of - 
Drive 0:HGST HTS721010A9E630. Upper bay. Drive 1:ST1000LM024 HN-M101MBB. Lower 
Bay





Re: Can your bayes do this?

2016-01-21 Thread Reindl Harald



On 21.01.2016 at 13:11, RW wrote:

On Wed, 20 Jan 2016 22:21:49 -0800
Marc Perkel wrote:


OK - Just to show you this isn't Bayesian - see if you can do this.

Here is a list of 5505874 words and phrases used in the subject line
of HAM and never seen in the subject line of SPAM

http://www.junkemailfilter.com/data/subject-ham.txt

Here is a list of 3494938 words and phrases used in the subject line
of SPAM and never seen in the subject line of HAM

http://www.junkemailfilter.com/data/subject-spam.txt

Hope you understand it now. Not Bayesian



the only difference between


   "ambulatory care" -> only in ham
   "aall cards"  -> only in spam

and

"ambulatory care"  occurs 16 times in ham and 0 times in spam
"aall cards"   occurs  0 times in ham and 3 times in spam

is that you have discarded the count information


not entirely when "Currently, SA's bayes tokens are single words" from 
https://mail-archives.apache.org/mod_mbox/spamassassin-dev/201211.mbox/%3c509d55a8.30...@gmail.com%3E 
is still true


please review that response below and consider 2/4 word tokens 
*additionally* in the SA-tokenizer and it will beat out the "new magic" 
easily with a well trained bayes in all cases


-------- Forwarded Message --------
Subject: Re: My new method for blocking spam - REVEALED!
Date: Wed, 20 Jan 2016 15:20:01 -0500
From: Dianne Skoll 
Organization: Roaring Penguin Software Inc.
To: users@spamassassin.apache.org

On Wed, 20 Jan 2016 12:11:02 -0800
Marc Perkel  wrote:

> Again - it's not about matching as Bayes does. It's about not
> matching.

It's not about not matching. It's about a preprocessing step that
discards tokens that don't have extreme probabilities.

I think your method works as well as it does because you're using up
to four-word phrases as tokens. The rest of the method is nonsense, but
the four-word phrase tokens are the magic ingredient; they'd make Bayes 
work awesomely also.
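
A minimal sketch of what "up to four-word phrases as tokens" could look like, assuming a simple lowercase word split. The function name phrase_tokens and the regular expression are illustrative assumptions, not SpamAssassin's tokenizer or the method from the quoted message.

```python
import re

def phrase_tokens(text, max_words=4):
    """Return every run of 1..max_words consecutive words as one token."""
    # Assumption: a "word" is a run of letters, digits or apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    tokens = []
    for n in range(1, max_words + 1):
        # Slide a window of n words across the subject line.
        for i in range(len(words) - n + 1):
            tokens.append(" ".join(words[i:i + n]))
    return tokens

if __name__ == "__main__":
    for token in phrase_tokens("Ambulatory care update"):
        print(token)
```

Feeding these phrase tokens into an ordinary Bayes classifier in addition to the single words is the change being suggested; the token count per message grows roughly fourfold, which is why database size matters.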






Re: Can your bayes do this?

2016-01-21 Thread Antony Stone
On Thursday 21 January 2016 at 13:11:15, RW wrote:

> On Wed, 20 Jan 2016 22:21:49 -0800 Marc Perkel wrote:
> > OK - Just to show you this isn't Bayesian - see if you can do this.
> > 
> > Here is a list of 5505874 words and phrases used in the subject line
> > of HAM and never seen in the subject line of SPAM
> > 
> > http://www.junkemailfilter.com/data/subject-ham.txt
> > 
> > Here is a list of 3494938 words and phrases used in the subject line
> > of SPAM and never seen in the subject line of HAM
> > 
> > http://www.junkemailfilter.com/data/subject-spam.txt
> > 
> > Hope you understand it now. Not Bayesian
> 
> the only difference between
> 
> 
>   "ambulatory care" -> only in ham
>   "aall cards"  -> only in spam
> 
> and
> 
>"ambulatory care"  occurs 16 times in ham and 0 times in spam
> 
>"aall cards"   occurs  0 times in ham and 3 times in spam
> 
> is that you have discarded the count information.

Plus, the "never in ham" and "never in spam" lists omit any mention of words & 
phrases which exist in differing proportions in both - Bayes includes that, and 
I would expect that a spam identifier which takes account of as many known 
characteristics of spam/ham as possible is going to do the best job.
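
To illustrate the point about tokens that appear in both classes, here is a rough sketch of a per-token spam estimate computed from counts. It assumes a simple frequency-ratio formula and hypothetical corpus sizes of 1000 messages each; it is not the formula SpamAssassin actually uses.

```python
def spam_probability(spam_count, ham_count, n_spam, n_ham):
    """Estimate how spammy a token is from how often it occurs in each class."""
    spam_freq = spam_count / n_spam
    ham_freq = ham_count / n_ham
    if spam_freq + ham_freq == 0:
        return 0.5  # token never seen: no evidence either way
    return spam_freq / (spam_freq + ham_freq)

if __name__ == "__main__":
    # Hypothetical corpus sizes, chosen only for illustration.
    n_spam, n_ham = 1000, 1000
    # A token seen only in spam, one seen only in ham, and a hypothetical
    # token seen in both, which a "never in ham / never in spam" list drops.
    print(spam_probability(3, 0, n_spam, n_ham))
    print(spam_probability(0, 16, n_spam, n_ham))
    print(spam_probability(30, 10, n_spam, n_ham))
```

The third case is the one the two lists cannot express: the token carries real evidence, but it is neither "only in ham" nor "only in spam".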


Antony.

-- 
Software development can be quick, high quality, or low cost.

The customer gets to pick any two out of three.

   Please reply to the list;
 please *don't* CC me.


Re: Can your bayes do this?

2016-01-21 Thread RW
On Wed, 20 Jan 2016 22:21:49 -0800
Marc Perkel wrote:

> OK - Just to show you this isn't Bayesian - see if you can do this.
> 
> Here is a list of 5505874 words and phrases used in the subject line
> of HAM and never seen in the subject line of SPAM
> 
> http://www.junkemailfilter.com/data/subject-ham.txt
> 
> Here is a list of 3494938 words and phrases used in the subject line
> of SPAM and never seen in the subject line of HAM
> 
> http://www.junkemailfilter.com/data/subject-spam.txt
> 
> Hope you understand it now. Not Bayesian


the only difference between


  "ambulatory care" -> only in ham
  "aall cards"  -> only in spam

and 
   

   "ambulatory care"  occurs 16 times in ham and 0 times in spam
   
   "aall cards"   occurs  0 times in ham and 3 times in spam

is that you have discarded the count information.
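
A small sketch of that relationship: the two "only in" lists can be derived from a count table by throwing the counts away. The counts for the first two phrases are the ones given above; the third phrase and the function name only_in_lists are hypothetical, added only to show what gets dropped.

```python
def only_in_lists(counts):
    """Reduce a token count table to 'only in ham' and 'only in spam' lists."""
    ham_only = sorted(t for t, c in counts.items()
                      if c["ham"] > 0 and c["spam"] == 0)
    spam_only = sorted(t for t, c in counts.items()
                       if c["spam"] > 0 and c["ham"] == 0)
    return ham_only, spam_only

counts = {
    "ambulatory care": {"ham": 16, "spam": 0},
    "aall cards": {"ham": 0, "spam": 3},
    # Hypothetical token seen in both classes; it lands in neither list.
    "free offer": {"ham": 2, "spam": 40},
}

if __name__ == "__main__":
    ham_only, spam_only = only_in_lists(counts)
    print("only in ham: ", ham_only)
    print("only in spam:", spam_only)
```

Going the other way is not possible: from the lists alone, the 16 and the 3 cannot be recovered.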



Re: Can your bayes do this?

2016-01-21 Thread Reindl Harald



On 21.01.2016 at 07:21, Marc Perkel wrote:

OK - Just to show you this isn't Bayesian - see if you can do this.

Here is a list of 5505874 words and phrases used in the subject line of
HAM and never seen in the subject line of SPAM

http://www.junkemailfilter.com/data/subject-ham.txt

Here is a list of 3494938 words and phrases used in the subject line of
SPAM and never seen in the subject line of HAM

http://www.junkemailfilter.com/data/subject-spam.txt

Hope you understand it now. Not Bayesian


don't get me wrong, but i don't take anybody seriously who needs "" and 
if you don't stop advertising that aggressively you will be classified as a 
spammer too


177 MB just for subjects?

well, not really impressive given that i easily get the same results with 
an 81 MB bayes-db containing the *complete* junk of 1.5 years but only 
selected ham (mail reported as wrongly classified, my personal mail and a 
few inboxes from nice users)


when i can get the same results with a 600 MB corpus containing around 
81000 messages, the only thing i understand now is that it's not really 
efficient and needs access to all mails for training, which is a no-go


[harry@srv-rhsoft:~]$ curl --head http://www.junkemailfilter.com/data/subject-spam.txt

HTTP/1.1 200 OK
Date: Thu, 21 Jan 2016 08:12:15 GMT
Server: Apache/2.2.15 (CentOS)
Last-Modified: Thu, 21 Jan 2016 06:11:41 GMT
ETag: "340315d-446e47c-529d1f9f0676b"
Accept-Ranges: bytes
Content-Length: 71754876
Connection: close
Content-Type: text/plain

[harry@srv-rhsoft:~]$ curl --head http://www.junkemailfilter.com/data/subject-ham.txt

HTTP/1.1 200 OK
Date: Thu, 21 Jan 2016 08:12:25 GMT
Server: Apache/2.2.15 (CentOS)
Last-Modified: Thu, 21 Jan 2016 06:09:18 GMT
ETag: "340309c-645b7a1-529d1f16ad5db"
Accept-Ranges: bytes
Content-Length: 105232289
Connection: close
Content-Type: text/plain
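
For reference, the 177 MB figure is the sum of the two Content-Length values above, counted in decimal megabytes. A quick check, with variable names chosen only for this sketch:

```python
# Content-Length values taken from the two curl --head responses.
spam_bytes = 71754876
ham_bytes = 105232289

total_bytes = spam_bytes + ham_bytes
total_mb = total_bytes / 1_000_000      # decimal megabytes
total_mib = total_bytes / (1024 ** 2)   # binary mebibytes, for comparison

if __name__ == "__main__":
    print(total_bytes)
    print(round(total_mb, 1), "MB")
    print(round(total_mib, 1), "MiB")
```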


