Re: Spam messages autolearned as ham

2014-09-29 Thread Jari Fredriksson
26.09.2014, 02:53, Amir Caspi wrote:
> As a result, I've got plenty of "great" fresh spam to feed the filter.  I've 
> also got plenty of great ham.
Could you take part in the mass-checks? Currently the SpamAssassin
mass-checks seem to need more fresh spam and ham. It would be great to
have you on the team.

-- 
jarif.bit






Re: Spam messages autolearned as ham

2014-09-26 Thread Matus UHLAR - fantomas

I'm not sure wiping BAYES is needed, unless training does not


On 26.09.14 09:11, John Hardin wrote:
He has autolearn running. Unless he has copies of the spams that were 
learned as ham, there's no way to totally undo that short of wipe and 
start over from scratch.


that depends on how many of the messages were incorrectly trained/classified


You *did* keep your initial Bayes training corpora, right?


this is a very good idea. Maybe at least keep all autolearned spam
and ham for some time, just for the possibility of retraining.


The critical part is to have base corpora of *correctly classified* 
(i.e. manually reviewed) messages. If you're keeping copies of 
autolearned messages (which will probably be quite a few) then you 
*need* to *manually review* them before using them for retraining, 
otherwise you'll probably end up simply rebuilding a mistrained 
database.


my point was: when someone relies on autolearn, the messages that were
autolearned are the ones that need to be kept, since they affect scoring
the most. if they were learned incorrectly, they make the biggest difference
when re-trained properly.

...since autolearning is quite common, this is one of the easiest ways to
recover from mistraining.

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
(R)etry, (A)bort, (C)ancer


Re: Spam messages autolearned as ham

2014-09-26 Thread Reindl Harald

On 26.09.2014 at 18:19, John Hardin wrote:
> On Fri, 26 Sep 2014, Reindl Harald wrote:
> 
>> frankly the biggest problem is the large number of idiots who hit
>> a "spam" button whenever they can to stop receiving some sort of
>> mail - i had that even in my own family: "can't you block that?"
>> followed by "yes" after asking "have you subscribed there?"
>>
>> guess how likely it is for a large mail provider sending
>> 100% clean mail with double opt-in to 10000 persons when
>> 5% are complete idiots - it results in 500 spam reports about
>> the same message - the hardest job in maintaining a blacklist
>> is to catch those idiots and prevent harm to the innocent
> 
> This is exactly why user-submitted training messages should be manually 
> reviewed before being learned

i know - the 3 messages i talked about were in fact not spam either, but
were composed so stupidly that they matched a lot of common spam tokens and
so neutralized them in bayes, which was enough to get spam scored 0.2 points
under the milter-reject value

the current bayes contains 3466 messages, all reviewed by me

it showed me that even review is not error-proof, hence the word
"autolearn" is a no-go for me - i doubt that software
learning from its own decisions will always get it right

i saw that multiple times with a "Barracuda Spamfirewall",
which also uses SA behind the scenes with autolearning - after
a new deployment all was just fine, some manual training
and perfect results

over the months it got worse every time

* more junk slipped through
* more and more legit mail got tagged








Re: Spam messages autolearned as ham

2014-09-26 Thread John Hardin

On Fri, 26 Sep 2014, Reindl Harald wrote:


frankly the biggest problem is the large number of idiots who hit
a "spam" button whenever they can to stop receiving some sort of
mail - i had that even in my own family: "can't you block that?"
followed by "yes" after asking "have you subscribed there?"

guess how likely it is for a large mail provider sending
100% clean mail with double opt-in to 10000 persons when
5% are complete idiots - it results in 500 spam reports about
the same message - the hardest job in maintaining a blacklist
is to catch those idiots and prevent harm to the innocent


This is exactly why user-submitted training messages should be manually 
reviewed before being learned.


--
 John Hardin KA7OHZ    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174    pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  The difference between ignorance and stupidity is that the stupid
  desire to remain ignorant. -- Jim Bacon
---
 848 days since the first successful private support mission to ISS (SpaceX)


Re: Spam messages autolearned as ham

2014-09-26 Thread John Hardin

On Fri, 26 Sep 2014, Matus UHLAR - fantomas wrote:


On 25.09.14 07:51, John Hardin wrote:
You are probably going to have to wipe and retrain your bayes database from 
scratch using known-good (i.e. hand classified) corpora. I also suggest 
turning off autolearn.


I'm not sure wiping BAYES is needed, unless training does not


He has autolearn running. Unless he has copies of the spams that were 
learned as ham, there's no way to totally undo that short of wipe and 
start over from scratch.



You *did* keep your initial Bayes training corpora, right?


this is a very good idea. Maybe at least keep all autolearned spam
and ham for some time, just for the possibility of retraining.


The critical part is to have base corpora of *correctly classified* (i.e. 
manually reviewed) messages. If you're keeping copies of autolearned 
messages (which will probably be quite a few) then you *need* to *manually 
review* them before using them for retraining, otherwise you'll probably 
end up simply rebuilding a mistrained database.


If you have users submitting FP/FN messages for training, and the admin 
verifies them before training with them (which should be done unless the 
judgement and responsibility of the user in question is trusted), that's a 
good source for part of your base retraining corpora.


--
 John Hardin KA7OHZ    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174    pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  The difference between ignorance and stupidity is that the stupid
  desire to remain ignorant. -- Jim Bacon
---
 848 days since the first successful private support mission to ISS (SpaceX)


Re: Spam messages autolearned as ham

2014-09-26 Thread Reindl Harald

On 26.09.2014 at 17:57, Matus UHLAR - fantomas wrote:
> On 25.09.14 16:07, Reindl Harald wrote:
>> that's why BAYES_99 goes together with score BAYES_999 0.5, since
>> it hits only very rarely on legit mail, and those messages
>> mostly have whitelists, SPF, DKIM to keep the result below 8
>>
>> score BAYES_00 -2.5
>> score BAYES_05 -0.7
>> score BAYES_20 -0.06
>> score BAYES_40 -0.03
>> score BAYES_50 2.0
>> score BAYES_60 3.0
>> score BAYES_80 3.7
>> score BAYES_95 5.8
>> score BAYES_99 7.5
>> score BAYES_999 0.5
> 
> FYI, the SA scores are currently these:

default, yes

> score BAYES_00  0  0 -1.5   -1.9
> score BAYES_05  0  0 -0.3   -0.5
> score BAYES_20  0  0 -0.001 -0.001
> score BAYES_40  0  0 -0.001 -0.001
> score BAYES_50  0  0  2.0    0.8
> score BAYES_60  0  0  2.5    1.5
> score BAYES_80  0  0  2.7    2.0
> score BAYES_95  0  0  3.2    3.0
> score BAYES_99  0  0  3.8    3.5
> score BAYES_999 0  0  0.2    0.2
> 
> as you can see, they produce no score when BAYES is disabled (first two
> numbers)

well, if bayes is disabled, bayes doesn't produce a score :-)

> I don't recommend changing them unless you know what you are doing (and
> risking)

i know - sa-milter here blocks above 8.0 and my bayes contains at the
moment 3200 manually trained messages (half ham, half spam)
with no autolearning

in that environment 3.7 points are useless for 99.999% sure spam.
currently we reject around 3 messages per day and accept 3051.
in fact nearly zero spam ever touches an inbox, and over 4 weeks there
were 5 complaints about a false positive, none of them SA related

as long as there is no whitelist, SPF match and so on, the training data
is trusted to reject, and at the same time, if other SA rules
are hit, including a URIBL, the -2.5 is trusted to prevent false
positives caused by some idiot reporting legit mail to a blacklist
instead of hitting the unsubscribe button, which happens way too often

frankly the biggest problem is the large number of idiots who hit
a "spam" button whenever they can to stop receiving some sort of
mail - i had that even in my own family: "can't you block that?"
followed by "yes" after asking "have you subscribed there?"

guess how likely it is for a large mail provider sending
100% clean mail with double opt-in to 10000 persons when
5% are complete idiots - it results in 500 spam reports about
the same message - the hardest job in maintaining a blacklist
is to catch those idiots and prevent harm to the innocent





Re: Spam messages autolearned as ham

2014-09-26 Thread John Hardin

On Fri, 26 Sep 2014, Matus UHLAR - fantomas wrote:


your own caching DNS server? does your mail server use it?
You seem to have too much mail then.


Be careful with terminology there. It's not whether it's caching, it's 
whether it forwards lookups to an upstream DNS server. You can have a 
caching forwarding DNS server that will reduce your upstream traffic but 
still rely entirely on the upstream server. That is likely what is 
currently set up here, and is the cause of the URIBL_BLOCKED problem.


--
 John Hardin KA7OHZ    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174    pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  The difference between ignorance and stupidity is that the stupid
  desire to remain ignorant. -- Jim Bacon
---
 848 days since the first successful private support mission to ISS (SpaceX)


Re: Spam messages autolearned as ham

2014-09-26 Thread Matus UHLAR - fantomas

On 25.09.14 07:51, John Hardin wrote:
You are probably going to have to wipe and retrain your bayes 
database from scratch using known-good (i.e. hand classified) 
corpora. I also suggest turning off autolearn.


I'm not sure wiping BAYES is needed, unless training does not 


You *did* keep your initial Bayes training corpora, right?


this is a very good idea. Maybe at least keep all autolearned spam
and ham for some time, just for the possibility of retraining.

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
2B|!2B, that's a question!


Re: Spam messages autolearned as ham

2014-09-26 Thread Matus UHLAR - fantomas

On 25.09.14 16:07, Reindl Harald wrote:

that's why BAYES_99 goes together with score BAYES_999 0.5, since
it hits only very rarely on legit mail, and those messages
mostly have whitelists, SPF, DKIM to keep the result below 8

score BAYES_00 -2.5
score BAYES_05 -0.7
score BAYES_20 -0.06
score BAYES_40 -0.03
score BAYES_50 2.0
score BAYES_60 3.0
score BAYES_80 3.7
score BAYES_95 5.8
score BAYES_99 7.5
score BAYES_999 0.5


FYI, the SA scores are currently these:

score BAYES_00  0  0 -1.5   -1.9
score BAYES_05  0  0 -0.3   -0.5
score BAYES_20  0  0 -0.001 -0.001
score BAYES_40  0  0 -0.001 -0.001
score BAYES_50  0  0  2.0    0.8
score BAYES_60  0  0  2.5    1.5
score BAYES_80  0  0  2.7    2.0
score BAYES_95  0  0  3.2    3.0
score BAYES_99  0  0  3.8    3.5
score BAYES_999 0  0  0.2    0.2

as you can see, they produce no score when BAYES is disabled (first two
numbers)

I don't recommend changing them unless you know what you are doing (and
risking).
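
To make the four columns concrete, here is a minimal Python sketch of how a
four-value score line is usually read. It assumes the column order described
in the SpamAssassin configuration docs (no Bayes/no network, no Bayes/network,
Bayes/no network, Bayes/network); the helper name pick_score is made up for
this illustration and is not part of SpamAssassin.

```python
def pick_score(score_line: str, bayes_enabled: bool, net_enabled: bool) -> float:
    """Return the value a `score` line would contribute for a given setup.

    Assumed column order for four-value lines:
      0: Bayes off, network tests off
      1: Bayes off, network tests on
      2: Bayes on,  network tests off
      3: Bayes on,  network tests on
    A line with a single value applies that value in every setup.
    """
    parts = score_line.split()
    if len(parts) < 3 or parts[0] != "score":
        raise ValueError(f"not a score line: {score_line!r}")
    values = [float(v) for v in parts[2:]]
    if len(values) == 1:
        return values[0]
    if len(values) != 4:
        raise ValueError("expected one or four score values")
    index = (2 if bayes_enabled else 0) + (1 if net_enabled else 0)
    return values[index]


default_line = "score BAYES_00  0  0 -1.5   -1.9"
print(pick_score(default_line, bayes_enabled=True, net_enabled=True))
print(pick_score(default_line, bayes_enabled=False, net_enabled=True))
```

A single-value override such as "score BAYES_00 -2.5" replaces all four
columns at once, which is why the local overrides quoted above behave the
same with and without network tests.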

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
LSD will make your ECS screen display 16.7 million colors


Re: Spam messages autolearned as ham

2014-09-26 Thread Matus UHLAR - fantomas

thoughts? you changed a score and SA did what you told it to.


On 25.09.14 11:06, Deeztek Support wrote:

I changed it as per the suggestion of Matus UHLAR - fantomas


no. I wondered why you had changed it to zero, when the SA rules have
negative values.


I apparently forgot to note that you should comment out the setting and use
the one from the SA rules.


as already suggested by John Hardin, fix URIBL_BLOCKED=0.001

"Also: URIBL_BLOCKED - you really want to set up a local recursive
(non-forwarding) DNS server for SA so that your URIBL lookups will work,
that might help a lot. "


I can certainly try that, however seeing that I'm implementing block 
lists on the postfix level, wouldn't that double the lookups?


certainly not. you have doubled them when you set up checks in postfix.
SA clearly uses them too, otherwise it would not get the BLOCKED reply.


And as an FYI, I'm running my own DNS server.


your own caching DNS server? does your mail server use it?
You seem to have too much mail then.

(local uribl mirror should be free of charge iirc)

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Chernobyl was an Windows 95 beta test site.


Re: Spam messages autolearned as ham

2014-09-26 Thread Matus UHLAR - fantomas

On 9/25/2014 6:31 AM, Matus UHLAR - fantomas wrote:
> I recommend you clear the score for RP_MATCHES_RCVD... apparently too many
> FNs, as you can see here



On 09/25/2014 03:26 PM, Deeztek Support wrote:

How would I go about clearing out the RP_MATCHES_RCVD score?


On 25.09.14 15:38, Axb wrote:
If you disable the rule it will probably mess up a bunch of metas so 
the best may be to set


there are no meta rules using RP_MATCHES_RCVD - for those we have
__RP_MATCHES_RCVD, and all available metas (at least in the SA rules) use
that one.

therefore, anyone can easily disable RP_MATCHES_RCVD without negative
effects.


--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Enter any 12-digit prime number to continue.


Re: Spam messages autolearned as ham

2014-09-25 Thread Daniel Staal
--As of September 25, 2014 11:13:16 AM -0400, Deeztek Support is alleged to 
have said:




You *did* keep your initial Bayes training corpora, right?



I have an account that I have used to sign up for everything under the
sun over the past 10 years. It's a goldmine for spam. I figured I'd use
that to train the Bayes.


--As for the rest, it is mine.

If it's not the same types of spam as your main mail accounts, it's pretty 
much useless for bayes training.  Check.  ;)


Also: Make sure you train enough ham.  Bayes needs to learn what's 
*different* about spam and ham.


Daniel T. Staal

---
This email copyright the author.  Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes.  This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---


Re: Spam messages autolearned as ham

2014-09-25 Thread Amir Caspi
On Sep 25, 2014, at 10:35 AM, Axb  wrote:

> imo, fresh spam is the best spam.

I've got plenty...

> Nowadays, we tend to reject most good fodder with all kinds of methods at 
> SMTP level and what's left is often hardly enough to keep a bayes DB well fed.

In my case, spam is quarantined but not rejected.  All of my SMTP rejects are 
DNSBL-based.  As a result, I've got plenty of "great" fresh spam to feed the 
filter.  I've also got plenty of great ham.

I get about 5-10 FNs every day, most of them are due to new spam templates that 
my local.cf isn't catching, but occasionally a few will get BAYES_00.  This is 
happening a LOT less these days than it used to, but I'm still considering 
whether to nuke and rebuild my DB...

--- Amir



Re: Spam messages autolearned as ham

2014-09-25 Thread John Hardin

On Thu, 25 Sep 2014, Deeztek Support wrote:


On 9/25/2014 1:25 PM, John Hardin wrote:.


 While your Postfix may be doing DNS blocklist checks on the sending MTA,
 I sincerely doubt that Postfix is parsing message bodies to pull out URI
 domains and checking them. That's what URIBL is.


Is there a place to configure the URIBLs that SA uses or is it just built-in?


You can add rules for custom URIBL lookups. There is a base set that's 
enabled by default and which can be disabled. I don't fiddle with that 
much (my install is stable) so I don't know the details right off the top 
of my head.


IOW, "see the docs". :)


 Also, even if Postfix *was* doing that, the "URIBL_BLOCKED" rule hit
 indicates a local configuration that would likely also be affecting
 Postfix. So, yes, Postfix *might* be doing URIBL lookups, but if it is
 it's probably also getting the BLOCKED result.


Actually that's not happening at all. None of the lists we are using are 
blocking us.


You are getting a URIBL_BLOCKED rule hit. The URIBL servers *are* blocking 
your queries for overuse. That's what that rule means.


Note that it says nothing about DNSBL queries, only URIBL queries.
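
As an illustration of what a "blocked" answer looks like, here is a small
Python sketch that decodes an A record returned for a multi.uribl.com lookup.
It assumes the bit values used by the stock URIBL rules as I remember them
(1 = blocked, 2 = black, 4 = grey, 8 = red); check them against your own rule
files before relying on them. The function name decode_uribl_answer is
hypothetical.

```python
import ipaddress
from typing import List

# Assumed bitmask values for multi.uribl.com answers (verify against your rules).
URIBL_BITS = {
    1: "URIBL_BLOCKED",  # query refused, typically a shared/forwarding resolver over quota
    2: "URIBL_BLACK",
    4: "URIBL_GREY",
    8: "URIBL_RED",
}


def decode_uribl_answer(address: str) -> List[str]:
    """Map a 127.0.0.x answer to the rule names whose bit is set in x."""
    octets = ipaddress.IPv4Address(address).packed
    if octets[:3] != bytes([127, 0, 0]):
        raise ValueError(f"not a 127.0.0.x list answer: {address}")
    mask = octets[3]
    return [name for bit, name in URIBL_BITS.items() if mask & bit]


print(decode_uribl_answer("127.0.0.1"))
print(decode_uribl_answer("127.0.0.2"))
```

The point of the sketch: a blocked resolver still gets an answer, just not a
listing, so URI lookups silently stop contributing anything useful.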


 If you're running your own DNS server, it's apparently set to forward to
 a large upstream DNS server that's aggregating other queries with yours
 (i.e. a standard DNS setup). "URIBL_BLOCKED" means the DNS server that's
 actually hitting the URIBL server (your upstream) has exceeded the
 "free" query limit.


You are right it is using an upstream server (opendns.com)


Yep.


 You might not want to switch your DNS to be recursive rather than
 forwarding for *all* your queries, in which case you'd set up a
 dedicated recursive DNS server just for MTA/SA use, and the rest of your
 network would continue to use your forwarding server.


That shouldn't be too difficult to implement.


Nope.

--
 John Hardin KA7OHZ    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174    pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
   "A well educated Electorate, being necessary to the liberty of a
free State, the Right of the People to Keep and Read Books,
shall not be infringed."
  ...means only registered voters can read books, and only those books
  obtained with State permission from State-controlled bookstores?
---
 847 days since the first successful private support mission to ISS (SpaceX)


Re: Spam messages autolearned as ham

2014-09-25 Thread Reindl Harald


On 25.09.2014 at 19:44, Deeztek Support wrote:
> On 9/25/2014 1:25 PM, John Hardin wrote:.
>>
>> While your Postfix may be doing DNS blocklist checks on the sending MTA,
>> I sincerely doubt that Postfix is parsing message bodies to pull out URI
>> domains and checking them. That's what URIBL is.
> 
> Is there a place to configure the URIBLs that SA uses or is it just built-in?

built-in unless you disable/override them

just grep the .cf files in the update folder for URI

>> Also, even if Postfix *was* doing that, the "URIBL_BLOCKED" rule hit
>> indicates a local configuration that would likely also be affecting
>> Postfix. So, yes, Postfix *might* be doing URIBL lookups, but if it is
>> it's probably also getting the BLOCKED result.
> 
> Actually that's not happening at all. None of the lists we are using are 
> blocking us.

by luck, or you don't know it, because most just respond
with a special code instead of the expected 127.0.0.x,
and not all day long

>> If you're running your own DNS server, it's apparently set to forward to
>> a large upstream DNS server that's aggregating other queries with yours
>> (i.e. a standard DNS setup). "URIBL_BLOCKED" means the DNS server that's
>> actually hitting the URIBL server (your upstream) has exceeded the
>> "free" query limit.
>
> You are right it is using an upstream server (opendns.com)

that is plain wrong for an MTA -
do the recursion on your own

>> You might not want to switch your DNS to be recursive rather than
>> forwarding for *all* your queries, in which case you'd set up a
>> dedicated recursive DNS server just for MTA/SA use, and the rest of your
>> network would continue to use your forwarding server.
>>
> 
> That shouldn't be too difficult to implement

the better way would be to have *two* recursive servers in your
own network and use them - nothing is easier than combining
recursion and your own zones





Re: Spam messages autolearned as ham

2014-09-25 Thread Deeztek Support

On 9/25/2014 1:25 PM, John Hardin wrote:.
>
> While your Postfix may be doing DNS blocklist checks on the sending MTA,
> I sincerely doubt that Postfix is parsing message bodies to pull out URI
> domains and checking them. That's what URIBL is.

Is there a place to configure the URIBLs that SA uses or is it just built-in?

>
> Also, even if Postfix *was* doing that, the "URIBL_BLOCKED" rule hit
> indicates a local configuration that would likely also be affecting
> Postfix. So, yes, Postfix *might* be doing URIBL lookups, but if it is
> it's probably also getting the BLOCKED result.
>

Actually that's not happening at all. None of the lists we are using are 
blocking us.


>
> If you're running your own DNS server, it's apparently set to forward to
> a large upstream DNS server that's aggregating other queries with yours
> (i.e. a standard DNS setup). "URIBL_BLOCKED" means the DNS server that's
> actually hitting the URIBL server (your upstream) has exceeded the
> "free" query limit.
>

You are right it is using an upstream server (opendns.com)

> You might not want to switch your DNS to be recursive rather than
> forwarding for *all* your queries, in which case you'd set up a
> dedicated recursive DNS server just for MTA/SA use, and the rest of your
> network would continue to use your forwarding server.
>

That shouldn't be too difficult to implement.


Re: Spam messages autolearned as ham

2014-09-25 Thread John Hardin

On Thu, 25 Sep 2014, Amir Caspi wrote:


On Sep 25, 2014, at 8:51 AM, John Hardin  wrote:


You *did* keep your initial Bayes training corpora, right?


Does it matter if you keep the initial corpora, or just that you train on known corpora, 
even if they are "fluid?"


The "properly classified" part is critical.

If you have clean but fluid corpora that's ok. The point is, don't rely on 
autolearn by itself.


--
 John Hardin KA7OHZ    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174    pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
   "A well educated Electorate, being necessary to the liberty of a
free State, the Right of the People to Keep and Read Books,
shall not be infringed."
  ...means only registered voters can read books, and only those books
  obtained with State permission from State-controlled bookstores?
---
 847 days since the first successful private support mission to ISS (SpaceX)


Re: Spam messages autolearned as ham

2014-09-25 Thread John Hardin

On Thu, 25 Sep 2014, Deeztek Support wrote:


 as already suggested by John Hardin, fix URIBL_BLOCKED=0.001

 "Also: URIBL_BLOCKED - you really want to set up a local recursive
 (non-forwarding) DNS server for SA so that your URIBL lookups will work,
 that might help a lot. "


I can certainly try that, however seeing that I'm implementing block lists on 
the postfix level, wouldn't that double the lookups? And as an FYI, I'm 
running my own DNS server.


While your Postfix may be doing DNS blocklist checks on the sending MTA, I 
sincerely doubt that Postfix is parsing message bodies to pull out URI 
domains and checking them. That's what URIBL is.


Also, even if Postfix *was* doing that, the "URIBL_BLOCKED" rule hit 
indicates a local configuration that would likely also be affecting 
Postfix. So, yes, Postfix *might* be doing URIBL lookups, but if it is 
it's probably also getting the BLOCKED result.



If you're running your own DNS server, it's apparently set to forward to a 
large upstream DNS server that's aggregating other queries with yours 
(i.e. a standard DNS setup). "URIBL_BLOCKED" means the DNS server that's 
actually hitting the URIBL server (your upstream) has exceeded the "free" 
query limit.


You might not want to switch your DNS to be recursive rather than 
forwarding for *all* your queries, in which case you'd set up a dedicated 
recursive DNS server just for MTA/SA use, and the rest of your network 
would continue to use your forwarding server.


--
 John Hardin KA7OHZ    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174    pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  A good high-school education is still essential, and
  college is where you go to get one. -- MiddleAgedKen
---
 847 days since the first successful private support mission to ISS (SpaceX)


Re: Spam messages autolearned as ham

2014-09-25 Thread Axb

On 09/25/2014 05:24 PM, Amir Caspi wrote:

On Sep 25, 2014, at 8:51 AM, John Hardin  wrote:


You *did* keep your initial Bayes training corpora, right?


Does it matter if you keep the initial corpora, or just that you train on known corpora, 
even if they are "fluid?"


imo, fresh spam is the best spam.
old spam tokens may not work on fresh spam.
ham age is not as critical.

Nowadays, we tend to reject most good fodder with all kinds of methods 
at SMTP level and what's left is often hardly enough to keep a bayes DB 
well fed.


A separate trap box/vm with a domain (or more) which takes all it gets 
(no rejects, no filtering) makes a great source of spam fodder.


With a few tricks you can even auto feed bayes to a shared SQL/Redis 
backend giving you nice fresh spam tokens with minimal intervention.





Re: Spam messages autolearned as ham

2014-09-25 Thread Reindl Harald

On 25.09.2014 at 17:24, Amir Caspi wrote:
> On Sep 25, 2014, at 8:51 AM, John Hardin  wrote:
>>
>> You *did* keep your initial Bayes training corpora, right?
> 
> Does it matter if you keep the initial corpora, or just that you train on 
> known corpora, even if they are "fluid?"

yes, because you can remove questionable messages, reset the bayes and start 
again

my training data is two folders of eml messages, and if it turns
out that bayes no longer works well, a possible reason is
that you have too many neutralized tokens

since all eml files are named "date-number.eml" i could try to
move the oldest year out of the folder, reset and rebuild within
seconds

well, and you can do a full-text search if you have a clue which
messages should not have been trained, and rebuild the same way
after removing them - recently i noticed that brand-new messages,
trained as HAM to keep them from being marked as spam for a specific
user, let 3 clear spam messages through to myself within the next
10 minutes - files deleted, rebuilt, fine
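
The "move the oldest year out" step could be sketched in Python roughly as
follows. This is only an illustration: it assumes the date part of
"date-number.eml" starts with a four-digit year (for example
2013-11-02-1.eml), which the post does not actually specify, and the function
name split_oldest_year is made up.

```python
from typing import Iterable, List, Tuple


def split_oldest_year(filenames: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split a corpus listing into (keep, retire) by the oldest year prefix.

    Assumes names like 'YYYY-...-number.eml'; anything not ending in .eml
    is ignored.
    """
    names = sorted(n for n in filenames if n.endswith(".eml"))
    if not names:
        return [], []
    oldest_year = min(n[:4] for n in names)
    keep = [n for n in names if n[:4] != oldest_year]
    retire = [n for n in names if n[:4] == oldest_year]
    return keep, retire


corpus = ["2013-11-02-1.eml", "2014-01-05-7.eml", "2012-06-30-3.eml"]
keep, retire = split_oldest_year(corpus)
print("retrain on:", keep)
print("move away:", retire)
```

After moving the retired files elsewhere, the reset and rebuild would be the
usual sa-learn --clear followed by sa-learn --spam and sa-learn --ham on the
two folders.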

what i would never like to do is reset bayes and
start from zero, since i watched the filter quality
improve dramatically going from 200 to 1000 and
currently 1500 spam/ham examples





Re: Spam messages autolearned as ham

2014-09-25 Thread Amir Caspi
On Sep 25, 2014, at 8:51 AM, John Hardin  wrote:
> 
> You *did* keep your initial Bayes training corpora, right?

Does it matter if you keep the initial corpora, or just that you train on known 
corpora, even if they are "fluid?"

--- Amir
thumbed via iPhone



Re: Spam messages autolearned as ham

2014-09-25 Thread Reindl Harald
On 25.09.2014 at 17:06, Deeztek Support wrote:
> I can certainly try that, however seeing that I'm implementing 
> block lists on the postfix level, wouldn't that double the lookups?

first: if postscreen/postfix rejects based on the RBL score,
the message doesn't make it to SA at all, and with a properly
configured postscreen not even to smtpd

second: that's the reason for a local resolver: caching

third: URI blacklists are hardly the same request

> And as an FYI, I'm running my own DNS server

but it must not forward to another DNS like your ISP's
or Google 8.8.8.8 - it has to do *recursion* so that
the sum of your DNS requests and those of other users does
not show up as a lot of requests from the same IP at the SOA

forwarding resolvers on a mailserver are generally a bad idea

* if your ISP fucks up and no longer responds with NXDOMAIN
  because it tries to redirect web surfers to one of its pages,
  your mail services are in real danger

* most open resolvers are unstable, and if one doesn't respond
  properly from time to time, mail that would otherwise be
  blocked by DNSBL/URIBL slips through the filters

* in the worst case you make a lot more DNS requests
  to the WAN, because you get the remaining TTL from the
  resolver, which may be just short of expiring - if you ask
  the SOA yourself you always get the full TTL





Re: Spam messages autolearned as ham

2014-09-25 Thread Axb

On 09/25/2014 05:06 PM, Deeztek Support wrote:

as already suggested by John Hardin, fix URIBL_BLOCKED=0.001

"Also: URIBL_BLOCKED - you really want to set up a local recursive
(non-forwarding) DNS server for SA so that your URIBL lookups will work,
that might help a lot. "



I can certainly try that, however seeing that I'm implementing block
lists on the postfix level, wouldn't that double the lookups? And as an
FYI, I'm running my own DNS server.


Postfix doesn't do msg body URI lookups (unless you use a third party 
milter/filter)


URIBL_BLOCKED only shows up when you're using a recursor which is being 
blocked for hammering URIBL's public mirrors.


if your recursor is forwarding traffic to a third party's recursor, then 
you should remove the forward.




Re: Spam messages autolearned as ham

2014-09-25 Thread Deeztek Support

On 9/25/2014 10:51 AM, John Hardin wrote:



If BAYES_00 hits on a spam, that indicates training issues.


I understand.



Since you're reporting problems with autolearn, that's not at all
surprising. Your bayes database is probably polluted.

You are probably going to have to wipe and retrain your bayes database
from scratch using known-good (i.e. hand classified) corpora. I also
suggest turning off autolearn.


I wiped it.


You *did* keep your initial Bayes training corpora, right?



I have an account that I have used to sign up for everything under the 
sun over the past 10 years. It's a goldmine for spam. I figured I'd use 
that to train the Bayes.


Re: Spam messages autolearned as ham

2014-09-25 Thread Deeztek Support




thoughts? you changed a score and SA did what you told it to.



I changed it as per the suggestion of Matus UHLAR - fantomas


What are you trying to achieve (other than using the SA list as your
changes log)


Is that a trick question? I'm trying to ensure that spam messages are 
indeed tagged as such. I have a question of my own, is it a requirement 
for you to be an ass or does it just come naturally to you? If you have 
nothing to contribute besides accusing someone of stupid shit, here's a 
thought, don't contribute. I certainly wouldn't mind at all.




as already suggested by John Hardin, fix URIBL_BLOCKED=0.001

"Also: URIBL_BLOCKED - you really want to set up a local recursive
(non-forwarding) DNS server for SA so that your URIBL lookups will work,
that might help a lot. "


I can certainly try that, however seeing that I'm implementing block 
lists on the postfix level, wouldn't that double the lookups? And as an 
FYI, I'm running my own DNS server.


Thanks for your invaluable input!


Re: Spam messages autolearned as ham

2014-09-25 Thread John Hardin

On Thu, 25 Sep 2014, Deeztek Support wrote:


On 9/25/2014 9:26 AM, Deeztek Support wrote:

 On 9/25/2014 6:31 AM, Matus UHLAR - fantomas wrote:
>  On 24.09.14 14:03, Deeztek Support wrote:
> >  score BAYES_00 0.000
> 
>  why 0? current is -1.5 without and -1.9 with network checks...


 Do you mean that the default is supposed to be -1.5 without network
 tests and -1.9 with network tests?


I went ahead and set BAYES_00 to -1.9 and I just received a spam message with 
these headers:


X-Spam-Status: No, score=0.204 tagged_above=-999 required=0.6
 tests=[BAYES_00=-1.9, DCC_CHECK=1.1, FROM_STARTS_WITH_NUMS=0.738,
 RP_MATCHES_RCVD=-0.735, URIBL_BLOCKED=0.001, URI_OPTOUT_3LD=1]
 autolearn=disabled

From looking at it, it looks like BAYES_00 took away 1.9 points, which made 
the difference in whether or not it got tagged as spam. I don't think -1.9 
is the correct setting here. Any thoughts?


Having a negative score for BAYES_00 is the standard.

If BAYES_00 hits on a spam, that indicates training issues.
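
As a rough illustration of why that single hit matters, the sketch below (plain Python, with the values copied from the header quoted above) adds up the rule scores with and without BAYES_00. It only redoes the arithmetic and assumes nothing else would change if Bayes had not fired; the variable names are made up for this example.

```python
# Rule hits copied from the X-Spam-Status header quoted above.
tests = {
    "BAYES_00": -1.9,
    "DCC_CHECK": 1.1,
    "FROM_STARTS_WITH_NUMS": 0.738,
    "RP_MATCHES_RCVD": -0.735,
    "URIBL_BLOCKED": 0.001,
    "URI_OPTOUT_3LD": 1.0,
}
REQUIRED = 0.6  # required= value from the same header

# Round to three decimals to hide floating-point noise.
total = round(sum(tests.values()), 3)
without_bayes = round(total - tests["BAYES_00"], 3)

print(f"total={total} tagged={total >= REQUIRED}")
print(f"without BAYES_00={without_bayes} tagged={without_bayes >= REQUIRED}")
```

The other rules add up to well over the required score on their own, so the message was missed only because Bayes was confident it was ham. That points at the training data rather than at the -1.9 score.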

Since you're reporting problems with autolearn, that's not at all 
surprising. Your bayes database is probably polluted.


You are probably going to have to wipe and retrain your bayes database 
from scratch using known-good (i.e. hand classified) corpora. I also 
suggest turning off autolearn.


You *did* keep your initial Bayes training corpora, right?
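
A minimal sketch of the wipe-and-retrain sequence, written as a Python helper that only builds the sa-learn command lines and prints them for review. The corpus paths are hypothetical, and it is an assumption that you run the commands as whichever user owns the Bayes database (with amavis, probably the amavis user). Turning autolearn off is a separate step: set bayes_auto_learn 0 in local.cf.

```python
import shlex


def retrain_commands(ham_dir, spam_dir):
    """Build (but do not run) the sa-learn calls for a wipe-and-retrain."""
    return [
        ["sa-learn", "--clear"],           # wipe the existing Bayes database
        ["sa-learn", "--ham", ham_dir],    # learn the hand-classified ham
        ["sa-learn", "--spam", spam_dir],  # learn the hand-classified spam
        ["sa-learn", "--dump", "magic"],   # show the nham/nspam counters afterwards
    ]


# Hypothetical corpus locations; replace with your own reviewed folders.
for cmd in retrain_commands("corpus/ham", "corpus/spam"):
    print(shlex.join(cmd))
```

Printing instead of executing is deliberate: --clear is destructive, so it is worth reading the commands before pasting them into a shell.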

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  The yardstick you should use when considering whether to support a
  given piece of legislation is "what if my worst enemy is chosen to
  administer this law?"
---
 847 days since the first successful private support mission to ISS (SpaceX)


Re: Spam messages autolearned as ham

2014-09-25 Thread John Hardin

On Thu, 25 Sep 2014, Axb wrote:


On 09/25/2014 03:26 PM, Deeztek Support wrote:

 On 9/25/2014 6:31 AM, Matus UHLAR - fantomas wrote:
>  On 24.09.14 14:03, Deeztek Support wrote:
> >  score BAYES_00 0.000
> 
>  why 0? current is -1.5 without and -1.9 with network checks...


 Do you mean that the default is supposed to be -1.5 without network
 tests and -1.9 with network tests?

> >  X-Spam-Status: No, score=0.257 tagged_above=-999 required=0.6
> >  tests=[BAYES_50=0.8, DKIM_SIGNED=0.1, HTML_MESSAGE=0.001,
> >  RP_MATCHES_RCVD=-0.653, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001,
> >  T_DKIM_INVALID=0.01, URIBL_BLOCKED=0.001] autolearn=ham
> 
>  I recommend you clear the score for RP_MATCHES_RCVD... apparently too
>  many FNs, as you can see here

 How would I go about clearing out the RP_MATCHES_RCVD score?


If you disable the rule it will probably mess up a bunch of metas so the best 
may be to set


score RP_MATCHES_RCVD 0.001


Or, as it is a "nice" rule,

  score RP_MATCHES_RCVD -0.001



in local.cf and restart amavis
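
To see what that override would have done to the message quoted above, here is a small Python sketch that recomputes the total with RP_MATCHES_RCVD set to -0.001. It is only arithmetic on the numbers in that header, and it assumes every other hit stays the same; rescore is a made-up helper name.

```python
def rescore(tests, overrides):
    """Return the message total after replacing per-rule scores with overrides."""
    merged = {name: overrides.get(name, score) for name, score in tests.items()}
    return round(sum(merged.values()), 3)


# Hits from the header quoted above (score=0.257, required=0.6).
tests = {
    "BAYES_50": 0.8, "DKIM_SIGNED": 0.1, "HTML_MESSAGE": 0.001,
    "RP_MATCHES_RCVD": -0.653, "SPF_HELO_PASS": -0.001, "SPF_PASS": -0.001,
    "T_DKIM_INVALID": 0.01, "URIBL_BLOCKED": 0.001,
}
REQUIRED = 0.6

before = rescore(tests, {})
after = rescore(tests, {"RP_MATCHES_RCVD": -0.001})

print(f"before={before} tagged={before >= REQUIRED}")
print(f"after={after} tagged={after >= REQUIRED}")
```

With the rule neutralized, the same message would have scored above the required 0.6 and been tagged.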



--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  The yardstick you should use when considering whether to support a
  given piece of legislation is "what if my worst enemy is chosen to
  administer this law?"
---
 847 days since the first successful private support mission to ISS (SpaceX)


Re: Spam messages autolearned as ham

2014-09-25 Thread Axb

On 09/25/2014 04:02 PM, Deeztek Support wrote:

I went ahead and set BAYES_00 to -1.9 and I just received a spam message
with these headers:

X-Spam-Status: No, score=0.204 tagged_above=-999 required=0.6
 tests=[BAYES_00=-1.9, DCC_CHECK=1.1, FROM_STARTS_WITH_NUMS=0.738,
 RP_MATCHES_RCVD=-0.735, URIBL_BLOCKED=0.001, URI_OPTOUT_3LD=1]
 autolearn=disabled

 From looking at it, it looks like BAYES_00 took away 1.9 points, which
made the difference in whether or not it got tagged as spam. I
don't think -1.9 is the correct setting here. Any thoughts?


thoughts? you changed a score and SA did what you told it to.

What are you trying to achieve (other than using the SA list as your 
changes log)


as already suggested by John Hardin, fix URIBL_BLOCKED=0.001

"Also: URIBL_BLOCKED - you really want to set up a local recursive 
(non-forwarding) DNS server for SA so that your URIBL lookups will work, 
that might help a lot. "





Re: Spam messages autolearned as ham

2014-09-25 Thread Reindl Harald

Am 25.09.2014 um 16:02 schrieb Deeztek Support:
> On 9/25/2014 9:26 AM, Deeztek Support wrote:
>> On 9/25/2014 6:31 AM, Matus UHLAR - fantomas wrote:
>>  > On 24.09.14 14:03, Deeztek Support wrote:
>>  >> score BAYES_00 0.000
>>  >
>>  > why 0? current is -1.5 without and -1.9 with network checks...
>>
>> Do you mean that the default is supposed to be -1.5 without network
>> tests and -1.9 with network tests?
> 
> I went ahead and set BAYES_00 to -1.9 and I just received a spam message with 
> these headers:
> 
> X-Spam-Status: No, score=0.204 tagged_above=-999 required=0.6
> tests=[BAYES_00=-1.9, DCC_CHECK=1.1, FROM_STARTS_WITH_NUMS=0.738,
> RP_MATCHES_RCVD=-0.735, URIBL_BLOCKED=0.001, URI_OPTOUT_3LD=1]
> autolearn=disabled
> 
> From looking at it, it looks like BAYES_00 took away 1.9 points, which made
> the difference in whether or not it got tagged as spam. I don't think -1.9
> is the correct setting here. Any thoughts?

train your Bayes better; I have around 1500 ham and 1500 spam
messages classified with no autolearning at all, sa-milter
rejects with a score above 8, and the Bayes is nearly error
free

that's why BAYES_99 goes together with score BAYES_999 0.5, since
that happens only very rarely for legit mail, and those messages
mostly have whitelists, SPF, or DKIM to keep the result below 8

score BAYES_00 -2.5
score BAYES_05 -0.7
score BAYES_20 -0.06
score BAYES_40 -0.03
score BAYES_50 2.0
score BAYES_60 3.0
score BAYES_80 3.7
score BAYES_95 5.8
score BAYES_99 7.5
score BAYES_999 0.5


[sa-milt@mail-gw:~/training]$ ls ham/ | wc -l
1746
[sa-milt@mail-gw:~/training]$ ls spam/ | wc -l
1712
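
The same count can be scripted. The Python sketch below assumes the layout shown above (one message per file under ham/ and spam/) and also checks each class against 200, which is SpamAssassin's default for bayes_min_ham_num and bayes_min_spam_num; the function names are made up for this example.

```python
from pathlib import Path

MIN_MESSAGES = 200  # default bayes_min_ham_num / bayes_min_spam_num


def corpus_counts(root):
    """Count regular files under root/ham and root/spam (one message per file)."""
    counts = {}
    for kind in ("ham", "spam"):
        folder = Path(root) / kind
        # A missing folder simply counts as zero messages.
        counts[kind] = (
            sum(1 for p in folder.iterdir() if p.is_file()) if folder.is_dir() else 0
        )
    return counts


def enough_to_score(counts, minimum=MIN_MESSAGES):
    """True once both classes reach the minimum Bayes needs before it scores mail."""
    return all(n >= minimum for n in counts.values())


counts = corpus_counts(".")  # run from the training directory
print(counts, "enough" if enough_to_score(counts) else "not enough yet")
```

Keeping the two classes roughly balanced, as in the listing above, is a separate judgment call that this check does not make.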





Re: Spam messages autolearned as ham

2014-09-25 Thread Deeztek Support

On 9/25/2014 9:26 AM, Deeztek Support wrote:

On 9/25/2014 6:31 AM, Matus UHLAR - fantomas wrote:
 > On 24.09.14 14:03, Deeztek Support wrote:
 >> score BAYES_00 0.000
 >
 > why 0? current is -1.5 without and -1.9 with network checks...

Do you mean that the default is supposed to be -1.5 without network
tests and -1.9 with network tests?


I went ahead and set BAYES_00 to -1.9 and I just received a spam message 
with these headers:


X-Spam-Status: No, score=0.204 tagged_above=-999 required=0.6
tests=[BAYES_00=-1.9, DCC_CHECK=1.1, FROM_STARTS_WITH_NUMS=0.738,
RP_MATCHES_RCVD=-0.735, URIBL_BLOCKED=0.001, URI_OPTOUT_3LD=1]
autolearn=disabled

From looking at it, it looks like BAYES_00 took away 1.9 points, which 
made the difference in whether or not it got tagged as spam. I 
don't think -1.9 is the correct setting here. Any thoughts?






Re: Spam messages autolearned as ham

2014-09-25 Thread Axb

On 09/25/2014 03:26 PM, Deeztek Support wrote:

On 9/25/2014 6:31 AM, Matus UHLAR - fantomas wrote:
 > On 24.09.14 14:03, Deeztek Support wrote:
 >> score BAYES_00 0.000
 >
 > why 0? current is -1.5 without and -1.9 with network checks...

Do you mean that the default is supposed to be -1.5 without network
tests and -1.9 with network tests?

 >> X-Spam-Status: No, score=0.257 tagged_above=-999 required=0.6
 >> tests=[BAYES_50=0.8, DKIM_SIGNED=0.1, HTML_MESSAGE=0.001,
 >> RP_MATCHES_RCVD=-0.653, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001,
 >> T_DKIM_INVALID=0.01, URIBL_BLOCKED=0.001] autolearn=ham
 >
 > I recommend you clear the score for RP_MATCHES_RCVD... apparently too
 > many FNs, as you can see here

How would I go about clearing out the RP_MATCHES_RCVD score?


If you disable the rule it will probably mess up a bunch of metas so the 
best may be to set


score RP_MATCHES_RCVD 0.001

in local.cf and restart amavis




Re: Spam messages autolearned as ham

2014-09-25 Thread Deeztek Support

On 9/25/2014 6:31 AM, Matus UHLAR - fantomas wrote:
> On 24.09.14 14:03, Deeztek Support wrote:
>> score BAYES_00 0.000
>
> why 0? current is -1.5 without and -1.9 with network checks...

Do you mean that the default is supposed to be -1.5 without network 
tests and -1.9 with network tests?


>> X-Spam-Status: No, score=0.257 tagged_above=-999 required=0.6
>> tests=[BAYES_50=0.8, DKIM_SIGNED=0.1, HTML_MESSAGE=0.001,
>> RP_MATCHES_RCVD=-0.653, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001,
>> T_DKIM_INVALID=0.01, URIBL_BLOCKED=0.001] autolearn=ham
>
> I recommend you clear the score for RP_MATCHES_RCVD... apparently too many
> FNs, as you can see here

How would I go about clearing out the RP_MATCHES_RCVD score?

>
> Btw, this is amavis output. I'm not sure if it has different logic for
> autolearning...
>

Indeed it is. I'm not aware of amavis playing a role in SA autolearning; 
does anyone have any input on this?




Re: Spam messages autolearned as ham

2014-09-25 Thread Matus UHLAR - fantomas

On 24.09.14 14:03, Deeztek Support wrote:

score BAYES_00 0.000


why 0? current is -1.5 without and -1.9 with network checks...


X-Spam-Status: No, score=0.257 tagged_above=-999 required=0.6
tests=[BAYES_50=0.8, DKIM_SIGNED=0.1, HTML_MESSAGE=0.001,
RP_MATCHES_RCVD=-0.653, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001,
T_DKIM_INVALID=0.01, URIBL_BLOCKED=0.001] autolearn=ham


I recommend you clear the score for RP_MATCHES_RCVD... apparently too many
FNs, as you can see here

Btw, this is amavis output. I'm not sure if it has different logic for
autolearning...

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
"Where do you want to go to die?" [Microsoft]


Re: Spam messages autolearned as ham

2014-09-24 Thread John Hardin

On Wed, 24 Sep 2014, Deeztek Support wrote:


Hello, I'm using the following spamassassin:

SpamAssassin version 3.3.2
  running on Perl version 5.10.1

On Ubuntu 10.04 LTS. I'm having a strange problem with messages being 
autolearned as ham even though they don't score low enough. Here's my 
local.cf config:



use_bayes 1
use_bayes_rules 1
bayes_auto_learn 1
bayes_auto_learn_threshold_nonspam -0.001
bayes_auto_learn_threshold_spam 9.000
#override bayes default scores
score BAYES_00 0.000
score BAYES_80 3.000
score BAYES_95 4.000
score BAYES_99 4.500

Here are the headers of one of the messages in question:

X-Spam-Score: 0.257
X-Spam-Level:
X-Spam-Status: No, score=0.257 tagged_above=-999 required=0.6
 tests=[BAYES_50=0.8, DKIM_SIGNED=0.1, HTML_MESSAGE=0.001,
 RP_MATCHES_RCVD=-0.653, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001,
 T_DKIM_INVALID=0.01, URIBL_BLOCKED=0.001] autolearn=ham


As you can see, it scored 0.257 but it autolearned as ham even though the 
bayes_auto_learn_threshold_nonspam is set to -0.001


Am I missing something here?


The Bayes score does not contribute to the autolearning decision. Take 0.8 
points off the raw total of 0.257 and you go negative enough to be 
autolearned.
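
To make that arithmetic concrete, here is a minimal Python sketch (an illustration, not SpamAssassin's actual code). It parses the X-Spam-Status header quoted above, drops the BAYES_* hit, and compares what is left with the bayes_auto_learn_threshold_nonspam from the local.cf. The helper names are made up for this example, and it assumes the per-rule scores shown in the header are the ones used for the decision; the real autolearn logic has further exclusions and conditions that are ignored here.

```python
import re

# X-Spam-Status value of the message in question (copied from the thread).
HEADER = (
    "No, score=0.257 tagged_above=-999 required=0.6 "
    "tests=[BAYES_50=0.8, DKIM_SIGNED=0.1, HTML_MESSAGE=0.001, "
    "RP_MATCHES_RCVD=-0.653, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001, "
    "T_DKIM_INVALID=0.01, URIBL_BLOCKED=0.001] autolearn=ham"
)

NONSPAM_THRESHOLD = -0.001  # bayes_auto_learn_threshold_nonspam in local.cf


def parse_tests(status):
    """Return {rule_name: score} from the tests=[...] part of the header."""
    body = re.search(r"tests=\[(.*?)\]", status, re.S).group(1)
    return {name: float(val) for name, val in re.findall(r"(\w+)=(-?[\d.]+)", body)}


def autolearn_score(tests):
    """Sum every hit except the BAYES_* rules (simplified, see the caveats)."""
    non_bayes = (s for name, s in tests.items() if not name.startswith("BAYES_"))
    return round(sum(non_bayes), 3)


score = autolearn_score(parse_tests(HEADER))
verdict = "learn as ham" if score < NONSPAM_THRESHOLD else "do not learn as ham"
print(score, "->", verdict)
```

The reported 0.257 includes the 0.8 from BAYES_50; without it the message sits at -0.543, which is below the -0.001 threshold, hence autolearn=ham.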


Also: URIBL_BLOCKED - you really want to set up a local recursive 
(non-forwarding) DNS server for SA so that your URIBL lookups will work, 
that might help a lot.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Maxim IV: Close air support covereth a multitude of sins.
---
 846 days since the first successful private support mission to ISS (SpaceX)


Spam messages autolearned as ham

2014-09-24 Thread Deeztek Support

Hello, I'm using the following spamassassin:

SpamAssassin version 3.3.2
  running on Perl version 5.10.1

On Ubuntu 10.04 LTS. I'm having a strange problem with messages being 
autolearned as ham even though they don't score low enough. Here's my 
local.cf config:



use_bayes 1
use_bayes_rules 1
bayes_auto_learn 1
bayes_auto_learn_threshold_nonspam -0.001
bayes_auto_learn_threshold_spam 9.000
#override bayes default scores
score BAYES_00 0.000
score BAYES_80 3.000
score BAYES_95 4.000
score BAYES_99 4.500

Here are the headers of one of the messages in question:

X-Spam-Score: 0.257
X-Spam-Level:
X-Spam-Status: No, score=0.257 tagged_above=-999 required=0.6
tests=[BAYES_50=0.8, DKIM_SIGNED=0.1, HTML_MESSAGE=0.001,
RP_MATCHES_RCVD=-0.653, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001,
T_DKIM_INVALID=0.01, URIBL_BLOCKED=0.001] autolearn=ham


As you can see, it scored 0.257 but it autolearned as ham even though 
the bayes_auto_learn_threshold_nonspam is set to -0.001


Am I missing something here?

Thanks in advance