RE: USPS Spam

2013-09-03 Thread Andrew Talbot
Just wanted to throw in my two cents here - I have spoken to USPS about this
and they said that they never send out these messages unless the client
requests them, and that it should be safe to completely block messages like
this. 

The same cannot be said for UPS and FedEx, by the way. 



> -Original Message-
> From: Matt [mailto:matt.mailingli...@gmail.com]
> Sent: Friday, August 30, 2013 4:23 PM
> To: users@spamassassin.apache.org
> Subject: USPS Spam
> 
> I am seeing tons of junk getting through claiming to be from the USPS
about a
> missed delivery package.  Anyone else seeing this?
> 
> I am running SpamAssassin 3.3.1 and execute sa-update weekly.



SUBJ_ALL_CAPS

2013-08-20 Thread Andrew Talbot
Hey all -

 

Does anybody know how long the subject needs to be to trigger SUBJ_ALL_CAPS?
I know it has to be multi-word and over a certain length; I was wondering
about the specific length. Thanks in advance :)
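
For reference, the stock test is an eval rule, so the exact threshold lives in
the Perl code rather than in a regex. A hedged local alternative with an
explicit, tunable length (name, length and score here are made up):

header   LOCAL_SUBJ_CAPS_20  Subject =~ /^(?=\S+\s+\S)[^a-z]{20,}$/
describe LOCAL_SUBJ_CAPS_20  Subject is multi-word, 20+ chars, no lowercase
score    LOCAL_SUBJ_CAPS_20  0.5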

 

 



Whitelisting subdomains?

2013-08-14 Thread Andrew Talbot
Hey, all -

 

I'm trying to whitelist all our internal subdomains but I can't seem to get
it to work.

 

We have so many of them that it's impractical to do them individually. For
instance, we have _...@logs.domain.com, @admin-sql.domain.com, etc. etc. etc.

 

I was thinking that whitelist_from *.domain.com would work, but it doesn't.

 

I can't seem to find any documentation on the net anywhere - is it even
possible to do this? 
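
For what it's worth, whitelist entries are file-glob patterns, so something
along these lines may be what's needed (untested; domain.com and the relay
domain are placeholders, and the _rcvd form is safer because it also checks
the sending relay's reverse DNS):

whitelist_from       *@*.domain.com
whitelist_from_rcvd  *@*.domain.com  domain.com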

 

 



RE: PayPal spam filter?

2013-06-27 Thread Andrew Talbot
I just had to weigh in here to say that we have DCC_CHECK scored up to 4, and
all of these kinds of spam messages get caught by that because they always hit
at least another 1 point worth of rules.

Also, those two rules require plugins (SPF and DKIM), I believe.



> -Original Message-
> From: Juerg Reimann [mailto:j...@jworld.ch]
> Sent: Wednesday, June 26, 2013 6:42 PM
> To: users@spamassassin.apache.org
> Cc: 'Benny Pedersen'
> Subject: RE: PayPal spam filter?
> 
> Hi Benny
> 
> Thanks for your tip. Could you elaborate on this a bit? First of all, rules
> with the names SPF_DID_NOT_PASS or DKIM_DID_NOT_PASS do not seem to exist.
> How and where would I configure this?
> 
> Thanks,
> Juerg
> 
> > -Original Message-
> > From: Benny Pedersen [mailto:m...@junc.eu]
> > Sent: Wednesday, June 12, 2013 9:38 PM
> > To: users@spamassassin.apache.org
> > Subject: Re: PayPal spam filter?
> >
> > Juerg Reimann skrev den 2013-06-12 21:30:
> >
> > > Is there a filter to block PayPal phishing mails, i.e. everything
> > > that claims to come from PayPal but is not?
> >
> > meta SPF_DID_NOT_PASS (!SPF_PASS)
> >
> > simple ? :=)
> >
> > if paypal do use dkim then it could be checked with
> >
> > meta DKIM_DID_NOT_PASS (!DKIM_VALID_AU)
> >
> > phishing emails seldom pass these 2 tests
> >
> > --
> > senders that put my email into body content will deliver it to my own
> > trashcan, so if you like to get reply, dont do it
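
A hedged sketch combining the ideas quoted above into one scored rule: only
penalize mail whose From address claims paypal.com yet passes neither SPF nor
DKIM. The rule names are local inventions; SPF_PASS and DKIM_VALID_AU come
from the SPF and DKIM plugins, which must be enabled.

header   __LOCAL_FROM_PAYPAL  From:addr =~ /\@paypal\.com$/i
meta     LOCAL_PAYPAL_FORGED  (__LOCAL_FROM_PAYPAL && !SPF_PASS && !DKIM_VALID_AU)
describe LOCAL_PAYPAL_FORGED  Claims paypal.com but passes neither SPF nor DKIM
score    LOCAL_PAYPAL_FORGED  4.0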




RE: "Chain" rules?

2013-06-24 Thread Andrew Talbot
This is what I was wondering. We don't want to have to run a
computationally-expensive body rule unless we need to. No choice though, I
guess. Thanks for your help!


> -Original Message-
> From: John Hardin [mailto:jhar...@impsec.org]
> Sent: Monday, June 24, 2013 1:20 PM
> To: users@spamassassin.apache.org
> Subject: Re: "Chain" rules?
> 
> On Mon, 24 Jun 2013, Andrew Talbot wrote:
> 
> > Is there a way to "chain" rules together such that one rule will only
> > fire if another is hit?
> >
> > Specifically, we have a client that is getting hit with a bunch of
> > messages that are just links, but the links contain sex words. We want
> > to do a body scan for a list of sex words if and only if the "body
> > contains only a link" rule we have is triggered.
> >
> > I tried to get this to work with meta rules but it seems like it won't
> > do it. Is there currently a way to do this sort of conditional check?
> 
> Unfortunately you can't control whether or not a rule is *executed*, you
> can only control whether or not it contributes to the message's overall
> score.
> 
> --
>   John Hardin KA7OHZhttp://www.impsec.org/~jhardin/
>   jhar...@impsec.orgFALaholic #11174 pgpk -a jhar...@impsec.org
>   key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
> ---
>Look at the people at the top of both efforts. Linus Torvalds is a
>university graduate with a CS degree. Bill Gates is a university
>dropout who bragged about dumpster-diving and using other peoples'
>garbage code as the basis for his code. Maybe that has something to
>do with the difference in quality/security between Linux and
>Windows.   -- anytwofiveelevenis on Y! SCOX
> ---
>   10 days until the 237th anniversary of the Declaration of Independence
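
A sketch of the meta approach being discussed (untested; rule names and word
lists are placeholders, and the "only a link" pattern is approximate since
body rules see the rendered text in chunks). Both sub-rules still run on
every message, per John's point, but only the combined meta adds points:

body     __LOCAL_LINK_ONLY    /^\s*<?https?:\/\/\S+>?\s*$/
body     __LOCAL_SEX_WORDS    /\b(word1|word2|word3)\b/i
meta     LOCAL_SEXY_LINK_ONLY (__LOCAL_LINK_ONLY && __LOCAL_SEX_WORDS)
describe LOCAL_SEXY_LINK_ONLY Body is only a link and the link contains listed words
score    LOCAL_SEXY_LINK_ONLY 3.0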



"Chain" rules?

2013-06-24 Thread Andrew Talbot
Hey all -

 

Is there a way to "chain" rules together such that one rule will only fire
if another is hit? 

 

Specifically, we have a client that is getting hit with a bunch of messages
that are just links, but the links contain sex words. We want to do a body
scan for a list of sex words if and only if the "body contains only a link"
rule we have is triggered. 

 

I tried to get this to work with meta rules but it seems like it won't do
it. Is there currently a way to do this sort of conditional check? 

 

 



RE: Rule to scan for .html attachments?

2013-05-31 Thread Andrew Talbot
Hi, Martin -

Thank you for your response. The original test was using a file arbitrarily 
named aa.html .. It still doesn't work with the rewrite you provided :/ 





> -Original Message-
> From: Martin Gregorie [mailto:mar...@gregorie.org]
> Sent: Friday, May 31, 2013 3:38 PM
> To: users@spamassassin.apache.org
> Subject: Re: Rule to scan for .html attachments?
> 
> On Fri, 2013-05-31 at 14:45 -0400, Andrew Talbot wrote:
> > I need it to fire on any HTML attachment. The modules are enabled. I
> > can get it to pick up text/html, remember, but the problem is that it
> > detects messages sent as HTML when it's set up like that. It doesn't
> > detect plain-text messages, but it will flag plain-text messages with
> > HTML files attached.
> >
> Well, that's exactly what your second rule won't do: it will only fire on the
> header of an html attachment for a file that has one of a very restricted set
> of filenames. As you haven't posted any example MIME header sets I can
> only guess, but my guess is that none of the messages you've tried it against
> have attachments with names that match the restriction.
> 
> As I said before the rule can't work with the '^' in place, because that says
> that the 'filename=' string must be at the beginning of a line and NOT
> preceded by any white space. That's a harmful restriction because you never
> see MIME headers like that. With the '^' removed the rule
> becomes:
> 
> header HTML_ATTACH_RULE_2 Content-Disposition =~ /filename\=\"[a-z]{2}\.html\"/i
> 
> which has a better chance of working. This version will only fire if the
> filename associated with the attachment has precisely two alphabetic
> characters plus a .html extension, i.e. it will fire on filename="aa.html" or
> filename="ZZ.HTML" because the trailing 'i' makes it a caseless match, but it
> won't fire on filename="cat.html"
> or filename="x.html" because these don't have two character names and it
> won't fire if the attachment follows the common Windows convention of
> using a .htm extension.
> 
> If you want the rule to fire on *any* HTML attachment it should be:
> 
> header HTML_ATTACH_RULE_2 Content-Disposition =~ /filename\=\".{0,30}\.html{0,1}\"/i
> 
> which will match any filename with a .html or .htm extension (including
> ".html" and ".htm").
> 
> Could I respectfully suggest that you learn about Perl regular expressions
> before you try writing any more SA rules? SA rules are all based on using the
> Perl flavour of regular expressions to match character strings in headers and
> the message body.
> 
> You could do a lot worse than getting a copy of "Programming Perl" by Larry
> Wall, Tom Christiansen & Jon Orwant, published by O'Reilly. If there isn't
> one in the firm's technical library, they should be willing to buy a copy.
> It's a brick of a book, but you only need to read "Chapter 5: Pattern
> Matching" to write SA rules, and in any case the rest of its contents will
> come in handy in future if anybody needs to write Perl programs or SA
> extension modules.
> 
> 
> Martin
> 
> 
> 
> 
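
One possible reason the plain header rules above never fire: a "header" test
only examines the top-level message headers, while an attachment's
Content-Disposition sits in the MIME part headers, which the MIMEHeader
plugin scans instead. A hedged, untested sketch (the plugin is normally
enabled in v320.pre; name and score are placeholders):

mimeheader LOCAL_HTML_ATTACH  Content-Disposition =~ /attachment.{0,60}filename=.{0,80}\.html?\b/i
describe   LOCAL_HTML_ATTACH  A MIME part is an attached .htm/.html file
score      LOCAL_HTML_ATTACH  2.0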




RE: Rule to scan for .html attachments?

2013-05-31 Thread Andrew Talbot
I need it to fire on any HTML attachment. The modules are enabled. I can get it 
to pick up text/html, remember, but the problem is that it detects messages 
sent as HTML when it's set up like that. It doesn't detect plain-text messages, 
but it will flag plain-text messages with HTML files attached. 


> -Original Message-
> From: Martin Gregorie [mailto:mar...@gregorie.org]
> Sent: Friday, May 31, 2013 2:35 PM
> To: users@spamassassin.apache.org
> Subject: Re: Rule to scan for .html attachments?
> 
> On Fri, 2013-05-31 at 14:10 -0400, Andrew Talbot wrote:
> > That didn't work :(
> >
> Can you post one or two examples of actual MIME attachment headers that
> you're trying to get the rule to fire on?
> 
> Obvious question, but have you enabled the MIME header module?
> I'm using MimeMagic and enabling it requires that MimeMagic.pm and
> MimeMagic.cf be included in /etc/mail/spamassassin (or wherever you have
> told SA to look for its configuration, etc.).
> 
> 
> Martin
> 
> 




RE: Rule to scan for .html attachments?

2013-05-31 Thread Andrew Talbot
That's what I was afraid of. We generally avoid those kinds of rules since
we are scanning millions of messages a day. 

> -Original Message-
> From: David F. Skoll [mailto:d...@roaringpenguin.com]
> Sent: Friday, May 31, 2013 2:22 PM
> To: users@spamassassin.apache.org
> Subject: Re: Rule to scan for .html attachments?
> 
> On Fri, 31 May 2013 14:10:36 -0400
> Andrew Talbot  wrote:
> 
> > That didn't work :(
> 
> What didn't work?  Oh... you top-posted.
> 
> Anyway... you might need a "full" rule, which can be expensive.
> Something like:
> 
> full HTML_RULE /Content-Disposition:.{0,50}name\s{0,2}=\s{0,2}\"?.{0,50}\.html?/i
> 
> Completely untested, of course! :)
> 
> Regards,
> 
> David.



Re: Rule to scan for .html attachments?

2013-05-31 Thread Andrew Talbot
Didn't work with mime_header (or mimeheader) with either rule.


On Fri, May 31, 2013 at 12:23 PM, Axb  wrote:

> On 05/31/2013 05:51 PM, Andrew Talbot wrote:
>
>> Hey all -
>>
>> I'm trying to set up a custom rule that scores HTML attachments.
>>
>> The problem I'm running across is that using a rule like this one:
>> mimeheader HTML_ATTACH Content-Type =~ /^text\/html/i
>>
>> Will flag all messages that come in as HTML (vs. plain text).
>>
>> I found this :
>> header HTML_ATTACH_RULE_2 Content-Disposition =~
>> /^filename\=\"[a-z]{2}\.html\"/i
>>
>> But that doesn't ... Work ... At all.
>>
>>
>> Any suggestions? Is this even possible?
>>
>>
> use mime_header instead of header
>


Re: Rule to scan for .html attachments?

2013-05-31 Thread Andrew Talbot
That didn't work :(



On Fri, May 31, 2013 at 12:40 PM, Martin Gregorie wrote:

> On Fri, 2013-05-31 at 11:51 -0400, Andrew Talbot wrote:
> > I'm trying to set up a custom rule that scores HTML attachments.
> >
> ..snippage..
>
> > I found this :
> > header HTML_ATTACH_RULE_2 Content-Disposition =~
> > /^filename\=\"[a-z]{2}\.html\"/i
> >
> Don't anchor it to the start of the line, i.e. try this:
>
> header HTML_RULE Content-Disposition =~ /filename\=\"[a-z]{2}\.html\"/i
>
> I have a very similar rule for matching ZIP file attachments whose name
> is xx.zip which works as expected. The only significant difference from
> your rule is that it doesn't use the '^' BOL anchor symbol. My guess is
> that SA's body text parser converts the MIME header into one line, so
> requiring 'filename' to be at the start of the line will always fail.
>
>
> Martin
>
>
>
>


Rule to scan for .html attachments?

2013-05-31 Thread Andrew Talbot
Hey all -

I'm trying to set up a custom rule that scores HTML attachments.

The problem I'm running across is that using a rule like this one:
mimeheader HTML_ATTACH Content-Type =~ /^text\/html/i

Will flag all messages that come in as HTML (vs. plain text).

I found this :
header HTML_ATTACH_RULE_2 Content-Disposition =~
/^filename\=\"[a-z]{2}\.html\"/i

But that doesn't ... Work ... At all.


Any suggestions? Is this even possible?


Re: Bayes + DCC / Bayes as a false-positive killer

2013-05-29 Thread Andrew Talbot
Hi, Matus -

I wanted to ask about your last point regarding the BAYES_9x FPs and the
BAYES_0x FNs, mostly because it seems to contradict the sentence that
follows (that you don't consider it to be 100%). If there are no FNs or FPs,
it's about as good as it gets, no?


On Wed, May 29, 2013 at 3:13 AM, Matus UHLAR - fantomas
wrote:

> On 28.05.13 16:43, Andrew Talbot wrote:
>
>> That said, I'm wondering if it's redundant to run DCC and Bayes at the
>> same
>> time? From what I understand, DCC is a subscription-based service, so it
>> would be nice to be able to cut that cost out!
>>
>
> No, it is not. It only requires you to use servers other than the public
> DCC servers when your daily rate is over 200k.  The server must share the
> checksums with the DCC network (otherwise you couldn't catch those spams
> at all).  If you have that many messages daily, it would not even be a bad
> idea to have DCC locally.
>
>
>  score, but we'd trust Bayes to subtract points from messages it is
>> confident are ham.
>>
>
> I rarely have BAYES_9x FPs and BAYES_0x FNs. While BAYES is great, I don't
> consider it to be 100%
> --
> Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
> Warning: I wish NOT to receive e-mail advertising to this address.
> Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
> Saving Private Ryan...
> Private Ryan exists. Overwrite? (Y/N)
>


Re: Bayes + DCC / Bayes as a false-positive killer

2013-05-29 Thread Andrew Talbot
Hi there, RW-

Thank you for your response. A lot of interesting points in there. The
issue with something like Bogofilter or its ilk is that it:
1- Requires manual intervention from users (we don't have access to the
content of their messages)
2- Apparently doesn't scale well to huge client bases with all kinds of
diverse businesses. Our clients range from banking institutions to
employment agencies to ... ehh... purveyors of adult objects. So it's tough
to find commonalities, and since we're so large, we can't exactly have
different user accounts for each.

Go figure.

Bayes performs beautifully in my test environment. I just need to find
that extra "WOW" factor. I thought that saving the cost on DCC would be it
but ... That didn't seem to make a difference. Go figure.


On Wed, May 29, 2013 at 8:02 AM, RW  wrote:

> On Tue, 28 May 2013 16:43:20 -0400
> Andrew Talbot wrote:
>
> > Hey all -
> >
> > I've got two questions:
> >
> > 1-
> >
> >...
> > That said, I'm wondering if it's redundant to run DCC and Bayes at
> > the same time? From what I understand, DCC is a subscription-based
> > service, so it would be nice to be able to cut that cost out!
>
> It depends what you mean by DCC; the basic version is free, but it is
> actually only a way of identifying *bulk* mail, which is why it
> doesn't score all that much. The paid version is a reputation system; it
> doesn't get discussed much here.
>
> Spamassassin is score-based, it doesn't rely on poison-pill rules. It
> doesn't matter that all DCC hits are also Bayes hits provided that
> the FPs and FNs don't also overlap and some spam that hits Bayes is
> pushed over the 5 point threshold by DCC.
>
>
> > As some of you may have known from talking with me over the past few
> > weeks, I've been having a difficult time 'selling' my bosses on the
> > idea of Bayes; it simply doesn't seem to do anything new to them. But
> > looking at the data today, I came up with an idea: use Bayes to
> > reduce false positives.
> >
> > That would mean we'd completely nerf the rules that add points to the
> > score, but we'd trust Bayes to subtract points from messages it is
> > confident are ham.
> >
> > I am aware of how silly that sounds. But would it work? We don't have
> > another way to filter out false positives - we've got tons of ways to
> > add points!
>
> Reducing FPs is already one of the main benefits of Bayes. The trouble
> is that if you rescore it,  you will still be using the Bayes scoreset
> that's optimized around Bayes doing a lot of the spam catching.
>
> I think you'd be better-off scoring Bogofilter, or a similar filter with
> 3-way clustering, into SpamAssassin. You still have the problem of
> learning representative ham if you want accurate ham identification.
>


Re: Bayes + DCC / Bayes as a false-positive killer

2013-05-29 Thread Andrew Talbot
Hi, Dave -

We don't have anything else learning because we deal in such bulk. We're an
email service provider hosting hundreds of thousands of accounts.

Re: Your last line about "I don't understand what their concerns are"
... Welcome to my world. Right now I am manually writing rules - custom
rules - based on the subject lines (only the subject lines) of spam that
gets reported to us. We are very very clearly Doing It Wrong, so I'm trying
to find a way to do it better.

As far as why we can't have Bayes and DCC on at the same time I've got
no idea.

I just work here, Dave! :)

Thank you for your response.


On Tue, May 28, 2013 at 8:12 PM, Dave Warren  wrote:

> On 2013-05-28 13:43, Andrew Talbot wrote:
>
>> As some of you may have known from talking with me over the past few
>> weeks, I've been having a difficult time 'selling' my bosses on the idea of
>> Bayes; it simply doesn't seem to do anything new to them. But looking at
>> the data today, I came up with an idea: use Bayes to reduce false positives.
>>
>
> Do you have anything else that heuristically learns from your mail and
> adapts in real-time to your mail flow?
>
>
>
>
>> That would mean we'd completely nerf the rules that add points to the
>> score, but we'd trust Bayes to subtract points from messages it is
>> confident are ham.
>>
>> I am aware of how silly that sounds. But would it work? We don't have
>> another way to filter out false positives - we've got tons of ways to add
>> points!
>>
>> What do ya'll think?
>>
>
> I think it's a great idea, but that I wouldn't zero out the positive score
> unless it's hurting you, I think I'd just let it do what it does.
>
> If it saves you a subscription service, then that alone should be a strong
> selling point, unless there are false positives (and if so, I'd look into
> tuning your ham training before abandoning all hope)
>
> I guess part of it is that I don't understand what their concerns are with
> using Bayesian learning?
>
> --
> Dave Warren
> http://www.hireahit.com/
> http://ca.linkedin.com/in/davejwarren
>
>


Bayes + DCC / Bayes as a false-positive killer

2013-05-28 Thread Andrew Talbot
Hey all -

I've got two questions:

1-

We're running Bayes and DCC on our server, and we've just been running
Bayes locally to see how well it works. It's been about three weeks now so
I finally really started poring over the results.

One thing I noticed that I thought was a particularly interesting anomaly:
Bayes caught 100% of what DCC caught. 100%. Without exception - in
thousands of messages. The reverse wasn't true at all.

That said, I'm wondering if it's redundant to run DCC and Bayes at the same
time? From what I understand, DCC is a subscription-based service, so it
would be nice to be able to cut that cost out!



2-

As some of you may have known from talking with me over the past few weeks,
I've been having a difficult time 'selling' my bosses on the idea of Bayes;
it simply doesn't seem to do anything new to them. But looking at the data
today, I came up with an idea: use Bayes to reduce false positives.

That would mean we'd completely nerf the rules that add points to the
score, but we'd trust Bayes to subtract points from messages it is
confident are ham.

I am aware of how silly that sounds. But would it work? We don't have
another way to filter out false positives - we've got tons of ways to add
points!

What do ya'll think?


Bayes autolearning: logarithmic?

2013-05-22 Thread Andrew Talbot
Hey all -

I set up Bayes with autolearning a few weeks ago. It took forever to get
started, but now it seems like the learning speed has accelerated.

Is the autolearning supposed to accelerate? I can't help but feel like it
may just be feeding itself its own data or something.


RE: Default Bayes Database

2013-05-10 Thread Andrew Talbot
You all are keeping me sane and grounded as I deal with the Powers That Be
here trying to set this up. It's good to know that I'm not wrong (I agree
with everything everyone has said, and pointed out from the beginning a
default database would be awful). 

And this:  "If he insists on starting with a pre-populated Bayes database,
he sure knows why. Other than "I'm the boss, I want.""  ... Is exactly
right too. 

We're implementing it locally with auto-learning enabled this weekend (oh,
yeah, boss didn't want auto-learning enabled either..). 

So here goes!! 

Thanks for all your help. 


> -Original Message-
> From: Karsten Bräckelmann [mailto:guent...@rudersport.de]
> Sent: Wednesday, May 08, 2013 8:18 PM
> To: users@spamassassin.apache.org
> Subject: Re: Default Bayes Database
> 
> On Wed, 2013-05-08 at 14:09 -0400, Andrew Talbot wrote:
> > Well, I certainly hope someone offers to help!
> 
> Heh! I am really confident, Alex didn't mean to be rude, neither that he
> actually hopes no one will help you. Quite the contrary...
> 
> He DID try to help you by explaining why a "default Bayes database" is a
> bad idea in the first place. And that was his way of telling you...
> 
> > If only to say "there is no default database."
> 
> That. :)  There is none, and there never has been.
> 
> 
> > As we've spoken about off-list, my boss is being very particular about
> > the deployment of Bayes, and it sounds like one of his caveats is that
> > we don't start from a blank database.
> 
> I can see how the idea of basing off of some "known to be classified"
> tokens sounds tempting. However, there is no such token. None. Just try to
> imagine working in an industry where e.g. Viagra and Cialis are totally
> legit phrases to use...
> 
> Feel free to direct your boss here. If he insists on starting with a pre-
> populated Bayes database, he sure knows why. Other than "I'm the boss, I
> want."
> 
> 
> Anyway, Andrew, your idea of that whole "blank slate" is inaccurate. If
> you import someone else's data, before importing your database has been
> empty.
> 
> If you collect some ham and spam for initial training, before training
> your
> database has been empty.
> 
> You even do NOT have to deploy SA prior to that. I don't know the size of
> your user base, but it seems it shouldn't be hard to have a few of the
> users chip in. Get a few of them to collect hand-classified ham and spam
> for you. Train Bayes with that. After that, deploy SA to your mail
> processing chain.
> 
> There you go! A pre-populated Bayes database, based on YOUR particular
> ham and spam tokens, before deploying SA in production.
> 
> 
> --
> char
> *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4
> ";
> main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i (c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0;
}}}
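
A minimal sketch of that pre-training step in command form, assuming the
hand-classified mail has been collected into these (hypothetical)
directories:

sa-learn --ham  /path/to/classified/ham
sa-learn --spam /path/to/classified/spam
sa-learn --dump magic    # sanity-check the nham/nspam counts afterwards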




RE: Default Bayes Database

2013-05-08 Thread Andrew Talbot
Well, I certainly hope someone offers to help! 

If only to say "there is no default database." 

As we've spoken about off-list, my boss is being very particular about the
deployment of Bayes, and it sounds like one of his caveats is that we don't
start from a blank database. 

For the record, I agree with your logic completely .. And I hate to say
stupid things like this, but it doesn't even matter to me if the tokens in
the default database are useless at this point, or if there are only 20 of
them. I just need to get this deployed so it can start learning. 




> -Original Message-
> From: Axb [mailto:axb.li...@gmail.com]
> Sent: Wednesday, May 08, 2013 1:32 PM
> To: users@spamassassin.apache.org
> Subject: Re: Default Bayes Database
> 
> On 05/08/2013 07:26 PM, Andrew Talbot wrote:
> > Hey all -
> >
> > I remember seeing somewhere that there was a default Bayes database
> > for Bayes to start using right away, but can't seem to find that
> > information again on the Wiki or in my notes.
> >
> > Can someone please help?
> 
> I hope nobody offers to help.
> 
> Why?
> - your HAM is somebody else's SPAM
> - A decent Bayes DB is highly dynamic and yesterday's tokens from someone
> else's traffic will be useless to your traffic, today.
> - If you have a decent traffic flow, it takes less than 4 hours of
> autolearning with YOUR data to see Bayes scoring.
> 




Default Bayes Database

2013-05-08 Thread Andrew Talbot
Hey all -

I remember seeing somewhere that there was a default Bayes database for
Bayes to start using right away, but can't seem to find that information
again on the Wiki or in my notes.

Can someone please help?


RE: Bayes Autolearning

2013-05-01 Thread Andrew Talbot
Hey there, thanks for responding. That's an interesting point.

Are you saying I should not use autolearning at all? 

I don't have any way to review a large corpus of messages because we don't
have access to them - after they run through our servers they are sent on,
and the text of the message is not stored on our server. 

Man, I wish there was an easier way to feed Bayes an initial set of spam/ham
to teach it properly .. I've been told that letting it autolearn for a few
hours/days would make it learn well enough though.

If only our mail server got 100 messages a day - then I could just
manually mark them! :) 




> -Original Message-
> From: RW [mailto:rwmailli...@googlemail.com]
> Sent: Wednesday, May 01, 2013 6:24 PM
> To: users@spamassassin.apache.org
> Subject: Re: Bayes Autolearning
> 
> On Wed, 01 May 2013 22:02:43 +0100
> Steve Freegard wrote:
> 
> > On 01/05/13 19:40, Andrew Talbot wrote:
> > > Hi, Seve -
> > >
> > > Thanks for your response. Is that just for performance reasons?
> > >
> >
> > Performance is one of the things that bayes_auto_learn_on_error 1 will
> > give you.  It means that if the message was already considered spam by
> > Bayes, then the message won't be autolearnt again which means
> > a bit less IO.   It will also result in the Bayes databases being
> > smaller, as it is likely that with this option fewer tokens will be
> > present overall which will also save disk IO and space.
> >
> > But the key reason I like this option is that it doesn't allow bayes
> > to overtrain in one direction (e.g. spam or ham).  It only autolearns
> > when Bayes either has the wrong result or isn't sure which IMO has to
> > be better for accuracy in the long run.
> 
> The evidence from trials with Bogofilter (which is similar to Bayes)
> showed that initially train-on-everything significantly outperforms
> train-on-error. The latter asymptotically catches up after thousands of
> errors. It seems that the most important thing is to learn a few thousand
> hams and spams by any means; and train-on-error can take a long time to
> get there. For this reason DSPAM only allows train-on-error when 2500
> hams have been learned.
> 
> There *may* be advantages to train-on-error after this in preventing BAYES
> becoming insensitive to learning.
> 
> The chief problem with autolearning is learning ham. If you set a positive
> threshold you end up learning a lot of spam as ham; if you set a negative
> threshold you effectively turn over ham training to the DNS whitelists,
> since they are the only tests with significant negative scores that aren't
> excluded from autolearning. Any problems with mis-learning are likely to be
> exacerbated by train-on-error.
> 
> If I had to use autolearning I'd mark the DNS whitelists as noautolearn
> and write some negative-scoring, site-specific rules.
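
A sketch of that last suggestion (rule name, domain and scores are
placeholders; RCVD_IN_DNSWL_HI is a stock rule, and its nice/net flags are
repeated here in case a local tflags line overrides the stock ones):

tflags   RCVD_IN_DNSWL_HI     nice net noautolearn
header   LOCAL_KNOWN_PARTNER  From:addr =~ /\@partner\.example\.com$/i
describe LOCAL_KNOWN_PARTNER  Mail claiming to come from a known partner domain
tflags   LOCAL_KNOWN_PARTNER  nice
score    LOCAL_KNOWN_PARTNER  -2.0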



RE: Bayes Autolearning

2013-05-01 Thread Andrew Talbot
Hi, Steve -

Thanks for your response. Is that just for performance reasons?




> -Original Message-
> From: Steve Freegard [mailto:steve.freeg...@fsl.com]
> Sent: Wednesday, May 01, 2013 2:24 PM
> To: users@spamassassin.apache.org
> Subject: Re: Bayes Autolearning
> 
> All good advice there from Axb; the only thing I'd add to that is:
> 
> bayes_auto_learn_on_error 1
> 
> Which prevents Bayes from over-training when the classifier already agrees
> with what the autolearn is trying to train on.
> 
> Cheers,
> Steve.
> 
> On 01/05/13 19:14, Axb wrote:
> > On 05/01/2013 08:01 PM, Andrew Talbot wrote:
> >
> >> Any suggestions any of you have for a Bayes newbie - about what I
> >> just asked or otherwise - would be very much appreciated.
> >
> > I advocate autolearning as it has always worked fine for me.
> > Can take  a bit longer to see good results but with some tuning I can
> > sit back and hear it purr and not worry about collecting ham and spam
> > and training, which under certain circumstances may even be impossible.
> >
> > Before moving on to Redis, these were my bayes settings
> >
> > # bayes.cf
> >
> > use_bayes 1
> > bayes_auto_learn  1
> > bayes_auto_expire  0
> >
> > bayes_learn_to_journal 0
> >
> > # Don't want to wait for the default 200 hams/spams
> > bayes_min_ham_num 20
> > bayes_min_spam_num 20
> >
> > bayes_auto_learn_threshold_nonspam -1.0
> > bayes_auto_learn_threshold_spam 15.0
> >
> >
> > # FILE BASED
> > # mkdir /etc/bayes
> > bayes_path /etc/mail/spamassassin/bayes/bayes
> >
> > # Check permsisions/modify if needed
> > #bayes_file_mode 0666
> >
> > bayes_expiry_max_db_size 35
> > # SDBM is faster than other r/w  DBs
> > bayes_store_module   Mail::SpamAssassin::BayesStore::SDBM
> >
> > # cron weekly
> > #  sa-learn --force-expire
> >
> >
> 




RE: Bayes Autolearning

2013-05-01 Thread Andrew Talbot
Thank you for that! 

Off-list you mentioned that you don't need to set the cron/expire because of
Redis features; why is it commented out here? 




> -Original Message-
> From: Axb [mailto:axb.li...@gmail.com]
> Sent: Wednesday, May 01, 2013 2:14 PM
> To: users@spamassassin.apache.org
> Subject: Re: Bayes Autolearning
> 
> On 05/01/2013 08:01 PM, Andrew Talbot wrote:
> 
> > Any suggestions any of you have for a Bayes newbie - about what I just
> > asked or otherwise - would be very much appreciated.
> 
> I advocate autolearning as it has always worked fine for me.
> Can take a bit longer to see good results but with some tuning I can sit
> back
> and hear it purr and not worry about collecting ham and spam and training,
> which under certain circumstances may even be impossible.
> 
> Before moving on to Redis, these were my bayes settings
> 
> # bayes.cf
> 
> use_bayes 1
> bayes_auto_learn  1
> bayes_auto_expire  0
> 
> bayes_learn_to_journal 0
> 
> # Don't want to wait for the default 200 hams/spams
> bayes_min_ham_num 20
> bayes_min_spam_num 20
> 
> bayes_auto_learn_threshold_nonspam -1.0
> bayes_auto_learn_threshold_spam 15.0
> 
> 
> # FILE BASED
> # mkdir /etc/bayes
> bayes_path /etc/mail/spamassassin/bayes/bayes
> 
> # Check permsisions/modify if needed
> #bayes_file_mode 0666
> 
> bayes_expiry_max_db_size 35
> # SDBM is faster than other r/w  DBs
> bayes_store_module   Mail::SpamAssassin::BayesStore::SDBM
> 
> # cron weekly
> #  sa-learn --force-expire




Bayes Autolearning

2013-05-01 Thread Andrew Talbot
Hey All -

 

I'm about to set up Bayes on one of our mail servers. A lot of the
documentation says that I need to manually sift through a few hundred
messages and classify them to 'teach' the filter, and it sounds like I may
need to do that on an ongoing basis. 

 

That is not a very plausible solution - our servers process about 2 million
messages a day.

 

Does Bayes start out with a completely blank slate? That is, if I never have
it learn anything from my servers, will it still be pulling from something
already defined?

 

Can I set it to autolearn and leave it be? Or will it require continual
maintenance and manual message feeding?

 

Any suggestions any of you have for a Bayes newbie - about what I just asked
or otherwise - would be very much appreciated :)

 

 

 

 

 



RE: More longer rules or fewer shorter ones?

2013-04-26 Thread Andrew Talbot
Martin -

Interesting. How many mailboxes does your deployment cover?



-Original Message-
From: Martin Gregorie [mailto:mar...@gregorie.org] 
Sent: Thursday, April 25, 2013 8:08 PM
To: users@spamassassin.apache.org
Subject: Re: More longer rules or fewer shorter ones?

On Thu, 2013-04-25 at 18:45 -0400, Andrew Talbot wrote:

> I like your point about the portmanteau rules (and I award you two 
> Points for using one of my favorite words in a new - yet appropriate - 
> manner!).
> 
:-)


> I never thought about scoring each rule as a 0.001 or something really 
> low then tying them all together with meta-rules. It's been a while 
> since I separated everything out but I believe I have around 1000 
> different checks (most of them portmanteau'd) so it seems like those 
> meta rules would just get ... Messy. But it's a good idea, and I think 
> I can especially make use of it in my Specific Word list.
> 
The metas aren't too bad, though I must admit to building some of them as metas 
of metas to keep all lines down to 72 chars or so. Most of these submetas are 
simply lists of other rules that have been ANDed or ORed together.

You may find that the Portmanteau Generator reduces your rule count because it 
too can generate metas, which I use to deal with situations where a term can 
appear in more than one case, e.g. a generated rule can have this form:

describe GENRULE Example rule
header   __GR1   Reply-To =~ /(\@spam1\.com|\@spammer\.co\.uk)/
header   __GR2   From =~ /(\@spam1\.com|\@spammer\.co\.uk)/
uri      __GR3   /(spam1\.com|spammer\.co\.uk)/
meta     GENRULE (__GR1 || __GR2 || __GR3)
score    GENRULE 1.5

which has two advantages. First, that GENRULE is a single name that covers the 
same spammy term regardless of where it was used and secondly, since each 
generated rule has its own source file, this makes the three related lists 
easier to edit, since there's a good chance that a spammy term might be used in 
more than one of the related lists.
  
> Keeping the rules under 1-2mb is a good rule of thumb to follow.
> Luckily we're nowhere near that point yet. 
> 
Nor am I. As I said, my biggest generated rule is a bit over 9 KB.

> Can I ask how many rules you have, and how many of those are meta 
> rules?
>
I have 31 portmanteau rules, of which 9 contain metas. Only 12 of these have a
score exceeding 1.0, and those are not usually used as part of higher-level
metarules.

My local.cf is where any very specific rules live, along with the higher level 
metarules that use the low scoring portmanteau rules. This contains 129 rules 
which between them contain 96 'meta' statements. 36 of these have scores of 
under 1.0, so are probably used as components of metarules.  The total number 
of rules was obtained by using grep+wc to count lines containing '^score'.

My local.cf and portmanteau.cf files are both 29 KB in size.


Martin







Re: More longer rules or fewer shorter ones?

2013-04-25 Thread Andrew Talbot
Hi, Martin -



Thank you for your response.



I like your point about the portmanteau rules (and I award you two Points
for using one of my favorite words in a new - yet appropriate - manner!).



I never thought about scoring each rule as a 0.001 or something really low
then tying them all together with meta-rules. It's been a while since I
separated everything out but I believe I have around 1000 different checks
(most of them portmanteau'd) so it seems like those meta rules would just
get ... Messy. But it's a good idea, and I think I can especially make use
of it in my Specific Word list.



It's interesting that you don't use Bayes for the opposite reason that we
don't - we don't do it because of high volume, you don't do it because of
low volume. Go figure.



Keeping the rules under 1-2mb is a good rule of thumb to follow. Luckily
we're nowhere near that point yet.





Can I ask how many rules you have, and how many of those are meta rules?













-Original Message-
From: Martin Gregorie [mailto:mar...@gregorie.org]
Sent: Wednesday, April 24, 2013 3:03 PM
To: users@spamassassin.apache.org
Subject: Re: More longer rules or fewer shorter ones?



On Wed, 2013-04-24 at 12:32 -0400, Andrew Talbot wrote:

> I have my customized deployment split up into a bunch of separate CF

> files (by category) and I have those further split up into rules based

> on score.

>

I also use very long rules, mainly due to spamiferous mailing lists,
because all the headers are pretty much the same (apart from sender names),
so about all you're left with for spam recognition is the body content.



I found a problem with very long rules, where for me 'very long' means
"rules longer than the width of my editor's screen". I refer to these as
'portmanteau rules' (private slang). As I hate editing anything that's
longer than my editor's text line and find it particularly annoying to deal
with such a line containing a regex consisting of a lot of alternates, I
wrote a portmanteau rule generator to make their maintenance a bit easier.
It is a gawk script that assembles an arbitrarily long rule from a file
containing rule fragments (regexes, etc.) that are each placed on a separate
line. Since it sounds as though you
may have a similar problem, you may also find it useful. You can find it
and its documentation here:

http://www.libelle-systems.com/free/portmanteau/portmanteau.tgz



I find it particularly helpful to make the portmanteau rules fairly low
scoring and to combine them into higher scoring meta-rules, e.g. if I'm
trapping sales spiel I'll have a portmanteau rule listing selling phrases,
one containing monetary terms and another containing product terms and
names, all scored at 0.001. I'll also have a meta-rule that ANDs these
three rules together and scores around 5. This approach is much better at
distinguishing spam from ham than a series of higher scoring non-meta rules
and has the additional benefit of recognising sales-related text from
previously unseen combinations of elements in the three rules.
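
A sketch of that three-list AND, with invented names and phrases:

body  LOCAL_SALES_SPIEL    /\b(act now|limited time offer|order today)\b/i
score LOCAL_SALES_SPIEL    0.001
body  LOCAL_MONEY_TERMS    /\b(discount|cash back|lowest price)\b/i
score LOCAL_MONEY_TERMS    0.001
body  LOCAL_PRODUCT_TERMS  /\b(widget|gadget|gizmo)\b/i
score LOCAL_PRODUCT_TERMS  0.001
meta  LOCAL_SALES_COMBO    (LOCAL_SALES_SPIEL && LOCAL_MONEY_TERMS && LOCAL_PRODUCT_TERMS)
score LOCAL_SALES_COMBO    5.0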



BTW, I don't use Bayes because my mail volume is small and I have
difficulty collecting decent training corpuses and find my current setup
easier to manage.





  They are WAY longer than that (and some of them include further nesting
of the pipe), but that's the general idea.



> My question is: is it better performance-wise to have the rules set up

> like this, or to have each separate thing have its own separate rule?

>

What JH said. When I was thinking of setting up this approach I asked about
performance and limits on the size of the generated rules and was told that
I shouldn't worry about rule size until they exceeded a megabyte or two.
Currently my longest rule is just over 9KB, with the averages being just
under 1KB and 51 alternates per rule.



Martin


RE: More longer rules or fewer shorter ones?

2013-04-24 Thread Andrew Talbot
Hi again, John -

It's a good idea to add the real-time terms to the front of the alternation
list. I didn't realize that would have such an impact. And the (?=x) tip is a
good one too; thank you for that.

As far as Bayes, don't get me started! :)  I work for an Email Service
Provider and about 2 million messages go through our servers every day, so
we have Bayes turned off because it would be too computationally expensive.
I wish we could turn it on - it'd certainly make my job easier - but The
Boss says no. Go figure. Autolearn, same story. 

Having such a large organization makes it a difficult balance to avoid false
positives, too. We have one client who deals with credit reports and
refinancing and stuff and pretty much every message that goes to their
mailboxes looks like spam. We just have them set up to avoid all our
financial rules. 

Luckily we don't have too many doctors' offices so we needn't really concern
ourselves with legitimate Viagra email! :) 

I've scoured the net looking for rulesets from others that already have a
lot of this stuff in there but I haven't found any rulesets since 2006. A
lot of what I've seen is irrelevant - do you know a good place to get custom
rulesets? I feel like there's someone else out there who already figured out
how to write a rule that captures all those "learn a new language" spam
messages so I don't need to just score "Language" as +4 ! : )





-Original Message-
From: John Hardin [mailto:jhar...@impsec.org] 
Sent: Wednesday, April 24, 2013 1:53 PM
To: users@spamassassin.apache.org
Subject: RE: More longer rules or fewer shorter ones?

On Wed, 24 Apr 2013, Andrew Talbot wrote:

> John,
>
> Thanks for your prompt response!
>
> A lot of the rules are big jumbles of rules we are generating in real 
> time and adding to as things come in. Like I said in my original 
> question, we have them separated into separate cf files by category, 
> and within those cf files they are separated by score. So we have just 
> absolutely gargantuan rules for (for instance) sex words that we assign a
> 5 to automatically.
> There's also lists of specific words and phrases that we see in 
> real-time spam (like the *$#ing garden hose spam).
>
> We are just tacking new rules on to the end to make them easier to 
> read. Our rules properly work with (this|that|theother) if it hits any 
> one of the words.
>
> Should we maybe have separate rules for all the phrases, since they're 
> longer strings? There's rules in there that are like RULE Subject =~
> /you.have.(new|waiting|blah|blah).*(ecard|message|calendar.invite|blah
> |blah)
> )|(garden|new|stretchy|bendy|whatever).*(hose|vaccum|other.thing) . . .  .
.
> .
>
> Etc. It goes on. .. My syntax is terrible and obviously those aren't 
> the actual rules but the point is that it's a bunch of "Or" for some 
> really long strings. Should I separate them out and have those long 
> (this|that|theother) rules be only for single words?

Simple alternations on phrases are equivalent to simple alternations on
single words with respect to the performance concerns. Performance is more
governed by the number of alternations and the presence of repetition and
.* than their raw length. You might want to limit the total number of
alternations per rule.

Another performance optimization would be to ensure all of the alternations
in a given rule start with the same letter, and put (?=x) before the list of
alternatives, e.g. /\b(?=x)(x1|x2|x3|x4)/, so that the engine can skip more
easily.
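
As an illustration of the (?=x) trick, with made-up terms that all start with
"v" (name and score are placeholders):

header LOCAL_V_WORDS  Subject =~ /\b(?=v)(viagra|valium|vicodin)\b/i
score  LOCAL_V_WORDS  1.0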

If they are simple alternations, it also depends on how you want to score
them.

For "poison pill" words or phrases, sure, a long alternation with a high
score will be pretty efficient. I'd suggest tacking new hits onto the
*front* of the list of alternatives, though, as it's reasonable to assume a
spam run will use the same phrasing for a while, then change.

> Alternately, should I separate out the rules with embedded pipes in 
> them (like in the example above)?

Yeah, avoiding nested alternatives where possible will help.

Is Bayes not catching things like this?

> -Original Message-----
> From: John Hardin [mailto:jhar...@impsec.org]
> Sent: Wednesday, April 24, 2013 12:58 PM
> To: users@spamassassin.apache.org
> Subject: Re: More longer rules or fewer shorter ones?
>
> On Wed, 24 Apr 2013, Andrew Talbot wrote:
>
>> Hey, all -
>>
>> I have my customized deployment split up into a bunch of separate CF 
>> files (by category) and I have those further split up into rules 
>> based on
> score.
>>
>> So, I have a bunch of stuff like:
>>
>> header RULE_1 Subject =~ /\b(this|that|theother|blah|blah)/i
>> score RULE_1 1
>> describe RULE_1 Rule 1
>

RE: More longer rules or fewer shorter ones?

2013-04-24 Thread Andrew Talbot
John, 

Thanks for your prompt response!

A lot of the rules are big jumbles of rules we are generating in real time
and adding to as things come in. Like I said in my original question, we
have them separated into separate cf files by category, and within those cf
files they are separated by score. So we have just absolutely gargantuan
rules for (for instance) sex words that we assign a 5 to automatically.
There's also lists of specific words and phrases that we see in real-time
spam (like the *$#ing garden hose spam).

We are just tacking new rules on to the end to make them easier to read. Our
rules properly work with (this|that|theother) if it hits any one of the
words. 

Should we maybe have separate rules for all the phrases, since they're
longer strings? There's rules in there that are like RULE Subject =~
/you.have.(new|waiting|blah|blah).*(ecard|message|calendar.invite|blah|blah)
)|(garden|new|stretchy|bendy|whatever).*(hose|vaccum|other.thing) . . .  . .
. 


Etc. It goes on. .. My syntax is terrible and obviously those aren't the
actual rules but the point is that it's a bunch of "Or" for some really long
strings. Should I separate them out and have those long (this|that|theother)
rules be only for single words?

Alternately, should I separate out the rules with embedded pipes in them
(like in the example above)? 


-Original Message-
From: John Hardin [mailto:jhar...@impsec.org] 
Sent: Wednesday, April 24, 2013 12:58 PM
To: users@spamassassin.apache.org
Subject: Re: More longer rules or fewer shorter ones?

On Wed, 24 Apr 2013, Andrew Talbot wrote:

> Hey, all -
>
> I have my customized deployment split up into a bunch of separate CF 
> files (by category) and I have those further split up into rules based on
score.
>
> So, I have a bunch of stuff like:
>
> header RULE_1 Subject =~ /\b(this|that|theother|blah|blah)/i
> score RULE_1 1
> describe RULE_1 Rule 1
>
> header RULE_2 Subject =~ /\b(foo|bar|etc)/i
> score RULE_2 2
> describe RULE_2 Rule 2
>
> They are WAY longer than that (and some of them include further 
> nesting of the pipe), but that's the general idea.
>
> My question is: is it better performance-wise to have the rules set up 
> like this, or to have each separate thing have its own separate rule?

For performance, with simple lists of variant values having no repetition
across the list e.g. (x|y|z){n,m}, if the most-likely variants are listed
first a "big" rule will (generally-speaking) process less than a set of
individual rules for each variant. The big rule will stop trying as soon as
a match for one variant is found, whereas all of the individual rules must
be tried regardless of what other rules may have hit. RULE_1 won't try
matching "that", "theother", "blah", etc. if "this" matches.

Ignoring performance, the alternatives are *not* syntactically equivalent. 
Absent "tflags multiple", RULE_1 would hit only once on a subject containing
both "this" and "that" and "theother", but if you split it up into separate
rules *each* would hit. This likely would affect scoring.
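
As an aside, a single alternation rule can be brought closer to the
per-variant behaviour with "tflags multiple", which lets it score once per
match (rule name and score below are placeholders):

header LOCAL_SUBJ_WORDS  Subject =~ /\b(this|that|theother)\b/i
tflags LOCAL_SUBJ_WORDS  multiple
score  LOCAL_SUBJ_WORDS  0.5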

-- 
  John Hardin KA7OHZhttp://www.impsec.org/~jhardin/
  jhar...@impsec.orgFALaholic #11174 pgpk -a jhar...@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
   Vista "security improvements" consist of attempting to shift blame
   onto the user when things go wrong.
---
  328 days since the first successful private support mission to ISS
(SpaceX)



More longer rules or fewer shorter ones?

2013-04-24 Thread Andrew Talbot
 

Hey, all -

 

I have my customized deployment split up into a bunch of separate CF files
(by category) and I have those further split up into rules based on score.

 

So, I have a bunch of stuff like:

header RULE_1 Subject =~ /\b(this|that|theother|blah|blah)/i

score RULE_1 1

describe RULE_1 Rule 1

 

header RULE_2 Subject =~ /\b(foo|bar|etc)/i

score RULE_2 2

describe RULE_2 Rule 2

 

They are WAY longer than that (and some of them include further nesting of
the pipe), but that's the general idea.

 

My question is: is it better performance-wise to have the rules set up like
this, or to have each separate thing have its own separate rule?