Re: Spam Pattern

2014-02-14 Thread Amir Caspi
On Feb 14, 2014, at 1:04 PM, Adam Katz  wrote:

> Noo, don't do that.  (?:\s*\w+)+  is a ReDoS bomb (and you have it ten 
> times!) which will destroy your efficiency.
Whoops, you're very right.  Removing the + after the \w (that is, turning it to 
(?:\s*\w)+ ) should match the same things but without this exponential 
branching... I think.
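
A quick way to check the "should match the same things" claim is to compare the two forms exhaustively on short strings. This is only a sketch in Python rather than Perl, under the assumption that the two engines treat \s, \w and + alike for ASCII input; the helper name same_language is made up for illustration.

```python
import itertools
import re

NESTED = re.compile(r"(?:\s*\w+)+")  # original form: a quantifier inside a quantifier
FLAT = re.compile(r"(?:\s*\w)+")     # proposed form: one word character per iteration


def same_language(alphabet="a -", max_len=6):
    """Return True if both patterns fully match exactly the same short strings."""
    for length in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=length):
            text = "".join(chars)
            if bool(NESTED.fullmatch(text)) != bool(FLAT.fullmatch(text)):
                return False
    return True


print(same_language())
```

The check covers only strings of up to six characters over a three-character alphabet, so it supports the claim rather than proving it.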

--- Amir

Re: Spam Pattern

2014-02-14 Thread Adam Katz
On 02/14/2014 11:23 AM, Amir Caspi wrote:
> To be clear, that wasn't my sample; I am not the originator of this
> thread.

Whoops, my bad.  My point was clear anyway.

> What about this, a variant of what I posted earlier?  It requires 10
> matches, but I believe it does the same thing as yours except it does
> not limit the word size between hashes, and allows for whitespace:
>
> rawbody AC_REPEATED_HASHCODE /(\s[a-f0-9]{25,}\s)(?:(?:\s*\w+)+\1){10}/
>
> Yours also limits the amount of characters between repeated hashes to
> 99, but this might well not be the case.

Noo, don't do that.  (?:\s*\w+)+  is a ReDoS bomb (and you have it ten
times!) which will destroy your efficiency.  Think about how it would
match the string "aaaaaa" (or ANY word, for that matter).  Here are its
trials, matching each of the nested parentheses to illustrate the logic:

 1. (aaaaaa)
 2. (aaaaa)(a)
 3. (aaaa)(aa)
 4. (aaaa)(a)(a)
 5. (aaa)(aaa)
 6. (aaa)(aa)(a)
 7. (aaa)(a)(a)(a)
 8. (aa)(aaaa)
 9. (aa)(aaa)(a)
10. (aa)(aa)(aa)
11. (aa)(a)(aaa)
12. (aa)(a)(aa)(a)
13. (aa)(a)(a)(aa)
14. (aa)(a)(a)(a)(a)
15. (a)(aaaaa)
16. (a)(aaaa)(a)
17. (a)(aaa)(aa)
18. (a)(aaa)(a)(a)
19. (a)(aa)(aaa)
20. (a)(aa)(aa)(a)
21. (a)(aa)(a)(a)(a)
22. (a)(a)(aaaa)
23. (a)(a)(aaa)(a)
24. (a)(a)(aa)(aa)
25. (a)(a)(aa)(a)(a)
26. (a)(a)(a)(aaa)
27. (a)(a)(a)(aa)(a)
28. (a)(a)(a)(a)(aa)
29. (a)(a)(a)(a)(a)(a)
30. (no match)

You want to fail faster than that!
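
To put a number on how quickly this grows, here is a small sketch that enumerates the ways a backtracking engine could carve one word into \w+ chunks before giving up. It is an illustration in Python, not a trace of what Perl actually does (real engines apply optimizations), and the function name splits is invented for the example.

```python
def splits(word):
    """Yield every way to cut `word` into non-empty consecutive chunks."""
    if not word:
        yield []
        return
    # Try the longest first chunk first, as a greedy quantifier would.
    for size in range(len(word), 0, -1):
        for rest in splits(word[size:]):
            yield [word[:size]] + rest


for length in (2, 4, 8, 16):
    count = sum(1 for _ in splits("a" * length))
    print(length, count)  # count equals 2 ** (length - 1)
```

Each extra character doubles the number of candidate splits, which is why a single long word followed by a failing \1 can be so costly.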

I call these ReDoS "bombs" though Wikipedia uses the term "evil."  Given
how they're rarely intended, I don't like that term.  An actual evil
ReDoS, snuck in and uncaught, would be exploited in a "ReDoS attack." 
(A ReDoS attack could also exploit an unintended bomb.)




Re: Spam Pattern

2014-02-14 Thread Amir Caspi
On Feb 14, 2014, at 11:53 AM, Adam Katz  wrote:

> some of your sample's strings had an extra character on the end.
> 

To be clear, that wasn't my sample; I am not the originator of this thread.

> This version of the rule is more expensive, but is safer to score higher
> (maybe 3-4 points):
>
> body      HEXHASH_WORD_5  /\b[a-z]{1,10}\s([0-9a-f]{30})(?:.{0,99}\b[a-z]{1,10}\s\1){4}/
> describe  HEXHASH_WORD_5  Five copies of the same hexadecimal hash, each following a word

What about this, a variant of what I posted earlier?  It requires 10 matches, 
but I believe it does the same thing as yours except it does not limit the word 
size between hashes, and allows for whitespace:

rawbody AC_REPEATED_HASHCODE /(\s[a-f0-9]{25,}\s)(?:(?:\s*\w+)+\1){10}/

Yours also limits the amount of characters between repeated hashes to 99, but 
this might well not be the case.

> I know you don't have Bayes enabled

Again to reiterate, I'm not the originator...

Cheers.

--- Amir



Re: Spam Pattern

2014-02-14 Thread Adam Katz
Ha!  I checked my mail before sending this; we're on the same wavelength
yet our emails are out of sync.  You just suggested the same thing I was
leaning on.

On 02/14/2014 10:53 AM, John Hardin wrote:
> S/O is a little surprising:
>
> http://ruleqa.spamassassin.org/?daterev=20140213-r1567864-n&rule=%2FHEXHASH
>
>
> I'm curious as to what hits that in ham...
>
> Perhaps more repetitions would improve that?

I'm actually thinking of replacing the leading \b with a \s to avoid
matching paths and extensions and maybe requiring two preceding words to
avoid a list of file/md5 pairings.  We can experiment with different hit
thresholds as well.

body      __HEXHASHWORD   /(?:\s[a-z]{1,10}){2}\s[0-9a-f]{30}/
tflags    __HEXHASHWORD   multiple maxhits=8
meta      HEXHASH_WORD_5  __HEXHASHWORD >= 5
describe  HEXHASH_WORD_5  5 hexadecimal hashes, each following two words
meta      HEXHASH_WORD_6  __HEXHASHWORD >= 6
describe  HEXHASH_WORD_6  6 hexadecimal hashes, each following two words
meta      HEXHASH_WORD_7  __HEXHASHWORD >= 7
describe  HEXHASH_WORD_7  7 hexadecimal hashes, each following two words
meta      HEXHASH_WORD_8  __HEXHASHWORD >= 8
describe  HEXHASH_WORD_8  8 hexadecimal hashes, each following two words


Users:  Do /not/ implement all of these at once.  This is for Rule QA
testing only.  Once we have results, we can figure out which threshold
is best and then come up with a suggestion or published rule.  (Maybe
tflags nopublish is wise here.)
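
For anyone who wants to try the thresholds on a saved message outside SpamAssassin, the counting can be approximated as below. This is a rough sketch only: it assumes the whole body is one string, whereas SA feeds body rules rendered text in its own chunks, and the names count_hits and thresholds are invented for the example.

```python
import re

# Same pattern as __HEXHASHWORD above; Python's re accepts it unchanged.
HEXHASHWORD = re.compile(r"(?:\s[a-z]{1,10}){2}\s[0-9a-f]{30}")


def count_hits(body, maxhits=8):
    """Count non-overlapping matches, stopping at maxhits like 'multiple maxhits=N'."""
    hits = 0
    for _ in HEXHASHWORD.finditer(body):
        hits += 1
        if hits >= maxhits:
            break
    return hits


def thresholds(hits):
    """Mimic the four meta rules: each fires when the hit count reaches its threshold."""
    return {f"HEXHASH_WORD_{n}": hits >= n for n in (5, 6, 7, 8)}


# Made-up body: six "two words plus 30 hex characters" groups.
sample = " some words 0123456789abcdef0123456789abcd" * 6
print(count_hits(sample), thresholds(count_hits(sample)))
```

With six groups in the sample, only the _5 and _6 variants would fire.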




Re: Spam Pattern

2014-02-14 Thread John Hardin

On Fri, 14 Feb 2014, Adam Katz wrote:


> Yes, there is an increased FP risk due to the ability to match different
> hex strings (e.g. a list of checksums).  That's probably where the
> current Rule QA FPs come from.


Good point. Perhaps it should be /\s[a-z]{1,10} rather than /\b[a-z]{1,10}
so that filename extensions don't match.
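
A small illustration of the difference, as a sketch in Python with made-up sample lines (the filename and hash are invented):

```python
import re

WORD_BOUNDARY = re.compile(r"\b[a-z]{1,10}\s[0-9a-f]{30}")
LEADING_SPACE = re.compile(r"\s[a-z]{1,10}\s[0-9a-f]{30}")

# Hypothetical inputs: a file/checksum listing and ordinary filler text.
listing = "backup.tar 0123456789abcdef0123456789abcd"
prose = "some text 0123456789abcdef0123456789abcd"

for name, pattern in (("\\b", WORD_BOUNDARY), ("\\s", LEADING_SPACE)):
    print(name, bool(pattern.search(listing)), bool(pattern.search(prose)))
```

The \b form hits on the "tar" extension because the dot counts as a word boundary; the \s form needs whitespace before the word, so the listing no longer matches while the prose line still does.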

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  ...every time I sit down in front of a Windows machine I feel as
  if the computer is just a place for the manufacturers to put their
  advertising. -- fwadling on Y! SCOX
---
 8 days until George Washington's 282nd Birthday


Re: Spam Pattern

2014-02-14 Thread Adam Katz
On Feb 14, 2014, at 11:00 AM, Adam Katz <antis...@khopis.com> wrote:
>>
>> Given the nature of the content, I'd go the other direction and not
>> require the word boundary.  This removes the wildcard, though it
>> doesn't short circuit as quickly, so one could debate which version
>> is more efficient.
>> body      __HEXHASHWORD  /\b[a-z]{1,10}\s[0-9a-f]{30}/
>> tflags    __HEXHASHWORD  multiple maxhits=5
>> meta      HEXHASH_WORD   __HEXHASHWORD > 4
>> describe  HEXHASH_WORD   Five hexadecimal hashes, each following a word
>>

On 02/14/2014 10:12 AM, Amir Caspi wrote:
>
> The main issue I have with the code above, or any tflags=multiple
> code, is that it doesn't require the _same_ hex string, just _any_ 5
> hex strings within an email.  Granted, the emails where that appears
> are likely to be spam, but they may not necessarily be.  I think
> forcing the repetition check is important, although the only good way
> to do that is with backreferences (as I sent a day or two ago) and
> that is likely a CPU hog.
>

Yes, there is an increased FP risk due to the ability to match different
hex strings (e.g. a list of checksums).  That's probably where the
current Rule QA FPs come from.  Still, it gets a decent .968 S/O
(relative precision) with a very small number of FPs (0.0104%).  Based
on this, it is likely safe to assign this rule a point or so.

If you want to assign a high score (3+), you'd be absolutely correct on
needing the full match (though watch for truncation; some of your
sample's strings had an extra character on the end).

This version of the rule is more expensive, but is safer to score higher
(maybe 3-4 points):

body      HEXHASH_WORD_5  /\b[a-z]{1,10}\s([0-9a-f]{30})(?:.{0,99}\b[a-z]{1,10}\s\1){4}/
describe  HEXHASH_WORD_5  Five copies of the same hexadecimal hash, each following a word
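
To see what the backreference buys, the same expression can be run against two made-up bodies, one repeating a single hash and one listing five different ones. A sketch in Python, on the assumption that its re module treats this pattern the same way Perl does; the sample strings are invented.

```python
import re

# The HEXHASH_WORD_5 expression: \1 must repeat the first captured hash.
SAME_HASH_5 = re.compile(
    r"\b[a-z]{1,10}\s([0-9a-f]{30})(?:.{0,99}\b[a-z]{1,10}\s\1){4}"
)

spam_hash = "0123456789abcdef0123456789abcd"  # 30 hex characters
repeated = " filler ".join(["word " + spam_hash] * 5)

# Five distinct 30-character hex strings, e.g. a checksum listing.
checksums = [format(i, "030x") for i in range(1, 6)]
distinct = " filler ".join("word " + c for c in checksums)

print(bool(SAME_HASH_5.search(repeated)))
print(bool(SAME_HASH_5.search(distinct)))
```

Only the body with five copies of the same hash matches; the checksum-style list does not, which is the FP case the tflags multiple version cannot tell apart.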


I know you don't have Bayes enabled, but Bayes is the best source of
negative points, which is to say that if you had Bayes turned on (and it
weren't enough to catch this spam itself), you could rely on negative
points from Bayes preventing an FP from exceeding your spam threshold
and therefore assign this rule slightly more points.  (Be careful with
that premise, it doesn't scale; Bayes provides a limited number of
negative points and doesn't fire on all ham.)

> Another problem with the above code is that you require only a short
> word (1-10 chars) prior to the hex string.  Some perfectly legitimate,
> or even illegitimate, words could be longer than 10 chars.  I'd
> increase the upper limit to something like 15ish, but, per above, I
> think the potential for FPs is reasonably high here.
>

Your sample did not contain any 7+ character words preceding the long
hex string, so broadening that range beyond the three character buffer
we've already afforded it merely increases your FP risk (note that there
were twelve copies of that string in the sample while the rule only
requires five; I figure there will be five 1-10 char words followed by
30-char hex strings).  File names can be longer and could therefore
become FPs.




Re: Spam Pattern

2014-02-14 Thread John Hardin

On Fri, 14 Feb 2014, Amir Caspi wrote:

> Another problem with the above code is that you require only a short 
> word (1-10 chars) prior to the hex string.  Some perfectly legitimate, 
> or even illegitimate, words could be longer than 10 chars.  I'd increase 
> the upper limit to something like 15ish


Granted. That's indeed part of the tuning procedure. Let's revisit the 
masscheck performance in a couple of days after the changes I just made to 
get a baseline, and then I'll increase it to 20.




Re: Spam Pattern

2014-02-14 Thread John Hardin

On Fri, 14 Feb 2014, Adam Katz wrote:


> Given the nature of the content, I'd go the other direction and not
> require the word boundary.  This removes the wildcard, though it doesn't
> short circuit as quickly, so one could debate which version is more
> efficient.
>
> body      __HEXHASHWORD  /\b[a-z]{1,10}\s[0-9a-f]{30}/

Yeah, that would work. Adjusting sandbox.

> tflags    __HEXHASHWORD  multiple maxhits=5
> meta      HEXHASH_WORD   __HEXHASHWORD > 4
> describe  HEXHASH_WORD   Five hexadecimal hashes, each following a word
>
> I'm curious if the hex string is always so similar; it may be enough to
> use  \bb8b177bf24975  and not need the tflags multiple piece.


I think that would be a little *too* conservative.

S/O is a little surprising:

http://ruleqa.spamassassin.org/?daterev=20140213-r1567864-n&rule=%2FHEXHASH

I'm curious as to what hits that in ham...

Perhaps more repetitions would improve that?




Re: Spam Pattern

2014-02-14 Thread Amir Caspi
On Feb 14, 2014, at 11:00 AM, Adam Katz  wrote:

> Given the nature of the content, I'd go the other direction and not require 
> the word boundary.  This removes the wildcard, though it doesn't short 
> circuit as quickly, so one could debate which version is more efficient.
> body      __HEXHASHWORD  /\b[a-z]{1,10}\s[0-9a-f]{30}/
> tflags    __HEXHASHWORD  multiple maxhits=5
> meta      HEXHASH_WORD   __HEXHASHWORD > 4
> describe  HEXHASH_WORD   Five hexadecimal hashes, each following a word
> I'm curious if the hex string is always so similar; it may be enough to use  
> \bb8b177bf24975  and not need the tflags multiple piece.

The hex string is not always that similar; I've had similar spams with 
completely different strings.  The same string is repeated multiple times per 
email, but it's different in each email.  I would not hardcode the hex string 
at all.

The main issue I have with the code above, or any tflags=multiple code, is that 
it doesn't require the _same_ hex string, just _any_ 5 hex strings within an 
email.  Granted, the emails where that appears are likely to be spam, but they 
may not necessarily be.  I think forcing the repetition check is important, 
although the only good way to do that is with backreferences (as I sent a day 
or two ago) and that is likely a CPU hog.

Another problem with the above code is that you require only a short word (1-10 
chars) prior to the hex string.  Some perfectly legitimate, or even 
illegitimate, words could be longer than 10 chars.  I'd increase the upper 
limit to something like 15ish, but, per above, I think the potential for FPs is 
reasonably high here.

Cheers.

--- Amir

Re: Spam Pattern

2014-02-14 Thread Adam Katz
On 02/12/2014 01:46 PM, John Hardin wrote:
> On Wed, 12 Feb 2014, Axb wrote:
>> On 02/12/2014 10:06 PM, John Hardin wrote:
>>>  Perhaps something like this:
>>>
>>>  body  __HEXHASHWORD   /\b[0-9a-f]{30,}\s[a-z]{1,10}\b/
>>>  tflags    __HEXHASHWORD  multiple maxhits=5
>>>  meta      HEXHASH_WORD   __HEXHASHWORD > 4
>>>  describe  HEXHASH_WORD   Hexadecimal hash followed by a word
>>>
>>>  Added to my sandbox, just in case.
>>
>> John,
>>
>> Isn't {30,} (without a limit) dangerously expensive?
>
> Potentially expensive; the character class and the fact that the
> following atom is not in that class limits the risk - backtracking
> isn't a possibility. However, point taken - recommend {30,64} instead.

Given the nature of the content, I'd go the other direction and not
require the word boundary.  This removes the wildcard, though it doesn't
short circuit as quickly, so one could debate which version is more
efficient.

body  __HEXHASHWORD   /\b[a-z]{1,10}\s[0-9a-f]{30}/
tflags    __HEXHASHWORD  multiple maxhits=5
meta      HEXHASH_WORD   __HEXHASHWORD > 4
describe  HEXHASH_WORD   Five hexadecimal hashes, each following a word

I'm curious if the hex string is always so similar; it may be enough to
use  \bb8b177bf24975  and not need the tflags multiple piece.





Re: Spam Pattern

2014-02-12 Thread Axb

On 02/12/2014 10:46 PM, John Hardin wrote:

> On Wed, 12 Feb 2014, Axb wrote:
>> On 02/12/2014 10:06 PM, John Hardin wrote:
>>> Perhaps something like this:
>>>
>>> body      __HEXHASHWORD  /\b[0-9a-f]{30,}\s[a-z]{1,10}\b/
>>> tflags    __HEXHASHWORD  multiple maxhits=5
>>> meta      HEXHASH_WORD   __HEXHASHWORD > 4
>>> describe  HEXHASH_WORD   Hexadecimal hash followed by a word
>>>
>>> Added to my sandbox, just in case.
>>
>> John,
>>
>> Isn't {30,} (without a limit) dangerously expensive?
>
> Potentially expensive; the character class and the fact that the
> following atom is not in that class limits the risk - backtracking isn't
> a possibility. However, point taken - recommend {30,64} instead.


imo, you don't even need to count that much - I'd stop at sweet 16, 
anything above is pink noise and not waste time chasing spaces & co.






Re: Spam Pattern

2014-02-12 Thread John Hardin

On Wed, 12 Feb 2014, Axb wrote:


> On 02/12/2014 10:46 PM, John Hardin wrote:
>> On Wed, 12 Feb 2014, Axb wrote:
>>> On 02/12/2014 10:06 PM, John Hardin wrote:
>>>> Perhaps something like this:
>>>>
>>>> body      __HEXHASHWORD  /\b[0-9a-f]{30,}\s[a-z]{1,10}\b/
>>>> tflags    __HEXHASHWORD  multiple maxhits=5
>>>> meta      HEXHASH_WORD   __HEXHASHWORD > 4
>>>> describe  HEXHASH_WORD   Hexadecimal hash followed by a word
>>>>
>>>> Added to my sandbox, just in case.
>>>
>>> John,
>>>
>>> Isn't {30,} (without a limit) dangerously expensive?
>>
>> Potentially expensive; the character class and the fact that the
>> following atom is not in that class limits the risk - backtracking isn't
>> a possibility. However, point taken - recommend {30,64} instead.
>
> imo, you don't even need to count that much - I'd stop at sweet 16, anything 
> above is pink noise and not waste time chasing spaces & co.


That increases the FP risk, though. Having just hex strings in an email 
is not inherently a good spam sign, I would think, thus the desire to 
match long hex string + word with no intervening punctuation.




Re: Spam Pattern

2014-02-12 Thread John Hardin

On Wed, 12 Feb 2014, Axb wrote:


> On 02/12/2014 10:06 PM, John Hardin wrote:
>> Perhaps something like this:
>>
>> body      __HEXHASHWORD  /\b[0-9a-f]{30,}\s[a-z]{1,10}\b/
>> tflags    __HEXHASHWORD  multiple maxhits=5
>> meta      HEXHASH_WORD   __HEXHASHWORD > 4
>> describe  HEXHASH_WORD   Hexadecimal hash followed by a word
>>
>> Added to my sandbox, just in case.
>
> John,
>
> Isn't {30,} (without a limit) dangerously expensive?


Potentially expensive; the character class and the fact that the following 
atom is not in that class limits the risk - backtracking isn't a 
possibility. However, point taken - recommend {30,64} instead.




Re: Spam Pattern

2014-02-12 Thread Amir Caspi
On Feb 12, 2014, at 2:13 PM, Axb  wrote:

> Isn't {30,} (without a limit) dangerously expensive?

It has a limit -- the whitespace at the end of the string is required.  In this 
case, it should be fine, the regexp cannot match "infinitely" many characters, 
and it's also sort of required, because if a limit of (say) 100 were enacted, 
then hexhashes with 101+ characters would escape detection.
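
The escape described here is easy to reproduce. A sketch in Python using the {30,64} cap suggested elsewhere in the thread; the sample helper and its inputs are invented for illustration.

```python
import re

UNBOUNDED = re.compile(r"\b[0-9a-f]{30,}\s[a-z]{1,10}\b")
CAPPED = re.compile(r"\b[0-9a-f]{30,64}\s[a-z]{1,10}\b")


def sample(hex_len):
    """Build a made-up line: a hex run of the given length followed by a word."""
    hex_run = ("0123456789abcdef" * 8)[:hex_len]
    return hex_run + " word"


for hex_len in (40, 100):
    text = sample(hex_len)
    print(hex_len, bool(UNBOUNDED.search(text)), bool(CAPPED.search(text)))
```

The capped form misses the 100-character run because the leading \b can only anchor at the start of the hex string, so the match cannot begin partway through it.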

Cheers.

--- Amir



Effectiveness of Bayes poisoning (was Re: Spam Pattern)

2014-02-12 Thread David F. Skoll
On Wed, 12 Feb 2014 13:11:19 -0800 (PST)
John Hardin  wrote:

> That only works if your hammy mail stream contains text that looks
> like the random garbage they put in to try to spoof bayes.

Indeed.  Just for kicks, I ran the OP's pastebin example through our
Bayes database and it scored 99.99% likelihood of spam.  The word
"Wopsle", for example, was a dead giveaway... that never appears in
our ham stream, but has appeared in 93 spams in our database.

Bayes poisoning, in our experience, is only occasionally effective.

Regards,

David.



Re: Spam Pattern

2014-02-12 Thread Axb

On 02/12/2014 10:06 PM, John Hardin wrote:

> On Wed, 12 Feb 2014, Joe Quinn wrote:
>> On 2/12/2014 3:15 PM, John Hardin wrote:
>>> On Wed, 12 Feb 2014, Joe Quinn wrote:
>>>> This pattern has been showing up in a good 80% of spam I have
>>>> looked at in the past month.
>>>>
>>>> Spammers take a few paragraphs out of a large body of text and
>>>> put it at the end of their email. My favorite is one that had the
>>>> scene where Daisy first meets Jay Gatsby.
>>>>
>>>> Sometimes they add some munging, or like in this example they
>>>> insert base64-encoded hashes. We have a rule for the plaintext
>>>> hashes, but does anyone on the list have a good way of detecting
>>>> this?
>>>
>>> Bayes.
>>
>> Any ideas outside of Bayes? We don't currently have it configured, and
>> the setup involved is more than we would like to do for just one rule,
>> if at all possible.
>
> Bayes is very useful, you should reconsider.
>
> Perhaps something like this:
>
> body      __HEXHASHWORD  /\b[0-9a-f]{30,}\s[a-z]{1,10}\b/
> tflags    __HEXHASHWORD  multiple maxhits=5
> meta      HEXHASH_WORD   __HEXHASHWORD > 4
> describe  HEXHASH_WORD   Hexadecimal hash followed by a word
>
> Added to my sandbox, just in case.


John,

Isn't {30,} (without a limit) dangerously expensive?




Re: Spam Pattern

2014-02-12 Thread John Hardin

On Wed, 12 Feb 2014, Amir Caspi wrote:


> On Feb 12, 2014, at 1:15 PM, John Hardin wrote:
>> Bayes.
>
> Well, yes and no.  Bayes isn't very good about detecting this kind of 
> thing per se because it's full of random crap... in fact, they 
> specifically pull text from innocuous things like web reviews, movie 
> reviews, news articles, etc. in the hopes that it contains a lot of 
> hammy tokens that will negate the spammy ones.


That only works if your hammy mail stream contains text that looks like 
the random garbage they put in to try to spoof bayes.




Re: Spam Pattern

2014-02-12 Thread John Hardin

On Wed, 12 Feb 2014, Joe Quinn wrote:


> On 2/12/2014 3:15 PM, John Hardin wrote:
>> On Wed, 12 Feb 2014, Joe Quinn wrote:
>>> This pattern has been showing up in a good 80% of spam I have looked at 
>>> in the past month.
>>>
>>> Spammers take a few paragraphs out of a large body of text and put it at 
>>> the end of their email. My favorite is one that had the scene where 
>>> Daisy first meets Jay Gatsby.
>>>
>>> Sometimes they add some munging, or like in this example they insert 
>>> base64-encoded hashes. We have a rule for the plaintext hashes, but does 
>>> anyone on the list have a good way of detecting this?
>>
>> Bayes.
>
> Any ideas outside of Bayes? We don't currently have it configured, and the 
> setup involved is more than we would like to do for just one rule, if at all 
> possible.


Bayes is very useful, you should reconsider.

Perhaps something like this:

body  __HEXHASHWORD   /\b[0-9a-f]{30,}\s[a-z]{1,10}\b/
tflags    __HEXHASHWORD  multiple maxhits=5
meta      HEXHASH_WORD   __HEXHASHWORD > 4
describe  HEXHASH_WORD   Hexadecimal hash followed by a word

Added to my sandbox, just in case.



Re: Spam Pattern

2014-02-12 Thread Axb

On 02/12/2014 09:02 PM, Joe Quinn wrote:

> This pattern has been showing up in a good 80% of spam I have looked at
> in the past month.
>
> Spammers take a few paragraphs out of a large body of text and put it at
> the end of their email. My favorite is one that had the scene where
> Daisy first meets Jay Gatsby.
>
> Sometimes they add some munging, or like in this example they insert
> base64-encoded hashes. We have a rule for the plaintext hashes, but does
> anyone on the list have a good way of detecting this?
>
> Example: http://pastebin.com/zCStErch


btw - how many BASE8 strings will you find in ham? .-)

that should give you a pointer for a decent rule.



Re: Spam Pattern

2014-02-12 Thread Axb

On 02/12/2014 09:02 PM, Joe Quinn wrote:

> This pattern has been showing up in a good 80% of spam I have looked at
> in the past month.
>
> Spammers take a few paragraphs out of a large body of text and put it at
> the end of their email. My favorite is one that had the scene where
> Daisy first meets Jay Gatsby.
>
> Sometimes they add some munging, or like in this example they insert
> base64-encoded hashes. We have a rule for the plaintext hashes, but does
> anyone on the list have a good way of detecting this?
>
> Example: http://pastebin.com/zCStErch



bayes, bayes, bayes




Re: Spam Pattern

2014-02-12 Thread Amir Caspi
On Feb 12, 2014, at 1:15 PM, John Hardin  wrote:

> Bayes.

Well, yes and no.  Bayes isn't very good about detecting this kind of thing per 
se because it's full of random crap... in fact, they specifically pull text 
from innocuous things like web reviews, movie reviews, news articles, etc. in 
the hopes that it contains a lot of hammy tokens that will negate the spammy 
ones.  On the other hand, there's no real good way of detecting "lots of 
garbage filler text" without a natural language algorithm that could 
heuristically determine whether the primary content (as determined by subject, 
etc.) is related to the filler... and I don't think any such algorithms exist.  
Bayes provides a way of distilling the garbage into tokens and sifting through 
it objectively, so it's the best option, but I wouldn't say it's a method of 
"detecting" this kind of thing.

That said, this particular spam template is interspersed with some sort of 
hashcode which is repeated a number of times.  It could be possible to write a 
rule that matches a long (20-30 chars) alphanumeric string and count 
repetitions; if the same long string is repeated more than (say) 10 times, 
there's a good bet it's an embedded spammy hashcode.

I'd write an example rule but I don't know how to store regexp matches from one 
test to see if they match another test... that is, writing a regexp and using 
tflags multiple on it would be fine if we wanted it to hit on 10 or more long 
strings even if those strings don't match, but if we want to see if there are 
10 or more repeated long strings that are identical, we have to store it 
somehow, and I don't know how to do that with SA.

If SA allows backreferences (since Perl does) then something like the following 
MIGHT work, though I suspect it would be a horrible CPU hog:

rawbody AC_REPEATED_HASHCODE /(\s[A-Za-z0-9]{25,}\s)(?:(?:\s*\w+)+\1){10}/

This will look for a string of 25 or more characters, and then look for 10 more 
repetitions of that string, each preceded by an arbitrary number of words.  This is untested so I 
don't know if it'll work for sure, and I suspect it wouldn't be very friendly 
to the CPU.  The previous method of matching a string, storing it, and looking 
for repetitions of that string, would be preferable, but I don't know how to do 
that with SA.
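
Outside of SA, the "match, store, count" idea is straightforward to prototype. The following is a minimal sketch in plain Python, not an SA rule or plugin; the function name most_repeated_token and the sample body are made up for illustration.

```python
import re
from collections import Counter

# A long alphanumeric run with no alphanumeric character on either side.
TOKEN = re.compile(r"(?<![A-Za-z0-9])[A-Za-z0-9]{25,}(?![A-Za-z0-9])")


def most_repeated_token(body):
    """Return (token, count) for the most frequent long token, or (None, 0)."""
    counts = Counter(TOKEN.findall(body))
    if not counts:
        return None, 0
    return counts.most_common(1)[0]


# Made-up body: one long string repeated 11 times between filler words.
hashcode = "0123456789abcdef0123456789abcd"
body = "buy now " + (hashcode + " lorem ipsum ") * 11
token, count = most_repeated_token(body)
print(token, count, count > 10)
```

Because the count is kept per distinct string, ten different long strings would not trip the check, which is the property the tflags multiple approach lacks.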

--- Amir




Re: Spam Pattern

2014-02-12 Thread RW
On Wed, 12 Feb 2014 15:02:20 -0500
Joe Quinn wrote:

> This pattern has been showing up in a good 80% of spam I have looked
> at in the past month.
> 
> Spammers take a few paragraphs out of a large body of text and put it
> at the end of their email. My favorite is one that had the scene
> where Daisy first meets Jay Gatsby.
> 
> Sometimes they add some munging, or like in this example they insert 
> base64-encoded hashes.

It's not base64, it's just hexadecimal. 

I don't see any particular reason to think they are hashes.

>  We have a rule for the plaintext hashes,

I presume you've mixed up your examples and given the "plaintext"
version.  Base64 should be just as easy to spot because of the way
it's padded out.

 
> Example: http://pastebin.com/zCStErch


Re: Spam Pattern

2014-02-12 Thread Joe Quinn

On 2/12/2014 3:15 PM, John Hardin wrote:

> On Wed, 12 Feb 2014, Joe Quinn wrote:
>> This pattern has been showing up in a good 80% of spam I have looked 
>> at in the past month.
>>
>> Spammers take a few paragraphs out of a large body of text and put it 
>> at the end of their email. My favorite is one that had the scene 
>> where Daisy first meets Jay Gatsby.
>>
>> Sometimes they add some munging, or like in this example they insert 
>> base64-encoded hashes. We have a rule for the plaintext hashes, but 
>> does anyone on the list have a good way of detecting this?
>
> Bayes.

Any ideas outside of Bayes? We don't currently have it configured, and 
the setup involved is more than we would like to do for just one rule, 
if at all possible.


Re: Spam Pattern

2014-02-12 Thread John Hardin

On Wed, 12 Feb 2014, Joe Quinn wrote:

> This pattern has been showing up in a good 80% of spam I have looked at in 
> the past month.
>
> Spammers take a few paragraphs out of a large body of text and put it at the 
> end of their email. My favorite is one that had the scene where Daisy first 
> meets Jay Gatsby.
>
> Sometimes they add some munging, or like in this example they insert 
> base64-encoded hashes. We have a rule for the plaintext hashes, but does 
> anyone on the list have a good way of detecting this?


Bayes.



Spam Pattern

2014-02-12 Thread Joe Quinn
This pattern has been showing up in a good 80% of spam I have looked at 
in the past month.


Spammers take a few paragraphs out of a large body of text and put it at 
the end of their email. My favorite is one that had the scene where 
Daisy first meets Jay Gatsby.


Sometimes they add some munging, or like in this example they insert 
base64-encoded hashes. We have a rule for the plaintext hashes, but does 
anyone on the list have a good way of detecting this?


Example: http://pastebin.com/zCStErch

Regards,
JMQ


Re: New (to me) spam pattern

2007-11-03 Thread John D. Hardin
On Sat, 3 Nov 2007, Chris Edwards wrote:

> On Fri, 2 Nov 2007, Mike Kenny wrote:
> 
> | Thanks John, I had tried this. It appears that the \1 is
> | not defined within the pattern. Only for substitution?
> 
> The regex John posted is fine in SA.
> 
>   //
> 
> Mike, what's going wrong for you ?  A lint error ?  Failure to
> match ?

Confirmed, now that I've had a chance to test it.

Here's a slightly stricter version:

  header XX From =~ /]{1,40})[EMAIL PROTECTED]>/i




Re: New (to me) spam pattern

2007-11-03 Thread Chris Edwards
On Fri, 2 Nov 2007, Mike Kenny wrote:

| Thanks John, I had tried this. It appears that the \1 is not defined within
| the pattern. Only for substitution?

Hi,

The regex John posted is fine in SA.

  //

Mike, what's going wrong for you ?  A lint error ?  Failure to match ?


Re: New (to me) spam pattern

2007-11-02 Thread John D. Hardin
On Fri, 2 Nov 2007, Mike Kenny wrote:

> Thanks John, I had tried this. It appears that the \1 is not
> defined within the pattern. Only for substitution?

It should work within perl match REs per "man perlre". I'm not sure 
how SA changes that context.

You might also try:

  //

but I'm less confident $+ will work in a match (vs. a substitution).
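
For reference, this is how a backreference behaves inside the match itself. The archive has obscured the real address, so the shape below (the local part repeated as the domain label) is purely a hypothetical stand-in, and the sketch is Python rather than an SA rule.

```python
import re

# Hypothetical address shape: \1 must repeat whatever the first group captured.
SAME_LOCAL_AND_DOMAIN = re.compile(r"<([a-z0-9]{1,40})@\1\.com>", re.IGNORECASE)

headers = [
    "From: <shibatec@shibatec.com>",  # made-up address that repeats itself
    "From: <mike@shibatec.com>",      # made-up address that does not
]
for header in headers:
    print(header, bool(SAME_LOCAL_AND_DOMAIN.search(header)))
```

The group is captured and compared within a single match attempt, so no substitution is involved.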

> On 11/2/07, John D. Hardin <[EMAIL PROTECTED]> wrote:
>
> >   header XX From =~ //




Re: New (to me) spam pattern

2007-11-02 Thread Mike Kenny
Thanks John, I had tried this. It appears that the \1 is not defined within
the pattern. Only for substitution?

mike

On 11/2/07, John D. Hardin <[EMAIL PROTECTED]> wrote:
>
> On Fri, 2 Nov 2007, Mike Kenny wrote:
>
> > I have a number of users that are receiving spam of varying types. The
> > only common factor is the from address. This looks like
> >
> > from=<[EMAIL PROTECTED]>
> >
> > where sX.com looks like it is a genuine site name, e.g.
> > shibatec.com
> > southstreetfinancial.com
> > skiprockmultimedia.com
> >
> > etc.
> >
> > What I need (I think) is a perl regex that will match the above
> > pattern. This is beyond my experience, can anybody assist me?
>
> Backreferences.
>
> Try this - I haven't had a chance to test it yet:
>
>   header XX From =~ //
>


Re: New (to me) spam pattern

2007-11-02 Thread John D. Hardin
On Fri, 2 Nov 2007, Mike Kenny wrote:

> I have a number of users that are receiving spam of varying types. The only
> common factor is the from address. This looks like
> 
> from=<[EMAIL PROTECTED]>
> 
> where sX.com looks like it is a genuine site name, e.g.
> shibatec.com
> southstreetfinancial.com
> skiprockmultimedia.com
> 
> etc.
> 
> What I need (I think) is a perl regex that will match the above
> pattern. This is beyond my experience, can anybody assist me?

Backreferences.

Try this - I haven't had a chance to test it yet:

  header XX From =~ //




New (to me) spam pattern

2007-11-02 Thread Mike Kenny
I have a number of users that are receiving spam of varying types. The only
common factor is the from address. This looks like

from=<[EMAIL PROTECTED]>

where sX.com looks like it is a genuine site name, e.g.
shibatec.com
southstreetfinancial.com
skiprockmultimedia.com

etc.

What I need (I think) is a perl regex that will match the above pattern. This
is beyond my experience; can anybody assist me?
Or offer another alternative to block these spams?

thanks

mike