Re: Underscores
On Sat, 18 Jul 2009, twofers wrote:

> I am mainly using the rule to check the header subject, I haven't added it to a body check.
>
> So, between the 3 choices:
> 1. /(?:[^_]{1,30}_+){5}/
> 2. /\S+_+\S+_+\S+/
> 3. R02 /^\S{30,}$/m
>
> Which covers the most territory given the example I submitted?
>
> I'm basically interested in identifying those garbage subject lines laced with characters like underscores, periods, hyphens, semi-colons, etc; so rather than use several rules to trap those individual characters, maybe there is a more effective way to resolve this.

Your original example only included underscores. Try this:

  header XX Subject =~ /(?:[[:alnum:]]{1,30}[^[:alnum:]\s]{1,5}){5}/i

--
John Hardin KA7OHZ    http://www.impsec.org/~jhardin/
jhar...@impsec.org    FALaholic #11174    pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
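To get a feel for what that header rule would catch, here is a rough Python sketch. It is not SpamAssassin itself: Python's `re` has no POSIX classes, so `[[:alnum:]]` is written as `[A-Za-z0-9]` (an ASCII-only assumption), and all sample subjects except the first are made up for illustration.

```python
import re

# Python's re module has no POSIX classes, so [[:alnum:]] is spelled out as
# [A-Za-z0-9] here. That ASCII-only reading is an assumption of this sketch.
PUNCT_LACED = re.compile(r"(?:[A-Za-z0-9]{1,30}[^A-Za-z0-9\s]{1,5}){5}")

def looks_punct_laced(subject: str) -> bool:
    """True if five 'alnum run + punctuation run' pairs occur back to back."""
    return PUNCT_LACED.search(subject) is not None

# Hypothetical subjects, apart from the first, which is the thread's example.
samples = [
    "This_sentenance_has_an_underscore_after_every_word",
    "Buy.cheap-meds;now_today.ok",
    "Re: Underscores",
    "Meeting notes for 2009-07-18",
]

for s in samples:
    print(looks_punct_laced(s), s)
```

Because whitespace is excluded from the punctuation class, ordinary subjects with a few hyphens or a colon do not reach five consecutive pairs, while subjects glued together with mixed punctuation do.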
Re: Underscores
I am mainly using the rule to check the header subject, I haven't added it to a body check.

So, between the 3 choices:
1. /(?:[^_]{1,30}_+){5}/
2. /\S+_+\S+_+\S+/
3. R02 /^\S{30,}$/m

Which covers the most territory given the example I submitted?

I'm basically interested in identifying those garbage subject lines laced with characters like underscores, periods, hyphens, semi-colons, etc; so rather than use several rules to trap those individual characters, maybe there is a more effective way to resolve this.

Thanks,
Wes
Re: Underscores
On Thu, 16 Jul 2009, Karsten Bräckelmann wrote:

>> Whoops! Make that: /(?:[^_]{1,30}_+){5}/
>
> Better. ;) However, while that indeed eliminates excessive backtracking as \S or \w results in (since they contain the underscore), this doesn't match "words ending in underscores". A non-underscore [^_] includes space, punctuation, and any other unwanted char.
>
> Exactly _five_ occurrences of an '_' underscore, with up to 30 _random_ chars in between. This paragraph matches. :)

Sorry. I lost sight of that part...

  /(?:[^_\s]{1,30}_+){5}/

--
John Hardin KA7OHZ    http://www.impsec.org/~jhardin/
jhar...@impsec.org    FALaholic #11174    pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
You know things are bad when Pravda says we [the USA] have gone too far to the left. -- Joe Huffman
---
Today: the 64th anniversary of the dawn of the Atomic Age
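The difference between the two patterns can be checked with a small Python sketch (the regex syntax used here behaves the same in Python's `re` as in Perl; the variable names are just for illustration). The prose sample is the sentence from the quoted message, the other is the original poster's example.

```python
import re

# [^_] also accepts spaces and punctuation between the underscores.
LOOSE = re.compile(r"(?:[^_]{1,30}_+){5}")
# [^_\s] refuses whitespace, so the five word_ groups must be contiguous.
TIGHT = re.compile(r"(?:[^_\s]{1,30}_+){5}")

prose = ("Exactly _five_ occurrences of an '_' underscore, "
         "with up to 30 _random_ chars in between.")
spammy = "This_sentenance_has_an_underscore_after_every_word"

results = {
    "loose/prose": bool(LOOSE.search(prose)),
    "tight/prose": bool(TIGHT.search(prose)),
    "loose/spammy": bool(LOOSE.search(spammy)),
    "tight/spammy": bool(TIGHT.search(spammy)),
}

for name, hit in results.items():
    print(f"{name}: {hit}")
```

The loose version fires on the ordinary sentence because the gaps between its five underscores are all under 30 characters; the tightened version only fires on the run-together example.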
Re: [sa] Re: Underscores
On Thu, 2009-07-16 at 11:08 -0400, Charles Gregory wrote:
> Given that OP said the entire *line* was word-underscore-word-underscore, then why not just:
>
>   body R01 /^\w{30,}$/m

Indeed, it really depends on what *exactly* the rule should match.

> Or perhaps the OP wasn't clear on whether 'word' might contain other punctuation, and so we might simply use:
>
>   body R02 /^\S{30,}$/m

This one also matches a long-ish URL on a line of its own.

> I might add \s* at the end of the rule, just in case of trailing spaces...

Keep in mind that with body rules, the body is *rendered*: whitespace is normalized, and *paragraphs* are re-flowed to a single string with embedded newlines stripped. For instance, this very paragraph is a single ^line$ as far as body REs are concerned.

--
char *t="\10pse\0r\0dtu...@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4"; main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
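A toy illustration of the point about rendered bodies, in Python. The `render` helper below is only a simplified stand-in written for this sketch (blank-line-separated paragraphs joined into one line each, whitespace collapsed); it is not SpamAssassin's actual rendering code, and the sample mail text and URL are made up.

```python
import re

R01 = re.compile(r"^\w{30,}$", re.M)
R02 = re.compile(r"^\S{30,}$", re.M)

def render(raw: str) -> str:
    """Toy stand-in for body rendering: each blank-line-separated paragraph
    becomes one line, with runs of whitespace collapsed to a single space."""
    paragraphs = re.split(r"\n\s*\n", raw.strip())
    return "\n".join(" ".join(p.split()) for p in paragraphs)

raw = (
    "This_sentenance_has_an_underscore_after_every_word\n"
    "\n"
    "Some ordinary prose that wraps\n"
    "over two lines in the raw mail.\n"
    "\n"
    "http://www.example.com/a/rather/long/path/index.html\n"
)

body = render(raw)
r01_hits = R01.findall(body)   # lines made of 30+ word characters only
r02_hits = R02.findall(body)   # lines made of 30+ non-space characters

print("R01:", r01_hits)
print("R02:", r02_hits)
```

Under this model the wrapped prose becomes one long line containing spaces, so neither rule matches it, while R02 picks up the bare URL in addition to the underscore line.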
Re: [sa] Re: Underscores
On Thu, 16 Jul 2009, Karsten Bräckelmann wrote:

>> /(?:[^_]{1,30}_+){5}/
>
> Better. ;) However, while that indeed eliminates excessive backtracking as \S or \w results in (since they contain the underscore), this doesn't match "words ending in underscores". A non-underscore [^_] includes space, punctuation, and any other unwanted char.

Given that OP said the entire *line* was word-underscore-word-underscore, then why not just:

  body R01 /^\w{30,}$/m

Or perhaps the OP wasn't clear on whether 'word' might contain other punctuation, and so we might simply use:

  body R02 /^\S{30,}$/m

I might add \s* at the end of the rule, just in case of trailing spaces...

- C
Re: Underscores
> Whoops! Make that:
>
>   /(?:[^_]{1,30}_+){5}/

Better. ;) However, while that indeed eliminates excessive backtracking as \S or \w results in (since they contain the underscore), this doesn't match "words ending in underscores". A non-underscore [^_] includes space, punctuation, and any other unwanted char.

Exactly _five_ occurrences of an '_' underscore, with up to 30 _random_ chars in between. This paragraph matches. :)

--
char *t="\10pse\0r\0dtu...@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4"; main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
Re: Underscores
On Thu, 2009-07-16 at 06:27 -0700, John Hardin wrote:
> How about:
>
>   /(?:[^_]{1,30}_+){1,5}/

Whoops! Make that:

  /(?:[^_]{1,30}_+){5}/

--
John Hardin KA7OHZ    http://www.impsec.org/~jhardin/
jhar...@impsec.org    FALaholic #11174    pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
Re: Underscores
From: Matt Kettler
Date: Thu, 16 Jul 2009 08:52:50 -0400

twofers wrote:
> How can I pattern match when every word has an underscore after it.
> Example:
> This_sentenance_has_an_underscore_after_every_word
>
> I'm not really good at Perl pattern matching, but \w and \W see an underscore as a word character, so I'm just not sure what might work.
>
> body =~ /^([a-z]+_+)+/i
>
> Is that something that will work effectively?

Is this for a spam rule? I'd do something like this:

  body MY_UNDERSCORES /\S+_+\S+_+\S+/

Unless you really want to restrict it to A-Z.

Regardless, ending any regex in + in a SA rule is redundant. Since + allows a one-instance match, it will devolve to that. You don't need to match the entire line with your rule, so the extra matches are redundant. It will match the first instance, and that's all it needs to be a match.

Also any regex ending in * should just have its last element removed, as that will devolve to a zero-count match.

The /\S+_+\S+_+\S+/ rule will match lots of technical email, for example discussions of shell environment variables like LD_LIBRARY_PATH.

-jeff
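The false-positive concern is easy to reproduce with a quick Python sketch. The "ham" sentence is invented for illustration; the second pattern is the five-in-a-row variant that appears elsewhere in this thread, included only for comparison.

```python
import re

MY_UNDERSCORES = re.compile(r"\S+_+\S+_+\S+")
FIVE_IN_A_ROW = re.compile(r"(?:[^_\s]{1,30}_+){5}")

# Hypothetical legitimate sentence from a technical discussion.
ham = "Remember to export LD_LIBRARY_PATH before starting the daemon."
spam = "This_sentenance_has_an_underscore_after_every_word"

hits = {
    "MY_UNDERSCORES/ham": bool(MY_UNDERSCORES.search(ham)),
    "MY_UNDERSCORES/spam": bool(MY_UNDERSCORES.search(spam)),
    "FIVE_IN_A_ROW/ham": bool(FIVE_IN_A_ROW.search(ham)),
    "FIVE_IN_A_ROW/spam": bool(FIVE_IN_A_ROW.search(spam)),
}

for name, hit in hits.items():
    print(f"{name}: {hit}")
```

Any token with two underscores satisfies the first pattern, which is why environment variable names trip it; requiring five consecutive word_ groups leaves such names alone in this sample.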
Re: Underscores
On Thu, 2009-07-16 at 08:52 -0400, Matt Kettler wrote:
> twofers wrote:
> > How can I pattern match when every word has an underscore after it.
> > Example:
> > This_sentenance_has_an_underscore_after_every_word
> >
> > body =~ /^([a-z]+_+)+/i
>
> I'd do something like this:
>
>   body MY_UNDERSCORES /\S+_+\S+_+\S+/

That's quite a lot of backtracking, no?

How about:

  /(?:[^_]{1,30}_+){1,5}/

--
John Hardin KA7OHZ    http://www.impsec.org/~jhardin/
jhar...@impsec.org    FALaholic #11174    pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
Re: Underscores
twofers wrote:
> How can I pattern match when every word has an underscore after it.
> Example:
> This_sentenance_has_an_underscore_after_every_word
>
> I'm not really good at Perl pattern matching, but \w and \W see an underscore as a word character, so I'm just not sure what might work.
>
> body =~ /^([a-z]+_+)+/i
>
> Is that something that will work effectively?
>
> Thanks.
>
> Wes

I'd do something like this:

  body MY_UNDERSCORES /\S+_+\S+_+\S+/

Unless you really want to restrict it to A-Z.

Regardless, ending any regex in + in a SA rule is redundant. Since + allows a one-instance match, it will devolve to that. You don't need to match the entire line with your rule, so the extra matches are redundant. It will match the first instance, and that's all it needs to be a match.

Also any regex ending in * should just have its last element removed, as that will devolve to a zero-count match.
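The point about trailing quantifiers can be checked mechanically. The Python sketch below compares the original pattern with a trimmed one that keeps only a single required instance; since a rule only asks "does it match at all", the two should agree on every input. The sample strings other than the first are made up.

```python
import re

# Original suggestion: one or more word_ groups anchored at line start.
FULL = re.compile(r"^([a-z]+_+)+", re.I)
# Trimmed: the trailing + quantifiers are dropped, one instance is enough.
TRIMMED = re.compile(r"^[a-z]+_", re.I)

samples = [
    "This_sentenance_has_an_underscore_after_every_word",
    "plain subject",
    "one_underscore only",
    "_leading",
    "",
]

# Pair up the yes/no verdict of each pattern for every sample.
verdicts = [(bool(FULL.search(s)), bool(TRIMMED.search(s))) for s in samples]

for s, (full, trimmed) in zip(samples, verdicts):
    print(f"{s!r}: full={full} trimmed={trimmed}")
```

The two patterns differ in how much text they consume, but not in whether they match, which is all a rule hit depends on.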
Underscores
How can I pattern match when every word has an underscore after it.

Example:
This_sentenance_has_an_underscore_after_every_word

I'm not really good at Perl pattern matching, but \w and \W see an underscore as a word character, so I'm just not sure what might work.

  body =~ /^([a-z]+_+)+/i

Is that something that will work effectively?

Thanks.

Wes
Re: 2.64 - SUBJ_HAS_UNIQ_ID - incorrect interpretation of underscores??
I just finally turned this rule off. For some reason it has started triggering on a whole lot of my normal mail, which isn't useful and is creating a bunch of FPs. I don't think I've ever seen it trigger on spam... :-) Loren
Re: 2.64 - SUBJ_HAS_UNIQ_ID - incorrect interpretation of underscores??
At 08:30 PM 12/10/2004, Matt Kettler wrote:
>The rule doesn't do very well anyway:
>
>  1.039   1.1433   0.1190   0.906   0.73   0.90  SUBJ_HAS_UNIQ_ID
>
>Hence the <1 score it receives.

Perhaps this is a decent chunk of why the rule doesn't perform well. It might be worth looking into modifying that regex in the eval to try to get better performance, or splitting them up so you can test each separately...

Never mind. Looking at my most recent 300 spams, only one matched, and that didn't have a UNIQ_ID:

  Subject: {SPAM} 0rder your meds" today`

It doesn't look like spammers use UNIQ IDs in the subject lines often anymore. The only one I did find doesn't match the rule:

  Subject: {SPAM} STOP_PAYING_FOR YOUR Cable_Movies e6pgu5

There are others posing as shipment notices, but the rule tries to skip them on purpose:

  Subject: {SPAM} Fedex Ship Notification, Tracking Number : VBN24530946 - 40352TZLP
  Subject: {SPAM} Fedex Delivery Confirmation, Tracking Number : ITZ65070066405343DJCK
Re: 2.64 - SUBJ_HAS_UNIQ_ID - incorrect interpretation of underscores??
At 04:06 PM 12/10/2004, Theo Van Dinter wrote:
> It's not simply a hyphenated word. It looks like two long sets of characters with a hyphen in the middle, which is the exact same thing as a unique id.
>
> The rule doesn't do very well anyway:
>
>   1.039   1.1433   0.1190   0.906   0.73   0.90  SUBJ_HAS_UNIQ_ID
>
> Hence the <1 score it receives.

Perhaps this is a decent chunk of why the rule doesn't perform well. It might be worth looking into modifying that regex in the eval to try to get better performance, or splitting them up so you can test each separately...
Re: 2.64 - SUBJ_HAS_UNIQ_ID - incorrect interpretation of underscores??
On Fri, Dec 10, 2004 at 08:31:57PM +0100, Per Jessen wrote:
> > No. The issue is that "cumulus-bonuspunkten" looks like an ID tag.
>
> Should SUBJ_HAS_UNIQ_ID really fire on that - simply a hyphenated word? There are plenty of those around (although fewer in German than in English).

It's not simply a hyphenated word. It looks like two long sets of characters with a hyphen in the middle, which is the exact same thing as a unique id.

The rule doesn't do very well anyway:

  1.039   1.1433   0.1190   0.906   0.73   0.90  SUBJ_HAS_UNIQ_ID

Hence the <1 score it receives.

--
Randomly Generated Tagline:
"Linux poses a real challenge for those with a taste for late-night hacking (and/or conversations with God)." (By Matt Welsh)
Re: 2.64 - SUBJ_HAS_UNIQ_ID - incorrect interpretation of underscores??
Theo Van Dinter wrote:
> On Fri, Dec 10, 2004 at 01:25:43PM +0100, Per Jessen wrote:
>> Why does SUBJ_HAS_UNIQ_ID fire on this subject:
>>
>> Subject: =?iso-8859-1?Q?MIGROL_Heiz=F6l-Angebot_mit_Cumulus-Bonuspunkten?=
>>
>> Is this a bug in the RFC2047 decoding in SA 2.64?
>
> No. The issue is that "cumulus-bonuspunkten" looks like an ID tag.

Should SUBJ_HAS_UNIQ_ID really fire on that - simply a hyphenated word? There are plenty of those around (although fewer in German than in English).

--
Per Jessen, Zurich
Let your spam stop here -- http://www.spamchek.com
Re: 2.64 - SUBJ_HAS_UNIQ_ID - incorrect interpretation of underscores??
On Fri, Dec 10, 2004 at 01:25:43PM +0100, Per Jessen wrote:
> Why does SUBJ_HAS_UNIQ_ID fire on this subject:
>
> Subject: =?iso-8859-1?Q?MIGROL_Heiz=F6l-Angebot_mit_Cumulus-Bonuspunkten?=
>
> Is this a bug in the RFC2047 decoding in SA 2.64?

No. The issue is that "cumulus-bonuspunkten" looks like an ID tag.

--
Randomly Generated Tagline:
"Any similarity to person/persons now living to anyone or thing, dead or undead, is entirely accidental and just one more irrefutable proof of the paranormal." - From the 7th Guest
2.64 - SUBJ_HAS_UNIQ_ID - incorrect interpretation of underscores??
Why does SUBJ_HAS_UNIQ_ID fire on this subject:

  Subject: =?iso-8859-1?Q?MIGROL_Heiz=F6l-Angebot_mit_Cumulus-Bonuspunkten?=

It looks as if SA mistakenly interprets the underscores as literal underscores, which in an RFC2047 encoded string they are not (http://rfc.net/rfc2047.html).

Is this a bug in the RFC2047 decoding in SA 2.64?

--
Per Jessen, Zurich
Let your spam stop here -- http://www.spamchek.com
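For reference, in RFC 2047 "Q" encoding an underscore stands for a space. The sketch below uses Python's standard `email.header` module to show what a conforming decoder makes of this subject; it says nothing about what SA 2.64 itself did internally.

```python
from email.header import decode_header

raw = "=?iso-8859-1?Q?MIGROL_Heiz=F6l-Angebot_mit_Cumulus-Bonuspunkten?="

# decode_header returns (data, charset) pairs; encoded words come back
# as bytes plus the charset name declared in the header.
chunks = decode_header(raw)
subject = "".join(
    data.decode(charset or "ascii") if isinstance(data, bytes) else data
    for data, charset in chunks
)

print(subject)
```

After decoding, the underscores are gone and only the hyphenated words remain, which fits the explanation given in the replies: the rule reacts to the hyphenated "Cumulus-Bonuspunkten", not to the underscores.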