On Sat, 29 Jul 2006, John D. Hardin wrote:
On Sat, 29 Jul 2006, Loren Wilton wrote:
From: Rory [mailto:[EMAIL PROTECTED]
From: Barbra [mailto:[EMAIL PROTECTED]

Something like

header FROMFROM    =~ /[A-Z]\w+ \[mailto\: \w+\.\w+\@/

There is a way to be more specific, but it costs considerably
more.

Namely:

  header   FROM_REPEAT  From =~ /\b(\w{1,20})\.\1\@/

Incorrect results returned quickly are useless.

Adding a test for a single-word unquoted display name would reduce the
cost as the RE engine wouldn't get to the expensive backreference

Ironically, and somewhat amusingly, the spammer has probably
made the backreference less expensive by marking the boundary
between the repeated strings with a period.  If the (relevant
part of the) addresses the spammer generated had looked
like this:

        cardiaccardiac
        adjudgeadjudge

that would have been more annoying to match than what they
actually used:

        cardiac.cardiac
        adjudge.adjudge

The reason is that you know exactly where the repeated string
(if it exists) must start, so all that is necessary is for the
regex engine to collect everything up until the period,
then do a single check to see if there is a match (a check
that will usually fail on the very first character when the
engine is replaying its "tape" of the backreferenced string).
And that's O(N) (where N is the number of characters) in the
worst case, so not bad at all.
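
To make the period-anchored check concrete, here is a minimal sketch in
Python. It reuses the pattern from the FROM_REPEAT rule quoted above; the
function name has_dotted_repeat and the example.com addresses are
hypothetical, chosen only for illustration, and Python's re module stands
in for the Perl regex engine that SpamAssassin actually uses.

```python
import re

# Same pattern as the FROM_REPEAT rule, written for Python's re module.
# The group collects word characters up to the period; \1 then replays
# the captured text once, immediately after the period.
FROM_REPEAT = re.compile(r"\b(\w{1,20})\.\1@")

def has_dotted_repeat(address):
    """Return True if the address contains a 'word.word@' repeat.

    Hypothetical helper name, for illustration only.
    """
    return FROM_REPEAT.search(address) is not None
```

On an ordinary address such as john.smith@example.com the replay of the
captured "john" fails on the first character after the period, which is
the cheap early failure described above.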

Actually, even without the period marking the spot where the
repeat will start it is easy in theory to efficiently match
these strings:  the repeat must start exactly in the middle
(since if a string is repeated, the repeat will be the same
length as the first occurrence).  If you have a string which is
2*N characters long and you want to see if the last N characters
are a repeat of the first N characters, you start by comparing
character 0 with character N, then compare character 1 with
character N+1, etc.  But whether the regex engine would
ever use that technique is doubtful.  So, even though it's
possible in theory to match it efficiently without the "." as
a marker, the spammer has chosen a format that's relatively
easy to recognize.
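
The midpoint comparison described above can be sketched outside a regex
engine in a few lines of Python. This is only an illustration of the idea,
assuming the candidate string (for example, the local part of the address)
has already been isolated; the function name is_doubled is hypothetical.

```python
def is_doubled(s):
    """Return True if s is some non-empty string repeated exactly twice.

    Hypothetical helper, for illustration only. The repeat can only
    start at the midpoint, so one pass over the first half suffices.
    """
    n, odd = divmod(len(s), 2)
    if odd or n == 0:
        # Odd-length or empty strings cannot be an exact doubling.
        return False
    # Compare character i with character n + i for every i in the first half.
    return all(s[i] == s[n + i] for i in range(n))
```

The loop makes at most N comparisons for a string of 2*N characters and
stops at the first mismatch, so it never has to guess where the repeat
begins.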

  - Logan
