On Wed, 26 Sep 2012, Martin Gregorie wrote:

On Tue, 2012-09-25 at 22:12 -0700, John Hardin wrote:
I'm thinking something like this, using what you presented as an example:

Generated internal pseudo-header:
    X-Spam-URL:    
http://www.probono.fr/95280_pdf|http://www.youtube.com/watch?v=3VvOFqaHbL5&feature=g-vrec&feature=g-vrec

...basically, URL-part|displayed-text-part

(Suggestions for a more appropriate delimiter than "|" are solicited...)

Repeat the header for each URL found that has displayed text; only
include those where the displayed text is not the same as the URL.

Then you could write a simple bounded rule like:

header  YT_LINK_SPOOF  X-Spam-URL =~ m,\|https?://[^/]*youtube\.com/watch,i

As long as you're already capturing the data, it might be useful to
generate _two_ pseudo-headers per URL; add X-Spam-URL-DomainOnly with
URL-domain|displayed-text-domain only if the displayed text looks like a
URL. For example:

     X-Spam-URL-DomainOnly:   www.probono.fr|www.youtube.com
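
Given that header, a bounded rule along these lines could flag any entry
whose two domain halves differ (a sketch; the rule name is hypothetical,
and it assumes each header instance is matched individually):

     header   URI_DOM_MISMATCH   X-Spam-URL-DomainOnly =~ m{^([^|]+)\|(?!\1$)}

The backreference in the negative lookahead makes the rule miss whenever
the display-text domain exactly matches the href domain, so legitimate
self-referencing links stay clean.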

Of the two, I would prefer the second

I'm not really proposing alternatives; I suggest doing both. They serve different (but related) needs.

apart from the problem of matching the two halves if/when there is more than one URL in a message.

I'm not following what you mean here; could you explain that in a bit more detail? I'm not proposing a hash keyed by URL. The same URL could appear multiple times in the message with different descriptions, and each occurrence would generate URL headers (modulo suppression of exact duplicates).

Either way, this type of test would only work on an HTML body part.

That was an unstated assumption on my part, as that's the only context where this sort of "obfuscation" is even possible.

That said, I have another suggestion: if the HTML parser can build an associative array, using the URL as the element's key and the display text as the value, it would be easy to use a plugin to compare the key and value.

Rules are easier to write on an ad-hoc basis than are plugins. My thinking was to let the plugin do the difficult part of extracting the data from the HTML and quoted-unreadable markup and present it to the rules in a standardized, easy-to-use form. Then a header rule (bounded) can be written to perform whatever further analysis is desired or suggested by spammer practices.
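
To make that concrete, here is a rough sketch in Python (illustrative only; the real thing would be a Perl plugin inside SpamAssassin, and would also have to undo quoted-printable encoding first). The header names are the ones proposed above; the suppression rules follow the thread, including ignoring a missing http:// prefix in the display text:

```python
from html.parser import HTMLParser
from urllib.parse import urlparse
import re

def _strip_scheme(s):
    """Treat 'www.example.com' and 'http://www.example.com' as equal."""
    return re.sub(r"^https?://", "", s, flags=re.I)

class LinkExtractor(HTMLParser):
    """Collect (href, display-text) pairs where the text differs from the URL."""
    def __init__(self):
        super().__init__()
        self._href = None
        self._text = []
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            # Suppress pairs where the text matches the URL (modulo scheme)
            if text and _strip_scheme(text) != _strip_scheme(self._href):
                self.pairs.append((self._href, text))
            self._href = None

def pseudo_headers(html):
    """Emit the proposed X-Spam-URL / X-Spam-URL-DomainOnly lines."""
    parser = LinkExtractor()
    parser.feed(html)
    headers = []
    for url, text in parser.pairs:
        line = "X-Spam-URL: %s|%s" % (url, text)
        if line not in headers:  # suppress exact duplicates
            headers.append(line)
        # Second header only when the display text itself looks like a URL
        if re.match(r"(?:https?://)?[\w.-]+\.[a-z]{2,}", text, re.I):
            text_dom = urlparse(text if "//" in text else "//" + text).netloc
            headers.append("X-Spam-URL-DomainOnly: %s|%s"
                           % (urlparse(url).netloc, text_dom))
    return headers
```

A rule like YT_LINK_SPOOF above would then just pattern-match the emitted header values.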

Alternatively, a new type of rule could be added to handle the comparison.

Why? Adding extracted data as a pseudo-header is already the approach used for Received header analysis; why add an entirely new rule type for this?

Note that a simple match/nomatch comparison would not hack it, because tags of the form
<a href="http://www.example.com">My website</a>
should always be accepted, and so should
<a href="http://www.example.com">www.example.com</a>

Agreed. That's why I proposed two forms, and only capturing where the description differs from the URL. The first example would only appear in the full-URL header; the second wouldn't appear in either (as noted below).

IOW, the comparison should only generate a hit if both halves of the tag
hold a URI

That prohibits possibly useful analysis of the descriptions applied to URLs. I wouldn't want to assume that data will never be useful.

and the second half may legitimately omit the http:// or https:// prefix.

Agreed.

On Wed, 26 Sep 2012, Axb wrote:

have you looked at the URIDetail plugin ?

Not yet, thanks for pointing that out.
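
For the archives: if URIDetail exposes each URI's raw form and anchor text to rules (as its documentation describes), the YouTube case might already be writable today with something like this (rule name hypothetical):

     loadplugin Mail::SpamAssassin::Plugin::URIDetail
     uri_detail YT_LINK_SPOOF  text =~ /youtube\.com/i  raw !~ /youtube\.com/i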

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  Show me somebody who waxes poetic about "Being at one with Nature"
  and I'll show you someone who hasn't figured out that Nature is an
  infinite stomach demanding to be fed.     -- Atomic, at Wapsi forum
-----------------------------------------------------------------------
 118 days since the first successful private support mission to ISS (SpaceX)
