On Wed, 26 Sep 2012, Martin Gregorie wrote:

On Tue, 2012-09-25 at 22:12 -0700, John Hardin wrote:
I'm thinking something like this, using what you presented as an example:

Generated internal pseudo-header:
    X-Spam-URL:    
http://www.probono.fr/95280_pdf|http://www.youtube.com/watch?v=3VvOFqaHbL5&feature=g-vrec&feature=g-vrec

...basically, URL-part|displayed-text-part

(Suggestions for a more appropriate delimiter than "|" are solicited...)

Repeat the header for each URL found that has displayed text; only
include those where the displayed text is not the same as the URL.

Then you could write a simple bounded rule like:

header  YT_LINK_SPOOF  X-Spam-URL =~ m,\|https?://[^/]*youtube\.com/watch,i

As long as you're already capturing the data, it might be useful to
generate _two_ pseudo-headers per URL; add X-Spam-URL-DomainOnly with
URL-domain|displayed-text-domain only if the displayed text looks like a
URL. For example:

     X-Spam-URL-DomainOnly:   www.probono.fr|www.youtube.com
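
Given that header, a bounded rule along these lines could flag any entry
whose two domain halves differ (a sketch; the rule name is hypothetical,
and it assumes each header instance is matched individually):

     header   URI_DOM_MISMATCH   X-Spam-URL-DomainOnly =~ m{^([^|]+)\|(?!\1$)}

The backreference in the negative lookahead makes the rule miss whenever
the display-text domain exactly matches the href domain, so legitimate
self-referencing links stay clean.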

Of the two, I would prefer the second

I'm not really proposing alternatives; I suggest doing both. They serve different (but related) needs.

apart from the problem of matching the two halves if/when there is more than one URL in a message.

I'm not following what you mean here; could you explain that in a bit more detail? I'm not proposing a hash keyed by URL. The same URL could appear multiple times in the message with different descriptions, and each occurrence would generate URL headers (modulo suppression of exact duplicates).

Either way, this type of test would only work on an HTML body part.

That was an unstated assumption on my part, as that's the only context where this sort of "obfuscation" is even possible.

That said, I have another suggestion: if the HTML parser can build an associative array, using the URL as the element's key and the display text as the value, it would be easy to use a plugin to compare the key and value.

Rules are easier to write on an ad-hoc basis than are plugins. My thinking was to let the plugin do the difficult part of extracting the data from the HTML and quoted-unreadable markup and present it to the rules in a standardized, easy-to-use form. Then a header rule (bounded) can be written to perform whatever further analysis is desired or suggested by spammer practices.
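
To make that concrete, here is a rough sketch in Python (illustrative only; the real thing would be a Perl plugin inside SpamAssassin, and would also have to undo quoted-printable encoding first). The header names are the ones proposed above; the suppression rules follow the thread, including ignoring a missing http:// prefix in the display text:

```python
from html.parser import HTMLParser
from urllib.parse import urlparse
import re

def _strip_scheme(s):
    """Treat 'www.example.com' and 'http://www.example.com' as equal."""
    return re.sub(r"^https?://", "", s, flags=re.I)

class LinkExtractor(HTMLParser):
    """Collect (href, display-text) pairs where the text differs from the URL."""
    def __init__(self):
        super().__init__()
        self._href = None
        self._text = []
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            # Suppress pairs where the text matches the URL (modulo scheme)
            if text and _strip_scheme(text) != _strip_scheme(self._href):
                self.pairs.append((self._href, text))
            self._href = None

def pseudo_headers(html):
    """Emit the proposed X-Spam-URL / X-Spam-URL-DomainOnly lines."""
    parser = LinkExtractor()
    parser.feed(html)
    headers = []
    for url, text in parser.pairs:
        line = "X-Spam-URL: %s|%s" % (url, text)
        if line not in headers:  # suppress exact duplicates
            headers.append(line)
        # Second header only when the display text itself looks like a URL
        if re.match(r"(?:https?://)?[\w.-]+\.[a-z]{2,}", text, re.I):
            text_dom = urlparse(text if "//" in text else "//" + text).netloc
            headers.append("X-Spam-URL-DomainOnly: %s|%s"
                           % (urlparse(url).netloc, text_dom))
    return headers
```

A rule like YT_LINK_SPOOF above would then just pattern-match the emitted header values.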

Alternatively, a new type of rule could be added to handle the comparison.

Why? Adding extracted data as a pseudo-header is already the approach used for Received header analysis; why add an entirely new rule type for this?

Note that a simple match/nomatch comparison would not hack it, because tags of the form
<a href="http://www.example.com">My website</a>
should always be accepted, and so should
<a href="http://www.example.com">www.example.com</a>

Agreed. That's why I proposed two forms, and only capturing where the description differs from the URL. The first example would only appear in the full-URL header; the second wouldn't appear in either (as noted below).

IOW, the comparison should only generate a hit if both halves of the tag
hold a URI

That prohibits possibly useful analysis of the descriptions applied to URLs. I wouldn't want to assume that data will never be useful.

and the second half may legitimately omit the http:// or https:// prefix.

Agreed.

On Wed, 26 Sep 2012, Axb wrote:

have you looked at the URIDetail plugin ?

Not yet, thanks for pointing that out.
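
For the archives: if URIDetail exposes each URI's raw form and anchor text to rules (as its documentation describes), the YouTube case might already be writable today with something like this (rule name hypothetical):

     loadplugin Mail::SpamAssassin::Plugin::URIDetail
     uri_detail YT_LINK_SPOOF  text =~ /youtube\.com/i  raw !~ /youtube\.com/i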

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  Show me somebody who waxes poetic about "Being at one with Nature"
  and I'll show you someone who hasn't figured out that Nature is an
  infinite stomach demanding to be fed.     -- Atomic, at Wapsi forum
-----------------------------------------------------------------------
 118 days since the first successful private support mission to ISS (SpaceX)
