On Wed, 26 Sep 2012, Martin Gregorie wrote:
> On Tue, 2012-09-25 at 22:12 -0700, John Hardin wrote:
>> I'm thinking something like this, using what you presented as an example:
>>
>> Generated internal pseudo-header:
>>
>> X-Spam-URL:
>> http://www.probono.fr/95280_pdf|http://www.youtube.com/watch?v=3VvOFqaHbL5&feature=g-vrec&feature=g-vrec
>>
>> ...basically, URL-part|displayed-text-part
>>
>> (Suggestions for a more appropriate delimiter than "|" are solicited...)
>>
>> Repeat the header for each URL found that has displayed text; only
>> include those where the displayed text is not the same as the URL.
>>
>> Then you could write a simple bounded rule like:
>>
>>    header YT_LINK_SPOOF X-Spam-URL m,\|https?://[^/]*youtube\.com/watch,i
>>
>> As long as you're already capturing the data, it might be useful to
>> generate _two_ pseudo-headers per URL; add X-Spam-URL-DomainOnly with
>> URL-domain|displayed-text-domain only if the displayed text looks like a
>> URL. For example:
>>
>>    X-Spam-URL-DomainOnly: www.probono.fr|www.youtube.com
> Of the two, I would prefer the second
I'm not really proposing alternatives; I suggest doing both. They serve
different (but related) needs.
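To make the proposal concrete: a SpamAssassin plugin doing this would be written in Perl, but the extraction logic can be sketched in Python. Neither X-Spam-URL nor X-Spam-URL-DomainOnly exists today; they are the hypothetical headers proposed above, and the sample anchor tag and "looks like a URL" heuristic here are my own illustrative assumptions.

```python
# Sketch of the proposed pseudo-header generation (Python stand-in for
# what would really be a SpamAssassin Perl plugin; the X-Spam-URL and
# X-Spam-URL-DomainOnly headers are the hypothetical ones proposed in
# this thread, not anything SpamAssassin actually emits).
from html.parser import HTMLParser
from urllib.parse import urlparse
import re

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self._href = None
        self._text = []
        self.headers = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href:
            text = "".join(self._text).strip()
            # Only capture where the displayed text differs from the URL.
            if text and text != self._href:
                self.headers.append("X-Spam-URL: %s|%s" % (self._href, text))
                # DomainOnly variant: only when the displayed text itself
                # looks like a URL (crude heuristic, for the sketch only).
                if re.match(r"(?:https?://)?[\w.-]+\.\w+", text):
                    url_dom = urlparse(self._href).netloc
                    txt_dom = urlparse(
                        text if "://" in text else "http://" + text).netloc
                    self.headers.append(
                        "X-Spam-URL-DomainOnly: %s|%s" % (url_dom, txt_dom))
            self._href = None

p = LinkExtractor()
p.feed('<a href="http://www.probono.fr/95280_pdf">'
       'http://www.youtube.com/watch?v=example</a>')
for h in p.headers:
    print(h)
```

A link whose displayed text is plain prose ("My website") never gets a DomainOnly header, and a link whose text exactly equals its href generates nothing at all, matching the capture rules above.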
> apart from the problem of matching the two halves if/when there is more
> than one URL in a message.
I'm not following what you mean here; could you explain that in a bit more
detail? I'm not proposing a hash by URL. The same URL with different
descriptions could appear multiple times in the message, and each one
would generate URL headers (modulo suppression of exact duplicates).
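Given such a header, the bounded YT_LINK_SPOOF rule quoted above is just a regex over the header's value. Here is the same pattern exercised in Python against two sample values of the hypothetical X-Spam-URL header (the sample URLs are illustrative):

```python
import re

# The pattern from the proposed YT_LINK_SPOOF rule: a "|" delimiter
# followed by a youtube.com/watch URL, i.e. a link whose *displayed
# text* claims to be a YouTube watch page.
pat = re.compile(r"\|https?://[^/]*youtube\.com/watch", re.I)

# Displayed text claims YouTube, but the real URL goes elsewhere:
spoofed = "http://www.probono.fr/95280_pdf|http://www.youtube.com/watch?v=example"
# Displayed text is ordinary prose, not a URL:
honest = "http://www.youtube.com/watch?v=example|Watch this video"

print(bool(pat.search(spoofed)))
print(bool(pat.search(honest)))
```

Only the spoofed value hits, because only there does a YouTube URL appear *after* the delimiter, in the displayed-text half.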
> Either way, this type of test would only work on an HTML body part.
That was an unstated assumption on my part, as that's the only context
where this sort of "obfuscation" is even possible.
> That said, I have another suggestion: if the HTML parser can build an
> associative array, using the URL as the element's key and the text
> half as the value, it would be easy to use a plugin to compare the
> key and value.
Rules are easier to write on an ad-hoc basis than are plugins. My thinking
was to let the plugin do the difficult part of extracting the data from
the HTML and quoted-unreadable markup and present it to the rules in a
standardized, easy-to-use form. Then a header rule (bounded) can be
written to perform whatever further analysis is desired or suggested by
spammer practices.
> Alternatively, a new type of rule could be added to handle the
> comparison.
Why? Adding extracted data as a pseudo-header is already used for Received
header analysis; why add an entirely new rule type for this analysis?
> Note that a simple match/nomatch comparison would not hack it because
> tags of the form
>
>    <a href="http://www.example.com">My website</a>
>
> should always be accepted and so should
>
>    <a href="http://www.example.com">www.example.com</a>
Agreed. That's why I proposed two forms, and only capturing where the
description is different from the URL. The first one would only appear in
the full URL header, the second one wouldn't appear in either (as noted
below).
> IOW, the comparison should only generate a hit if both halves of the tag
> hold a URI
That prohibits possibly useful analysis of the description applied to a
URL. I don't want to predict that will never be useful data.
> and the second half may legitimately omit the http:// or https://
> prefix.
Agreed.
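The two constraints just agreed on — fire only when both halves are URIs, and tolerate a missing http:// or https:// prefix in the displayed half — amount to a small normalization step before comparing. A sketch in Python, with a hypothetical helper name and a deliberately crude "looks like a URI" test:

```python
from urllib.parse import urlparse
import re

# Crude "displayed text looks like a URI" test, scheme optional.
# (Illustrative only; a real implementation would be stricter.)
URLISH = re.compile(r"^(?:https?://)?[\w.-]+\.[a-z]{2,}(?:/\S*)?$", re.I)

def is_spoofed_link(href, text):
    """True only when the displayed text is itself a URI and its host
    differs from the real href's host (scheme prefix optional)."""
    text = text.strip()
    if not URLISH.match(text):
        return False  # plain descriptive text: never a hit
    norm = lambda u: urlparse(u if "://" in u else "http://" + u).netloc.lower()
    return norm(text) != norm(href)

# Both of the "must always be accepted" examples above pass:
print(is_spoofed_link("http://www.example.com", "My website"))
print(is_spoofed_link("http://www.example.com", "www.example.com"))
# ...while a mismatched pair is flagged:
print(is_spoofed_link("http://www.probono.fr/x", "www.youtube.com/watch"))
```

Descriptive text short-circuits before any comparison, so the full displayed text is still available (via the other pseudo-header) for whatever separate analysis of descriptions turns out to be useful.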
On Wed, 26 Sep 2012, Axb wrote:
> have you looked at the URIDetail plugin?
Not yet, thanks for pointing that out.
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
Show me somebody who waxes poetic about "Being at one with Nature"
and I'll show you someone who hasn't figured out that Nature is an
infinite stomach demanding to be fed. -- Atomic, at Wapsi forum
-----------------------------------------------------------------------
118 days since the first successful private support mission to ISS (SpaceX)