Testing markup tags, again

Aleksander Adamowski 28 May 2004 10:23:15 -0000

Hi!

I propose to revive the discussion started way back on February 18 ("Testing markup tags", "Semi-invisible font missed by SA").

There was a consensus that something is definitely wrong with SpamAssassin's HTML parsing when a spammer puts excessive line breaks inside an HTML FONT tag, between the attribute name ("color") and its value ("#FFFFsomething").

Back then, I published sample messages here:
http://olo.ab.altkom.pl/domowa/spam/samples/low_contrast/

The problem was that spammers use the following construct, aimed directly at SpamAssassin's HTML analysis, to bypass the test html_test('font_near_invisible') and, in effect, avoid triggering the rule HTML_FONT_LOW_CONTRAST:

<font color=

"#FFFFFB">some random text to fool Bayes</font>

The excessive line breaks between "color=" and "#FFFFFB" fool the parser into not detecting the presence of that attribute.

I analysed the SpamAssassin 2.63 code back then, on 23 Feb, and discovered that the SA code does indeed receive the string "color" instead of the hash-prefixed color code as the value of the "color" attribute.

Those messages keep coming, and some of them pass through SA without triggering HTML_FONT_LOW_CONTRAST, so I'm currently using a custom rule to give them an additional score:

rawbody LOC_HTMLSPLITFONT  /^\"?\#[a-z0-9]{6}\"?\>/i
describe LOC_HTMLSPLITFONT font color on separate line from font tag
score LOC_HTMLSPLITFONT    2.1 1.6 2.1 1.6

But this rule could produce false positives, so the ideal solution would be to make SpamAssassin parse those tags correctly using HTML::Parser.
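To show where the false positives could come from, here is a small Python sketch that applies a translation of the rule's pattern to a few message bodies. It makes two assumptions: that the pattern is tested against each line of the body separately, and that the "ham" sample is realistic (it is a made-up legitimate message whose mail client happened to wrap the tag before the color value). The names SPLIT_FONT and matching_lines are invented for this example.

```python
import re

# Python translation of the LOC_HTMLSPLITFONT pattern; the /i flag becomes re.IGNORECASE.
SPLIT_FONT = re.compile(r'^"?#[a-z0-9]{6}"?>', re.IGNORECASE)


def matching_lines(body):
    """Return the lines of body that the pattern matches at their start."""
    # Assumption: the rule is evaluated one body line at a time.
    return [line for line in body.splitlines() if SPLIT_FONT.search(line)]


# The spam construct described in this message.
spam_body = '<font color=\n\n"#FFFFFB">some random text to fool Bayes</font>'

# Hypothetical legitimate mail: a long tag wrapped before a perfectly readable color.
ham_body = '<p>Totals are in <font face="Arial" color=\n"#000000">black</font>.</p>'

# A tag that is not split at all; the rule is not meant to fire here.
unsplit_body = '<font color="#FFFFFB">unsplit tag</font>'

for label, body in (("spam", spam_body), ("ham", ham_body), ("unsplit", unsplit_body)):
    print(label, matching_lines(body))
```

The pattern only looks at how the line starts, not at what the color actually is, so the wrapped legitimate tag is hit just like the spam one.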

I've made a test Perl script that parses HTML and outputs the attribute names and values, and running it indicates that HTML::Parser works fine. You can see the script and test data here: http://olo.ab.altkom.pl/domowa/admin/spamassassin/

There are 4 files there:

My_prgivate_s_ge_x_life_is_now_available_to_you_unarmed_grotesques.eml
font_attribute_line_break_corrected.html
font_attribute_line_break_orig.html
parse_test.pl

The .eml file contains the message that passed through without triggering HTML_FONT_LOW_CONTRAST. The file parse_test.pl is the Perl script. The 2 .html files contain the HTML code from the .eml message: the "_orig" one contains the code unchanged, and the "_corrected" one has the excessive line breaks removed.

Running parse_test.pl on both HTML files shows that HTML::Parser does its job fine in both cases, so the problem must lie somewhere in the SpamAssassin code that does the parsing using HTML::Parser. However, the SA code is a bit too convoluted for me, so I'm asking its original author to have a look at it.
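For readers who want to try the same kind of comparison without Perl, here is a rough sketch using Python's standard html.parser module. It is not parse_test.pl and does not use HTML::Parser; it only shows that a parser which treats whitespace after "=" as insignificant should report the same color value for the split tag and the unsplit one. The class name FontColorCollector and the function font_colors are made up for this example.

```python
from html.parser import HTMLParser


class FontColorCollector(HTMLParser):
    """Collects the value of every color attribute found on <font> tags."""

    def __init__(self):
        super().__init__()
        self.colors = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        if tag == "font":
            for name, value in attrs:
                if name == "color":
                    self.colors.append(value)


def font_colors(html):
    """Parse html and return the list of <font color=...> values, in order."""
    parser = FontColorCollector()
    parser.feed(html)
    parser.close()
    return parser.colors


# The construct from the spam samples, with line breaks after "color=".
split_tag = '<font color=\n\n"#FFFFFB">some random text to fool Bayes</font>'
# The same tag with the line breaks removed.
joined_tag = '<font color="#FFFFFB">some random text to fool Bayes</font>'

print(font_colors(split_tag))
print(font_colors(joined_tag))
```

If both calls print the same list, the parser itself is not confused by the line breaks, which matches what I saw with HTML::Parser.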

SA needs to be fixed so that it triggers the HTML_FONT_LOW_CONTRAST rule when processing the message My_prgivate_s_ge_x_life_is_now_available_to_you_unarmed_grotesques.eml.

-- Best Regards, Aleksander Adamowski GG#: 274614 ICQ UIN: 19780575 http://olo.ab.altkom.pl
