Hi!
I'd like to revive the discussion started back on February 18 ("Testing markup tags", "Semi-invisible font missed by SA").
There was a consensus that something is definitely wrong with SpamAssassin's HTML parsing when a spammer puts excessive line breaks inside an HTML FONT tag, between the attribute name ("color") and its value ("#FFFFsomething").
Back then, I published sample messages here: http://olo.ab.altkom.pl/domowa/spam/samples/low_contrast/
The problem was that spammers use the following construct, aimed directly at SpamAssassin's HTML analysis, to bypass the test html_test('font_near_invisible') and so avoid triggering the rule HTML_FONT_LOW_CONTRAST:
<font color=
"#FFFFFB">some random text to fool Bayes</font>
The excessive line breaks between "color=" and "#FFFFFB" fool the parser into missing that attribute.
I analysed the SpamAssassin 2.63 code back then, on 23 Feb, and found that the SA code does indeed receive the string "color", instead of the "#..." colour code, as the value of the "color" attribute.
Those messages keep coming and sometimes pass through SA without triggering HTML_FONT_LOW_CONTRAST, so I'm currently using a custom rule to give them an additional score:
rawbody LOC_HTMLSPLITFONT /^\"?\#[a-z0-9]{6}\"?\>/i
describe LOC_HTMLSPLITFONT font color on separate line from font tag
score LOC_HTMLSPLITFONT 2.1 1.6 2.1 1.6
But this rule has potential for false positives, so the ideal solution would be to make SpamAssassin parse those tags correctly using HTML::Parser.
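To make the false-positive risk concrete, here is a minimal sketch in Python (not SpamAssassin itself). It assumes the pattern is tried against each line of the raw body separately, and the "wrapped_ham" message is a hypothetical example of legitimate HTML whose editor wrapped a long tag:

```python
import re

# The rawbody pattern from LOC_HTMLSPLITFONT, rewritten in Python's re syntax.
SPLIT_FONT = re.compile(r'^"?#[a-z0-9]{6}"?>', re.IGNORECASE)

def hits(body):
    """Return the lines of body on which the pattern matches at line start."""
    return [line for line in body.splitlines() if SPLIT_FONT.search(line)]

# The spammer's construct: colour value pushed onto its own line.
spam = '<font color=\n"#FFFFFB">some random text to fool Bayes</font>'
# The same tag written on one line: the rule stays silent.
one_line = '<font color="#FFFFFB">some random text</font>'
# Hypothetical legitimate mail with a wrapped tag: the rule fires anyway.
wrapped_ham = '<td width="100%" bgcolor=\n"#336699">Newsletter</td>'

for name, body in [("spam", spam), ("one_line", one_line), ("wrapped_ham", wrapped_ham)]:
    print(name, hits(body))
```

The rule fires on any line that starts with a colour value followed by ">", whatever tag or attribute it belongs to, which is why it can only be a stopgap.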
I've made a test Perl script that parses HTML and outputs the attribute names and values, and running it indicates that HTML::Parser works fine. You can see the script and test data here:
http://olo.ab.altkom.pl/domowa/admin/spamassassin/
There are 4 files there:
My_prgivate_s_ge_x_life_is_now_available_to_you_unarmed_grotesques.eml
font_attribute_line_break_corrected.html
font_attribute_line_break_orig.html
parse_test.pl
The .eml file contains the message that passed through without triggering HTML_FONT_LOW_CONTRAST.
The file parse_test.pl is the Perl script.
The two .html files contain the HTML code from the .eml message: the "_orig" one contains the code unchanged, while the "_corrected" one has the excessive line breaks removed.
Running parse_test.pl on both HTML files shows that HTML::Parser does its job fine in both cases, so the problem must lie somewhere in the SpamAssassin code that does the parsing using HTML::Parser. However, the SA code is a bit too convoluted for me, so I'm asking its original author to have a look at it.
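For anyone who doesn't want to fetch the files, the result a correct parse should give can be illustrated with a rough sketch in Python. Note the assumptions: this uses Python's standard html.parser module, not HTML::Parser, it is not parse_test.pl, and the two strings only mimic the FONT construct quoted earlier rather than the actual .html files:

```python
from html.parser import HTMLParser

class AttrDump(HTMLParser):
    """Collect (tag, attribute, value) triples for every start tag seen."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            self.seen.append((tag, name, value))

def dump_attrs(html):
    """Parse html and return the attribute triples that were found."""
    parser = AttrDump()
    parser.feed(html)
    parser.close()
    return parser.seen

# Line break between "color=" and the value, as in the spam sample.
orig = '<font color=\n"#FFFFFB">some random text to fool Bayes</font>'
# Same tag with the line break removed.
corrected = '<font color="#FFFFFB">some random text to fool Bayes</font>'

print(dump_attrs(orig))
print(dump_attrs(corrected))
```

Both inputs should yield the same triple, with "#FFFFFB" as the value of "color". That is the behaviour SA would need to see for HTML_FONT_LOW_CONTRAST to have a chance of firing.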
SA needs to be fixed so that it triggers the HTML_FONT_LOW_CONTRAST rule when processing the message My_prgivate_s_ge_x_life_is_now_available_to_you_unarmed_grotesques.eml.
--
Best Regards,
Aleksander Adamowski
GG#: 274614
ICQ UIN: 19780575 http://olo.ab.altkom.pl
