http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4695
------- Additional Comments From [EMAIL PROTECTED] 2006-01-24 18:25 -------
I've been prodding this for a little bit. As far as I can tell, HTML::Parser
(I'm using 3.48) treats "<br>" differently from "<br/>". Specifically, "<br>"
gets turned into "\n", whereas "<br/>" turns into "". For example:
<br/> gets:
This is a littletest with a fewwords perlineto keep itshort// Mene Tekal
<br> gets:
This
is a little
test
with a few
words per
line
to keep it
short
//
Mene
Tekal
So when HTML::html_text() looks for obfuscation, the previous array element is
no longer "\n" (which is not considered obfuscation) but text such as
"is a little" (which is considered obfuscation).
This brings up: how do we deal with this, since the issue is in HTML::Parser?
Looking at the POD, there's a function to tweak parse() so that tags ending in
"/>" are handled as empty elements ($p->empty_element_tags()). Enabling this
also seems to stop HTML::Parser from treating the trailing "/" as part of the
element name for other elements:
    $p->empty_element_tags
    $p->empty_element_tags( $bool )
        By default, empty element tags are not recognized as such and the
        "/" before ">" is just treated like a normal name character (unless
        "strict_names" is enabled). Enabling this attribute makes
        "HTML::Parser" recognize these tags.

        Empty element tags look like start tags, but end with the character
        sequence "/>" instead of ">". When recognized by "HTML::Parser"
        they cause an artificial end event in addition to the start event.
        The "text" for the artificial end event will be empty and the
        "tokenpos" array will be undefined even though the token array
        will have one element containing the tag name.
Adding "$self->empty_element_tags(1);" to the end of M::SA::HTML::new() seems
to fix the issue. After the change, $self->{text} for the "<br/>" input is the
same as the "<br>" output above.
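
For anyone who wants to poke at this outside of SpamAssassin, here is a
minimal sketch of the effect. It is not the M::SA::HTML code: render_text()
and its handlers are hypothetical stand-ins that only emit "\n" when the tag
name reported by HTML::Parser is exactly "br". It assumes an HTML::Parser
recent enough to have empty_element_tags().

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper, not SpamAssassin code: render HTML to plain text,
# adding "\n" only when the reported start tag name is exactly "br".
sub render_text {
    my ($html, $empty_tags) = @_;
    my $out = '';
    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [ sub { $out .= "\n" if $_[0] eq 'br' }, 'tagname' ],
        text_h      => [ sub { $out .= $_[0] }, 'dtext' ],
    );
    # The option under discussion; it is left at its default when the flag is false.
    $p->empty_element_tags(1) if $empty_tags;
    $p->parse($html);
    $p->eof;    # flush any buffered text
    return $out;
}

# Print the rendered text with and without the option, making newlines visible.
for my $flag (0, 1) {
    my $text = render_text('This<br/>is a little<br/>test', $flag);
    (my $shown = $text) =~ s/\n/\\n/g;
    print "empty_element_tags=$flag: $shown\n";
}
```

Comparing the two printed lines should show whether the "<br/>" tags are being
matched as "br" on a given HTML::Parser version.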
So I'll put up a patch in a minute, though I'm not sure if setting the option
will have an effect on anything else. If someone else with more HTML::Parser
experience could chime in, that'd be good. :)