HTML::Tree -- encoding entities >255 -- conflict between parse() and as_HTML()

m_libwww_digest Sun, 15 Dec 2002 14:55:55 -0800

Before I do something kludgey or think too hard trying to find a solution that others might have already encountered and solved, I thought I'd drop a quick line here to see if anyone had run into this...

When using HTML::Tree to parse HTML (using perl 5.6, HTML::Tree 3.11), encoded HTML entities > 255 are not decoded. They are left untouched, still looking like &#8226; (or whatever number). Then during output with as_HTML(), the ampersand in that text gets encoded.

input: &#8226;
output: &amp;#8226;

browser rendering of input: a bullet (•)
browser rendering of output: &#8226;

I want to work around this in a way that will avoid breaking anything else. Telling as_HTML not to encode ampersands will break things because some ampersands *should* be encoded. I'm thinking of two solutions right now...

1: munge the input stream prior to parsing to mark all of the encoded entities > 255 so I can scan for them in the output stream and fix the output. This is the kludgey way.

2: modify the parser. This is the one I'll have to be most careful with, as I don't want any strange side effects getting in the way of the normal functioning of my tree object.
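
A minimal sketch of option 1, for illustration only. The helper names (protect_wide_entities, restore_wide_entities) and the marker string are hypothetical, not part of HTML::Tree. The marker uses only ASCII letters, digits and underscores, which output encoding should have no reason to touch. It assumes that string never occurs in the real document, and it only handles decimal references (not hex forms such as &#x2022;).

```perl
use strict;
use warnings;

# Hypothetical marker; assumed never to appear in the real input.
my $PREFIX = '__WIDE_ENTITY_';
my $SUFFIX = '__';

# Before parsing: swap decimal entities above 255 for a plain-text marker,
# so there is no ampersand left for as_HTML() to re-encode.
sub protect_wide_entities {
    my ($html) = @_;
    $html =~ s/&#(\d+);/ $1 > 255 ? "$PREFIX$1$SUFFIX" : "&#$1;" /ge;
    return $html;
}

# After as_HTML(): turn each marker back into its numeric entity.
sub restore_wide_entities {
    my ($html) = @_;
    $html =~ s/\Q$PREFIX\E(\d+)\Q$SUFFIX\E/&#$1;/g;
    return $html;
}

# Round trip without the parser in between, just to show the two halves.
my $in  = 'a &#8226; b &#233; c';
my $out = restore_wide_entities(protect_wide_entities($in));
print "$out\n";
```

In real use the tree would sit between the two calls: feed protect_wide_entities($html) to $tree->parse(), call $tree->eof(), then pass the result of $tree->as_HTML() through restore_wide_entities(). Anything that inspects text nodes in the tree would see the marker rather than the character, which is part of what makes this the kludgey route.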

So, I thought I'd drop a line here before I went with a kludge or thunk too hard about the parser. Is this something anyone has already had to solve?

A good HTML test case and some relevant documentation quotes/links available at:

http://wickline.org/lists/libwww/encodings_broken.html

-matt
