Before I do something kludgey or think too hard trying to find a solution that others might have already encountered and solved, I thought I'd drop a quick line here to see if anyone had run into this...

When using HTML::Tree to parse HTML (using perl 5.6, HTML::Tree 3.11), encoded HTML entities > 255 are not decoded. They are left untouched, still looking like &#8226; (or whatever number). Then during output with as_HTML(), the ampersand in that text gets encoded.

input: &#8226;
output: &amp;#8226;

browser rendering of input: a bullet
browser rendering of output: &#8226;

I want to work around this in a way that will avoid breaking anything else. Telling as_HTML not to encode ampersands will break things because some ampersands *should* be encoded. I'm thinking of two solutions right now...

1: munge the input stream prior to parsing to mark all of the encoded entities > 255 so I can scan for them in the output stream and fix the output. This is the kludgey way.

2: modify the parser. This is the one I'll have to be most careful with, as I don't want any strange side effects getting in the way of the normal functioning of my tree object.
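
For concreteness, option 1 might look roughly like the sketch below. It is only an illustration under some assumptions: the helper names protect_entities and restore_entities and the [[ENT:n]] marker format are made up for this example, the marker is assumed never to occur in the real documents, and only decimal references (not hex ones such as &#x2022;) are handled.

```perl
use strict;
use warnings;

# Hypothetical marker format: plain ASCII that neither the parser nor
# as_HTML() should touch. It must not occur in the real input.
my $OPEN  = '[[ENT:';
my $CLOSE = ']]';

# Before parsing: swap decimal character references above 255
# (e.g. &#8226;) for a marker. References <= 255 are left alone.
sub protect_entities {
    my ($html) = @_;
    $html =~ s/&#(\d+);/$1 > 255 ? "$OPEN$1$CLOSE" : "&#$1;"/ge;
    return $html;
}

# After as_HTML(): turn each marker back into its numeric reference.
sub restore_entities {
    my ($html) = @_;
    $html =~ s/\Q$OPEN\E(\d+)\Q$CLOSE\E/&#$1;/g;
    return $html;
}
```

The idea would be to pass protect_entities($html) to the tree's parse() method and then run restore_entities() over whatever as_HTML() returns. Because the substitution is purely textual, it also rewrites references inside attribute values and comments, which is part of why this counts as the kludgey route.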

So, I thought I'd drop a line here before I went with a kludge or thunk too hard about the parser. Is this something anyone has already had to solve?

A good HTML test case and some relevant documentation quotes/links are available at:

http://wickline.org/lists/libwww/encodings_broken.html

-matt
