Before I do something kludgey or think too hard trying to find a solution that others might have already encountered and solved, I thought I'd drop a quick line here to see if anyone had run into this...
When using HTML::Tree to parse HTML (using perl 5.6, HTML::Tree 3.11), encoded HTML entities > 255 are not decoded. They are left untouched, still looking like &#8226; (or whatever number). Then during output with as_HTML(), the ampersand in that text gets encoded.
input: &#8226;
output: &amp;#8226;
browser rendering of input: a bullet
browser rendering of output: &#8226;
I want to work around this in a way that will avoid breaking anything else. Telling as_HTML not to encode ampersands will break things because some ampersands *should* be encoded. I'm thinking of two solutions right now...
1: munge the input stream prior to parsing to mark all of the encoded entities > 255 so I can scan for them in the output stream and fix the output. This is the kludgey way.
2: modify the parser. This is the one I'll have to be most careful with, as I don't want any strange side effects getting in the way of the normal functioning of my tree object.
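For what it's worth, option 1 might look roughly like the sketch below. This is only an illustration under some assumptions: the marker string and the sub names protect_entities/restore_entities are made up for the example, the marker is assumed never to occur in the real documents, and it is assumed that as_HTML() passes plain ASCII letters and digits through unchanged. Only decimal entities are handled; hex forms such as &#x2022; would need another pattern.

```perl
use strict;
use warnings;

# Hypothetical marker: pick something that cannot occur in your documents.
my $MARK = 'XXENTITYXX';

# Before parsing: swap decimal numeric entities above 255 for a
# plain-text placeholder; entities <= 255 are left exactly as they were.
sub protect_entities {
    my ($html) = @_;
    $html =~ s/&#(\d+);/$1 > 255 ? "$MARK$1$MARK" : "&#$1;"/ge;
    return $html;
}

# After as_HTML(): turn the placeholders back into numeric entities.
sub restore_entities {
    my ($html) = @_;
    $html =~ s/\Q$MARK\E(\d+)\Q$MARK\E/&#$1;/g;
    return $html;
}

# Round trip without the parser in the middle, just to show the munging.
my $in  = 'A &#8226; B &amp; C &#233;';
my $mid = protect_entities($in);   # this is what would be handed to the parser
my $out = restore_entities($mid);  # this would be applied to the as_HTML() output
print "$mid\n$out\n";
```

In real use the parse and the as_HTML() call would sit between the two subs. Note that the substitution is blind to context, so it would also touch entities inside attribute values, comments, and script blocks.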
So, I thought I'd drop a line here before I went with a kludge or thunk too hard about the parser. Is this something anyone has already had to solve?
A good HTML test case and some relevant documentation quotes/links available at:
http://wickline.org/lists/libwww/encodings_broken.html
-matt