HTML::TreeBuilder/HTML::Parser - problem parsing tables

Neven Luetic Mon, 05 Apr 2004 07:30:12 -0700

Hello,

I wrote a small application that collects samples of pages from sites so
that usability checks can be done offline. The archived pages therefore
have to match the originals exactly when displayed.


Some of the tests on the pages will be automated, using tags or
attributes as search criteria, and all links to pictures inside the
pages had to be rewritten, so I decided to use HTML::TreeBuilder for
this.

However, I encountered a critical difference between the original pages
and the pages produced by parsing with HTML::TreeBuilder->parse() and
writing with HTML::TreeBuilder->as_HTML: on several German newspaper
sites that use big tables for their layout, some tables are closed too
early by the parser. The effect is that from that point onward the table
cells are displayed row by row (this is true for every browser I tried:
Mozilla, Firefox, Opera, IE6), while the original page looks OK.

I tried setting HTML::TreeBuilder->implicit_tags(0) (this will be my
default setting anyway), but it didn't change the behavior. So I suppose
the problem is not with some routine *adding* tags that are presumed to
be missing, but with the parser itself misinterpreting the tree.
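
A minimal sketch of the round trip described above, for anyone who wants
to reproduce it. The sub name roundtrip and the sample table are
hypothetical, and the store_comments, ignore_unknown and as_HTML
arguments are assumptions about what a faithful archive would want, not
settings taken from the application itself:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Parse an HTML string and write it back out again.
# (Hypothetical helper; only parse(), as_HTML and implicit_tags(0)
# are mentioned in the post itself.)
sub roundtrip {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new;
    $tree->implicit_tags(0);    # do not infer elements or end tags
    $tree->store_comments(1);   # assumption: keep comments in the tree
    $tree->ignore_unknown(0);   # assumption: keep unrecognized elements
    $tree->parse($html);
    $tree->eof;                 # tell the parser the input is complete

    # as_HTML(entities, indent, optional_end_tags):
    #   '<>&'  encode only these characters as entities
    #   undef  no re-indentation of the output
    #   {}     treat no end tag as optional, so all are written out
    my $out = $tree->as_HTML('<>&', undef, {});

    $tree->delete;              # release the tree explicitly
    return $out;
}

# Placeholder input; a real test would use one of the affected pages.
my $sample = '<table><tr><td>a</td><td>b</td></tr></table>';
print roundtrip($sample), "\n";
```

Comparing the output of such a call with the original markup around the
first broken table should show where the table is being closed.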

Does anybody have an idea about what the problem might be and how I
could solve this?

I'm pretty stuck: nearly a quarter of all (newspaper and magazine)
sites tested have this problem, which renders the script virtually
useless.

Greetings

Neven
