RE: Sorting HTML tables

Paul Harwood Thu, 05 Aug 2004 09:45:37 -0700

The table is fairly complicated. I'll take a look at those modules
though. Thanks!

-----Original Message-----
From: Chris Devers [mailto:[EMAIL PROTECTED] 
Posted At: Wednesday, August 04, 2004 5:03 PM
Posted To: Perl
Conversation: Sorting HTML tables
Subject: Re: Sorting HTML tables

On Wed, 4 Aug 2004, Perl wrote:

> I wrote some code to identify and print HTML tables below

Don't do that.

HTML is tremendously difficult to analyze properly with tools like 
regular expressions.

You're much, much better off using a proper parser library that can 
build up a tree model of the html that you can analyze as you like.

The standard libraries for this are probably HTML::Parser and 
HTML::TreeBuilder. You may also like HTML::TableContentParser.

<http://search.cpan.org/~gaas/HTML-Parser-3.36/Parser.pm> 
<http://search.cpan.org/~sburke/HTML-Tree-3.18/lib/HTML/TreeBuilder.pm> 
<http://search.cpan.org/~sdrabble/HTML-TableContentParser-0.13/TableContentParser.pm>

This may point you in a useful direction:

   use strict;
   use warnings;
   use HTML::TableContentParser;

   my $p      = HTML::TableContentParser->new();
   my $html   = read_html_from_somewhere();   # placeholder: supply your own
   my $tables = $p->parse($html);
   for my $t (@$tables) {
     for my $r (@{$t->{rows}}) {
       print "Row: ";
       for my $c (@{$r->{cells}}) {
         print "[$c->{data}] ";
       }
       print "\n";
     }
   }

Something like this should work even for godawful ms-html :-)

> The problem I am stuck with is that now I want to sort the tables 
> based on a Priority (which range from 1-3). There may be several 
> tables with the same priority numbers.  An example of a Priority 3 
> would be:
>
> # extraordinarily ugly html omitted
>
> I need help in understanding the methodology in how to extract these 2
> items and then sort the tables in Priority order (all the 1's, 2's and
> 3's).

It looks like HTML::TableContentParser makes sorting through the 
structure of the table pretty easy; HTML::Parser could go farther by 
reducing it down to just the printable text -- some combination of the 
two may be useful here.

Once you've stripped out all the junk (all the span tags, the paragraph 
tags, the "<o:p></o:p>" type debris, etc), you just need to convert 
the html structure into some kind of populated data structure.
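
A rough sketch of that conversion, assuming each table arrives in the 
hash layout used in the loop above (rows, then cells, then data). The 
helper name table_to_rows and the sample cell contents are made up for 
illustration, and the tag strip is deliberately crude: it only mops up 
leftover inline tags inside a cell that a real parser has already 
isolated, and running the cell text through HTML::Parser would be the 
sturdier route.

```perl
use strict;
use warnings;

# Hypothetical helper: flatten one parsed table into rows of plain text.
# Assumes the {rows} -> {cells} -> {data} layout shown earlier.
sub table_to_rows {
    my ($table) = @_;
    my @rows;
    for my $r (@{ $table->{rows} || [] }) {
        my @cells;
        for my $c (@{ $r->{cells} || [] }) {
            my $text = defined $c->{data} ? $c->{data} : '';
            $text =~ s/<[^>]*>//g;    # crude tag strip, illustration only
            $text =~ s/&nbsp;/ /g;    # non-breaking-space padding
            $text =~ s/\s+/ /g;       # collapse runs of whitespace
            $text =~ s/^ | $//g;      # trim leading/trailing space
            push @cells, $text;
        }
        push @rows, \@cells;
    }
    return \@rows;
}

# Made-up sample in the same shape as one parsed table.
my $table = { rows => [
    { cells => [ { data => '<p><span>Priority</span> 3<o:p></o:p></p>' },
                 { data => "  any data\n here " } ] },
] };

my $rows = table_to_rows($table);
print join(' | ', @{ $rows->[0] }), "\n";   # Priority 3 | any data here
```

Each table ends up as an array of rows, each row an array of plain 
strings, which is a lot easier to pick through than the original markup.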

You didn't give enough of the html to suggest how the rest of the table 
is structured -- it was really just one big hairy table cell -- so 
it's hard to guess how the other pieces fit together.

Can you post a simpler example of what the table is built like, e.g.:

     +------------+-------+---------------+----------------+
     | priority 1 | field | another field | some more      |
     +------------+-------+---------------+----------------+
     | priority 3 | field | any data here | other things   |
     +------------+-------+---------------+----------------+
     | priority 2 | field | stuff stuff   | whatever       |
     +------------+-------+---------------+----------------+

Or is it more complicated than that?
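
If it really is about that simple, the sort itself is short. This is 
only a sketch under that assumption: it takes the rows as arrays of 
plain-text cells, guesses that the priority shows up as "priority N" in 
the first cell, and uses a made-up helper called priority_of. Your real 
data may well keep the number somewhere else.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the priority number out of a row's first cell.
# Rows with no recognizable priority get 99 so they sort last.
sub priority_of {
    my ($row) = @_;
    my $first = defined $row->[0] ? $row->[0] : '';
    return $first =~ /priority\s*(\d+)/i ? $1 : 99;
}

# Sample rows copied from the diagram above.
my @rows = (
    [ 'priority 1', 'field', 'another field', 'some more'    ],
    [ 'priority 3', 'field', 'any data here', 'other things' ],
    [ 'priority 2', 'field', 'stuff stuff',   'whatever'     ],
);

# Schwartzian transform: compute each key once, sort numerically, unwrap.
my @sorted = map  { $_->[1] }
             sort { $a->[0] <=> $b->[0] }
             map  { [ priority_of($_), $_ ] } @rows;

print join(' | ', @$_), "\n" for @sorted;
```

The same pattern works if the things being sorted are whole tables 
rather than rows; only priority_of would need to change to look wherever 
the priority lives in each table.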

-- 
Chris Devers      [EMAIL PROTECTED]
http://devers.homeip.net:8080/blog/

np: 'Lujon'
      by Henry Mancini
      from 'The Best Of Mancini'

-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
<http://learn.perl.org/> <http://learn.perl.org/first-response>
