RE: more specific html parsing question..

Thomas, Mark - BLS CTR Fri, 04 Jun 2004 02:39:51 -0700

> in looking at the issue in more depth.. the real question i 
> have is how can one take a "list" of items where the parent 
> has descendants, and simply "pop" off the "text" for the 
> parent/root node without the children???


An alternative is to use the HTML parsing capabilities of XML::LibXML and
then select the nodes you want with XPath. XPath has several advantages for
parsing XML and HTML: it is a standard, it is relatively simple, and because
XPath expressions are plain strings, they can be kept separate from the
logic of your program (perhaps in a configuration variable or file) to
better accommodate changes in the (X)HTML you're parsing.
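
To illustrate keeping the XPath out of the program logic, here is a minimal
sketch. The %config hash, its course_code key, and the extract_field helper
are names I made up for illustration; in a real program the hash might be
loaded from a file instead.

```perl
#!perl -w
use strict;
use XML::LibXML;

# Hypothetical configuration: the XPath lives here, not in the logic below.
my %config = (
    course_code => '//span[@class="em"]/text()',
);

# Hypothetical helper: evaluate an XPath and trim surrounding whitespace.
sub extract_field {
    my ($doc, $xpath) = @_;
    my $value = $doc->findvalue($xpath);
    $value =~ s/^\s+|\s+$//g;
    return $value;
}

my $doc = XML::LibXML
        ->new({recover=>1})
        ->parse_html_string(
            '<html><body><p><span class="em">ACCA 310F '
          . '<span class="on">FOUNDATIONS OF ACCOUNTING</span>'
          . '</span></p></body></html>'
        );

# If the markup changes, only the string in %config needs to change.
print extract_field($doc, $config{course_code}), "\n";
```

If the page layout changes later, you would edit only the string in %config
and leave the rest of the script alone.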

Pulling what you want from the file is a one-liner:

  print $doc->findvalue('//span[@class="em"]/text()');

The XPath says: find the span with a class of "em" and take its text.
'/text()' selects only the text nodes that are direct children of that
span. To include the text of its child elements as well, add a slash
( '//text()' instead of '/text()' ).
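
To see the difference between the two forms side by side, here is a small
sketch on a cut-down version of the markup. The sample HTML and the variable
names $own and $all are mine, chosen only for illustration.

```perl
#!perl -w
use strict;
use XML::LibXML;

# Cut-down sample: an outer span with its own text plus a nested span.
my $html = '<html><body><p><span class="em">ACCA 310F '
         . '<span class="on">FOUNDATIONS OF ACCOUNTING</span>'
         . '</span></p></body></html>';

my $doc = XML::LibXML
        ->new({recover=>1})
        ->parse_html_string($html);

# Single slash: only text nodes that are direct children of the span.
my $own = $doc->findvalue('//span[@class="em"]/text()');

# Double slash: text nodes at any depth below the span.
my $all = $doc->findvalue('//span[@class="em"]//text()');

print "own text: [$own]\n";
print "all text: [$all]\n";
```

The first value should hold only the outer span's own text, while the second
should also pick up the nested span's text.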

Full working example:

#!perl -w
use strict;
use XML::LibXML;

my $html = q(
<html>
<head><title></title></head>
<body>
<table>
<tr class="tbon"> @0.1.1.0.0.0.2.1.0.0.0.3.1
  <td colspan=7> @0.1.1.0.0.0.2.1.0.0.0.3.1.0
    <p class="tbtx"> @0.1.1.0.0.0.2.1.0.0.0.3.1.0.0
      <span class="em"> @0.1.1.0.0.0.2.1.0.0.0.3.1.0.0.0
        "ACCA  310F "
        <span class="on"> @0.1.1.0.0.0.2.1.0.0.0.3.1.0.0.0.1
          "FOUNDATIONS OF ACCOUNTING"
      " A A "
      <b> @0.1.1.0.0.0.2.1.0.0.0.3.1.0.0.2
          </b>
          </span>
          </span>
        </p>
  </td>
</tr>
</table>
</body>
</html>
);

my $doc = XML::LibXML
        ->new({recover=>1})
        ->parse_html_string($html); #or parse_html_file

# Print the text of the span with class "em" (direct text children only)
print $doc->findvalue('//span[@class="em"]/text()');


-- 
Mark Thomas                    [EMAIL PROTECTED]
Internet Systems Architect     User Technology Associates, Inc.

$_=q;KvtuyboopuifeyQQfeemyibdlfee;; y.e.s. ;y+B-x+A-w+s; ;y;y; ;;print;;
