Hi,

I need to parse a bunch of XHTML files and extract the body content
using XML::LibXML. I figured that should be possible since XHTML is
supposed to be valid XML, right?

Here's the code that I'm using:


#!/usr/bin/perl

> use XML::LibXML;
> my $parser = XML::LibXML->new();
> 
> my $doc = $parser->parse_file("xhtml.htm");
> my $docRoot = $doc->getDocumentElement;

> print $_->toString for $docRoot->findnodes("body")->shift->childNodes;

but I keep getting the error:

> Can't call method "childNodes" on an undefined value

as if it can't find a body element.


Now, the file seems to parse without errors, and

> print $docRoot->toString();
 
prints the whole html tag just fine.


After a couple of tests I found that it works if I remove the following
from the beginning of the document:

> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
> <html xmlns="http://www.w3.org/1999/xhtml">

and replace it with just 

> <html>


Why is that? Is there a way to make the script work with the DOCTYPE
and the xmlns attribute left in place?
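
My guess (not confirmed) is that the xmlns attribute puts every element,
including body, into the XHTML namespace, so an unprefixed XPath like
"body" no longer matches. Below is a minimal sketch of what I would try
next, assuming XML::LibXML::XPathContext is available in my install. The
prefix "x", the helper name body_content, and the inline test document
are made up for illustration; the real script would parse xhtml.htm
instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Made-up stand-in for xhtml.htm (DOCTYPE left out to keep it short).
my $xhtml = <<'EOT';
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>test</title></head>
<body><p>Hello</p></body>
</html>
EOT

# Hypothetical helper: return the serialized children of <body>,
# looking the element up in the XHTML namespace.
sub body_content {
    my ($string) = @_;
    my $doc = XML::LibXML->new->parse_string($string);

    # XPath context rooted at the <html> element.
    my $xpc = XML::LibXML::XPathContext->new($doc->documentElement);

    # "x" is an arbitrary prefix bound to the URI from the xmlns attribute.
    $xpc->registerNs(x => 'http://www.w3.org/1999/xhtml');

    my ($body) = $xpc->findnodes('x:body');
    return '' unless $body;
    return join '', map { $_->toString } $body->childNodes;
}

print body_content($xhtml), "\n";
```

If that assumption is right, the only change needed in my script would
be to query through the XPath context with the prefixed name instead of
calling findnodes("body") on the root element.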

Thanks for any hint!
Ingo
