Hi,
I need to parse and extract the body content from a bunch of XHTML files
using XML::LibXML. I figured that should be possible since XHTML is
supposed to be valid XML, right?
Here's the code that I'm using:
> #!/usr/bin/perl
> use XML::LibXML;
> my $parser = XML::LibXML->new();
>
> my $doc = $parser->parse_file("xhtml.htm");
> my $docRoot = $doc->getDocumentElement;
> print $_->toString for $docRoot->findnodes("body")->shift->childNodes;
but I keep getting the error:
> "Can't call method "childNodes" on an undefined value"
as if it can't find a body element.
Now, the file seems to parse without errors, and
> print $docRoot->toString();
prints the whole html tag just fine.
After a couple of tests I found that it works if I remove the following
from the beginning of the document:
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
> <html xmlns="http://www.w3.org/1999/xhtml">
and replace it with just
> <html>
Why is that? Is there a way that I can make the script work WITH these
tags?
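
One possible explanation, offered as a sketch rather than a confirmed diagnosis: the xmlns attribute puts html and body into the XHTML namespace, and an unprefixed XPath step such as "body" only matches elements that are in no namespace. Under that assumption, registering a prefix for the namespace through XML::LibXML::XPathContext should let the lookup succeed. The inline document below is a hypothetical stand-in for xhtml.htm (the DOCTYPE is left out so nothing external is fetched), and the prefix "x" is an arbitrary choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Hypothetical inline stand-in for xhtml.htm; only the xmlns declaration
# matters for this test.
my $xhtml = <<'XML';
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>test</title></head>
<body><p>Hello</p></body>
</html>
XML

my $doc = XML::LibXML->load_xml(string => $xhtml);

# Bind a prefix of our own choosing ("x" is arbitrary) to the XHTML namespace.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs(x => 'http://www.w3.org/1999/xhtml');

# The unprefixed step looks for a body element in no namespace;
# the prefixed path looks for one in the XHTML namespace.
my @plain  = $doc->documentElement->findnodes('body');
my ($body) = $xpc->findnodes('/x:html/x:body');

# Serialize the children of body, as the original script intended.
my $content = join '', map { $_->toString } $body->childNodes;
print $content, "\n";
```

If this reading is right, @plain stays empty while $body is defined, which would match the "undefined value" error with the namespaced html tag and the success with the bare one.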
Thanks for any hint!
Ingo