On Monday, January 5, 2004, at 03:54 PM, Eric Lease Morgan wrote:

To create my HTML files with rich meta data, I need to extract bits and
pieces of information from the teiHeader of my originals. The snippet of
code below illustrates how I am currently doing this with XML::LibXML:


[...]

The code works, but is really slow. Can you suggest a way to improve my code
or use some other technique for extracting things like author, title, and id
from my XML?

Check out XML::Twig, which uses XML::Parser. It gives you -- in tree form -- only those elements you're interested in. From the README:


One of the strengths of XML::Twig is that it let you work with files that
do not fit in memory (BTW storing an XML document in memory as a tree is
quite memory-expensive, the expansion factor being often around 10).


To do this you can define handlers, that will be called once a specific
element has been completely parsed.


I *think* your code would then look like this:

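use strict;
use warnings;
use XML::Twig;

my ($author, $title, $id);

# Each handler is called as ($twig, $element); $_[1] is an element object,
# so ->text is needed to get its text content as a string.
my $twig = XML::Twig->new('twig_roots' => {
    'teiHeader/fileDesc/titleStmt/author'     => sub { $author = $_[1]->text },
    'teiHeader/fileDesc/titleStmt/title'      => sub { $title  = $_[1]->text },
    'teiHeader/fileDesc/publicationStmt/idno' => sub { $id     = $_[1]->text },
});

# Parse the file; the handlers above fire as each matching element closes.
$twig->parsefile('/foo/bar.xml');

# Free whatever is left of the tree once the values have been captured.
$twig->purge;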

This is totally untested -- I don't even have XML::Twig installed, I'm just going by the documentation on CPAN.
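
Here is a minimal self-contained sketch of the same handler idea, parsing an in-memory string instead of a file so it can be tried without a TEI document on hand. The sample XML, the %meta hash, and the grab() helper are all made up for illustration, and the shorter partial paths are an assumption about how specific the match needs to be for your documents.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample document; a real TEI file would be much larger.
my $xml = <<'XML';
<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>An Example Title</title>
        <author>Jane Doe</author>
      </titleStmt>
      <publicationStmt>
        <idno>example-001</idno>
      </publicationStmt>
    </fileDesc>
  </teiHeader>
  <text><body><p>Body text that is not a twig root.</p></body></text>
</TEI.2>
XML

my %meta;

# Hypothetical helper: returns a handler that stores the element's text
# under the given key. XML::Twig calls handlers as ($twig, $element).
sub grab {
    my ($key) = @_;
    return sub {
        my ($t, $elt) = @_;
        $meta{$key} = $elt->text;
        $t->purge;    # release what has been parsed so far
    };
}

# Only the elements named in twig_roots are built into the tree.
my $twig = XML::Twig->new(
    twig_roots => {
        'titleStmt/author'     => grab('author'),
        'titleStmt/title'      => grab('title'),
        'publicationStmt/idno' => grab('id'),
    },
);
$twig->parse($xml);

print "$_: $meta{$_}\n" for sort keys %meta;
```

Calling purge inside each handler, rather than once at the end, is what should keep memory use flat on large files, since nothing parsed so far is kept around after its value has been copied out.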

For more info (including a tutorial) see <URL:http://www.xmltwig.com/xmltwig/>.

Paul.

--
Paul Hoffman :: Taubman Medical Library :: Univ. of Michigan
[EMAIL PROTECTED] :: [EMAIL PROTECTED] :: http://www.nkuitse.com/


