Re: Extracting data from an XML file

Paul Hoffman Mon, 05 Jan 2004 15:06:41 -0800

On Monday, January 5, 2004, at 03:54 PM, Eric Lease Morgan wrote:

> To create my HTML files with rich metadata, I need to extract bits and pieces of information from the teiHeader of my originals. The snippet of code below illustrates how I am currently doing this with XML::LibXML:
>
> [...]
>
> The code works, but it is really slow. Can you suggest a way to improve my code or use some other technique for extracting things like author, title, and id from my XML?

Check out XML::Twig, which uses XML::Parser. It gives you -- in tree form -- only those elements you're interested in. From the README:

One of the strengths of XML::Twig is that it lets you work with files that do not fit in memory (BTW storing an XML document in memory as a tree is quite memory-expensive, the expansion factor being often around 10).

To do this you can define handlers that will be called once a specific element has been completely parsed.

I *think* your code would then look like this:

use XML::Twig;

my ($author, $title, $id);

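# Each handler is called with the twig and the matched element;
# keep the element's text rather than the element object itself.
my $twig = XML::Twig->new(
    twig_roots => {
        'teiHeader/fileDesc/titleStmt/author'     => sub { $author = $_[1]->text },
        'teiHeader/fileDesc/titleStmt/title'      => sub { $title  = $_[1]->text },
        'teiHeader/fileDesc/publicationStmt/idno' => sub { $id     = $_[1]->text },
    },
);

$twig->parsefile('/foo/bar.xml');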

$twig->purge;

This is totally untested -- I don't even have XML::Twig installed, I'm just going by the documentation on CPAN.

For more info (including a tutorial) see <URL:http://www.xmltwig.com/xmltwig/>.

Paul.

--
Paul Hoffman :: Taubman Medical Library :: Univ. of Michigan
[EMAIL PROTECTED] :: [EMAIL PROTECTED] :: http://www.nkuitse.com/
