Dermot Paikkos wrote:

> Hi,
> 
> I am trying to parse the data out of an XML file. The file is below.
> Most of the data is easily grabbed but the keywords stretch over
> several newlines and there can be anywhere between 0 and 20 entries. I
> have tried using /m and /s but these don't seem to work. I have set
> $/="<image>", I don't know if this is impacting on my attempts. But
> changing it doesn't help either.
> 
> Here is what I am using at the moment:
> ==============
> my $datafile = "news.xml";
> open(FH,$datafile)|| die "Can't open $datafile: $!\n";
> while (defined($i=<FH>)) {
>         $/ = </image>;
>         if ( $i =~ /\?xml version*/ ) {
>                 next;
>         }
>         (my $splnum) = ($i =~ /<image number=.(\w\d+\/\d+)/i);
>         (my $title) = ($i =~ /<title>(.*)<\/title>/);
>         (my $date ) = ($i =~ /<date>(.*)<\/date>/);
>         (my $credit) = ($i =~ /<credit>(.*)<\/credit>/);
>         (my $caption) = ($i =~ /<caption>(.*)<\/caption>/);
>         (my $keywords) = ($i =~ /<keyword>(.*)<\/keyword>/);
>         chomp($splnum,$title,$date,$credit);
>         print "$splnum $title $date $credit $keywords\n";
>  }
> ===============
> 
> This only grabs the first keyword (NERVE FIBRE, OVERLAPPING) and I
> need them all. Also the processing seems to stop after two records
> when there are 470 in $datafile!!. I can't work that out either.
> 
> Any ideas? There are a lot of xml modules out there but I don't know if
> any would help.
> Thanx.
> Dp.
> 
> 
> =========== news.xml ============
> <?xml version='1.0'?>
> <images>
> <image number='P350/041'>
> <title>Coloured SEM of two overlapping nerve fibres</title>
> <date>09-Jul-98</date>
> <credit>CREDIT: JUERGEN BERGER, MAX-PLANCK
> INSTITUTE/SCIENCE PHOTO LIBRARY</credit>
> <caption>CREDIT: JUERGEN BERGER, MAX-PLANCK INSTITUTE/
> SCIENCE PHOTO LIBRARY Nerve fibres. Coloured scanning electron
> micrograph (SEM) of overlapping nerve fibres. Each fibre is made up
> of several individual axons. An axon is a long extension from a nerve
> cell (or neurone) which is the main output process of the cell. Some
> small neurone cell bodies (rounded) can be seen here alongside the
> axons. Nerve fibres rapidly relay signals between the central nervous
> system (the brain and spinal cord) and muscles and organs in the
> body. This allows the body to react quickly to any situation.
> Magnification unknown.</caption>
> <keywords>
> <keyword>NERVE FIBRE, OVERLAPPING</keyword>
> <keyword>AXON, NERVE FIBRE, OVERLAPPING</keyword>
> <keyword>FIBRE, NERVE, OVERLAPPING</keyword>
> <keyword>NERVE CELL, WITH FIBRES</keyword>
> <keyword>NEURONE, WITH NERVE FIBRES</keyword>
> <keyword>HUMAN BODY, ANATOMY, NERVOUS</keyword>
> <keyword>SYSTEM, NERVE FIBRE, FIBRES</keyword>
> </keywords>
> </image>
> </images>
> ~~
> Dermot Paikkos * [EMAIL PROTECTED]
> Network Administrator @ Science Photo Library
> Phone: 0207 432 1100 * Fax: 0207 286 8668

trying to do this with regular expressions is unwise. there are a number of 
modules out there that can help you quickly find what you need in an XML 
file. one of those modules is XML::Parser. you can use it like:

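#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

my $kw  = 0;    # true while we are inside a <keyword> element
my $kws = '';   # text collected for the current keyword

my $xml = XML::Parser->new(Handlers => {Start => \&start,
                                        End   => \&end,
                                        Char  => \&string});

# parsefile() opens and parses the file; it dies if the file
# cannot be opened or the XML is not well-formed
$xml->parsefile('your.xml');

# called for every opening tag; $_[1] is the element name
sub start{
        $kw = 1 if($_[1] eq 'keyword');
}

# called for every closing tag; print the keyword and reset the state
sub end{
        if($_[1] eq 'keyword'){
                print "get one keyword: $kws\n";
                $kws = '';
                $kw = 0;
        }
}

# called for character data, possibly several times per element
sub string{
        $kws .= $_[1] if($kw && $_[1] =~ /\S/);
}

__END__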

the above only extracts the text inside the <keyword> tags from the XML 
file, but you can apply the same technique to the other tags. i didn't 
really test the above, but i hope it gives you something to look into.
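
as a rough sketch of what "the same technique for the other tags" could 
look like: the version below collects one hash per <image>, with the 
number attribute, the title/date/credit/caption text and a list of all 
keywords. the handler names (on_start, on_end, on_char), the %wanted 
table and the @records layout are made up for illustration, and the 
inline sample is an abbreviated copy of the news.xml from the question, 
so treat it as a starting point rather than a finished solution.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# abbreviated sample taken from the news.xml in the question
my $sample = <<'XML';
<?xml version='1.0'?>
<images>
<image number='P350/041'>
<title>Coloured SEM of two overlapping nerve fibres</title>
<date>09-Jul-98</date>
<credit>CREDIT: JUERGEN BERGER, MAX-PLANCK
INSTITUTE/SCIENCE PHOTO LIBRARY</credit>
<keywords>
<keyword>NERVE FIBRE, OVERLAPPING</keyword>
<keyword>AXON, NERVE FIBRE, OVERLAPPING</keyword>
<keyword>FIBRE, NERVE, OVERLAPPING</keyword>
</keywords>
</image>
</images>
XML

# tags whose text we keep (hypothetical choice for this sketch)
my %wanted = map { $_ => 1 } qw(title date credit caption keyword);

my @records;      # one hash ref per <image>
my $record;       # the <image> currently being parsed
my $text = '';    # character data collected for the current tag

sub on_start {
    my ($expat, $tag, %attr) = @_;
    if ($tag eq 'image') {
        $record = { number => $attr{number}, keywords => [] };
    }
    $text = '';   # start collecting afresh for every new element
}

sub on_end {
    my ($expat, $tag) = @_;
    if ($tag eq 'image') {
        push @records, $record;
        $record = undef;
    }
    elsif ($record && $wanted{$tag}) {
        (my $value = $text) =~ s/\s+/ /g;   # fold line breaks into spaces
        $value =~ s/^\s+|\s+$//g;           # trim both ends
        if ($tag eq 'keyword') { push @{ $record->{keywords} }, $value }
        else                   { $record->{$tag} = $value }
    }
    $text = '';
}

sub on_char {
    my ($expat, $string) = @_;
    $text .= $string;   # may be called several times per element
}

my $parser = XML::Parser->new(Handlers => {Start => \&on_start,
                                           End   => \&on_end,
                                           Char  => \&on_char});
$parser->parse($sample);

for my $r (@records) {
    print "$r->{number} | $r->{title} | $r->{date} | $r->{credit}\n";
    print "    keyword: $_\n" for @{ $r->{keywords} };
}
```

for the real file you would swap the parse($sample) call for 
parsefile('news.xml'). because the parser reads the whole document 
itself, there is no need to touch $/ at all.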

much easier than writing tons of reg. exp. right? :-)

david
