Dermot Paikkos wrote:
> Hi,
>
> I am trying to parse the data out of an XML file. The file is below.
> Most of the data is easily grabbed but the keywords stretch over
> several newlines and there can be anywhere between 0 and 20 entries. I
> have tried using /m and /s but these don't seem to work. I have set
> $/="<image>", I don't know if this is impacting on my attempts. But
> changing it doesn't help either.
>
> Here is what I am using at the moment:
> ==============
> my $datafile = "news.xml";
> open(FH,$datafile)|| die "Can't open $datafile: $!\n";
> while (defined($i=<FH>)) {
>     $/ = </image>;
>     if ( $i =~ /\?xml version*/ ) {
>         next;
>     }
>     (my $splnum) = ($i =~ /<image number=.(\w\d+\/\d+)/i);
>     (my $title) = ($i =~ /<title>(.*)<\/title>/);
>     (my $date ) = ($i =~ /<date>(.*)<\/date>/);
>     (my $credit) = ($i =~ /<credit>(.*)<\/credit>/);
>     (my $caption) = ($i =~ /<caption>(.*)<\/caption>/);
>     (my $keywords) = ($i =~ /<keyword>(.*)<\/keyword>/);
>     chomp($splnum,$title,$date,$credit);
>     print "$splnum $title $date $credit $keywords\n";
> }
> ===============
>
> This only grabs the first keyword (NERVE FIBRE, OVERLAPPING) and I
> need them all. Also the processing seems to stop after two records
> when there are 470 in $datafile!!. I can't work that out either.
>
> Any ideas? There are a lot of xml modules out there but I don't know if
> any would help.
> Thanx.
> Dp.
>
>
> =========== news.xml ============
> <?xml version='1.0'?>
> <images>
> <image number='P350/041'>
> <title>Coloured SEM of two overlapping nerve fibres</title>
> <date>09-Jul-98</date>
> <credit>CREDIT: JUERGEN BERGER, MAX-PLANCK
> INSTITUTE/SCIENCE PHOTO LIBRARY</credit>
> <caption>CREDIT: JUERGEN BERGER, MAX-PLANCK INSTITUTE/
> SCIENCE PHOTO LIBRARY Nerve fibres. Coloured scanning electron
> micrograph (SEM) of overlapping nerve fibres. Each fibre is made up
> of several individual axons. An axon is a long extension from a nerve
> cell (or neurone) which is the main output process of the cell. Some
> small neurone cell bodies (rounded) can be seen here alongside the
> axons. Nerve fibres rapidly relay signals between the central nervous
> system (the brain and spinal cord) and muscles and organs in the
> body. This allows the body to react quickly to any situation.
> Magnification unknown.</caption>
> <keywords>
> <keyword>NERVE FIBRE, OVERLAPPING</keyword>
> <keyword>AXON, NERVE FIBRE, OVERLAPPING</keyword>
> <keyword>FIBRE, NERVE, OVERLAPPING</keyword>
> <keyword>NERVE CELL, WITH FIBRES</keyword>
> <keyword>NEURONE, WITH NERVE FIBRES</keyword>
> <keyword>HUMAN BODY, ANATOMY, NERVOUS</keyword>
> <keyword>SYSTEM, NERVE FIBRE, FIBRES</keyword>
> </keywords>
> </image>
> </images>
> ~~
> Dermot Paikkos * [EMAIL PROTECTED]
> Network Administrator @ Science Photo Library
> Phone: 0207 432 1100 * Fax: 0207 286 8668

Trying to do this with a regular expression is unwise. There are a number of
modules out there that can help you quickly find what you need in an XML
file. One of those modules is XML::Parser. You can use it like this:

    #!/usr/bin/perl -w
    use strict;
    use XML::Parser;

    my $kw  = 0;    # true while we are inside a <keyword> element
    my $kws = '';   # text collected for the current keyword

    my $xml = new XML::Parser(Handlers => {Start => \&start,
                                           End   => \&end,
                                           Char  => \&string});

    open(XML,'your.xml') || die $!;
    $xml->parse(*XML);
    close(XML);

    # each handler gets the parser as $_[0]; $_[1] is the tag name
    # (start/end) or a chunk of character data (string)
    sub start{
        $kw = 1 if($_[1] eq 'keyword');
    }

    sub end{
        if($_[1] eq 'keyword'){
            print "get one keyword: $kws\n";
            $kws = '';
            $kw  = 0;
        }
    }

    sub string{
        # skip whitespace-only chunks between tags
        $kws .= $_[1] if($kw && $_[1] =~ /\S/);
    }

    __END__

The above only extracts the text inside the <keyword> tags from the XML
file, but you can apply the same technique to the other tags. I didn't
really test the above, but I hope it gives you something to look into.
Much easier than writing tons of regular expressions, right? :-)

david
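
As a rough sketch of what "apply the same technique to the other tags" could look like, the example below uses the same three XML::Parser handlers (Start, End, Char) to build one record per <image>, with every <keyword> collected into a list. The names parse_images, %wanted and $sample are made up for illustration, and the sample data is a shortened copy of the news.xml from the question, parsed from a string so the script runs on its own. It assumes XML::Parser is installed from CPAN, and it is not tested against the real 470-record file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Shortened sample shaped like news.xml from the question (illustrative only).
my $sample = <<'XML';
<?xml version='1.0'?>
<images>
 <image number='P350/041'>
  <title>Coloured SEM of two overlapping nerve fibres</title>
  <date>09-Jul-98</date>
  <credit>CREDIT: JUERGEN BERGER, MAX-PLANCK
  INSTITUTE/SCIENCE PHOTO LIBRARY</credit>
  <keywords>
   <keyword>NERVE FIBRE, OVERLAPPING</keyword>
   <keyword>AXON, NERVE FIBRE, OVERLAPPING</keyword>
  </keywords>
 </image>
</images>
XML

# Elements whose text we want to keep (hypothetical helper table).
my %wanted = map { $_ => 1 } qw(title date credit caption keyword);

# Hypothetical helper: returns a list of hash refs, one per <image>.
sub parse_images {
    my ($xml_text) = @_;
    my (@records, $rec, $text);

    my $parser = XML::Parser->new(Handlers => {
        Start => sub {
            my ($expat, $tag, %attr) = @_;
            if ($tag eq 'image') {
                # a new record starts with each <image>
                $rec = { number => $attr{number}, keywords => [] };
            }
            elsif ($wanted{$tag}) {
                $text = '';    # start collecting character data
            }
        },
        Char => sub {
            my ($expat, $str) = @_;
            # Char may be called several times per element, so append
            $text .= $str if defined $text;
        },
        End => sub {
            my ($expat, $tag) = @_;
            if ($tag eq 'image') {
                push @records, $rec;
                undef $rec;
            }
            elsif ($wanted{$tag} && defined $text) {
                $text =~ s/\s+/ /g;     # fold embedded newlines into spaces
                $text =~ s/^ | $//g;    # trim leading/trailing space
                if ($tag eq 'keyword') { push @{ $rec->{keywords} }, $text }
                else                   { $rec->{$tag} = $text }
                undef $text;            # stop collecting
            }
        },
    });

    $parser->parse($xml_text);
    return @records;
}

for my $r (parse_images($sample)) {
    print "$r->{number} | $r->{title} | $r->{date} | $r->{credit}\n";
    print "  keyword: $_\n" for @{ $r->{keywords} };
}
```

To read the real file instead of the sample string, XML::Parser also provides a parsefile method that takes a filename. Keeping the keywords in an array per record means it does not matter whether an image has 0 or 20 of them.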