Thank you very much for this script.

If English had been my native language, it would have been perfect.

I need the script to handle my native letters as well. I'm Swedish and
therefore use "åäöÅÄÖ"
in the articles. Is this possible?

Also, some words in the text are wrapped in <HIT></HIT> tags. Is it
possible to remove the HIT tags and show the word between them?

Raven

Chas Owens wrote:
> 
> Please, please, please, do not try to parse XML with regexps.  They only
> work in the simplest cases.  There are perfectly good XML modules
> designed to parse XML for you and they are not that hard to use.
> 
> The following code parses an XML file similar to the one you described,
> but has an additional tag (<ARTICLES></ARTICLES>) since XML must have
> one and only one root tag.  I added this tag because I assumed you have
> more than one article per file.  If this is true then the XML you
> described is not well formed.  However it would be a simple process to
> wrap this tag around the file before attempting to parse it.  If there
> is in fact only one article per file then remove the outer foreach and
> replace $articles->children with $xmlobj->children.
> 
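The wrapping step mentioned above could look something like the following.
This is only a sketch: the helper name wrap_articles is made up for
illustration, and it assumes the feed has no XML declaration (<?xml ...?>)
at the top, since a declaration would have to stay in front of the root tag.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the script below): wrap a run of
# <ART> elements in a single <ARTICLES> root so the text is well formed.
# Assumes the input does not start with an <?xml ...?> declaration.
sub wrap_articles {
    my ($xml) = @_;
    return "<ARTICLES>\n$xml\n</ARTICLES>\n";
}

# Tiny demonstration with two stub articles.
my $feed = "<ART><ORD>anbud</ORD></ART>\n<ART><ORD>anbud</ORD></ART>";
print wrap_articles($feed);
```

The wrapped string could then be handed to $parser->parse(...) rather than
calling parsefile on the raw file.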
> <code>
> #!/usr/bin/perl -w
> 
> use strict;
> use XML::Parser;       #parse XML into an internal format
> use XML::SimpleObject; #easy to use frontend to XML::Parser
> 
> if (@ARGV != 2) { die "Usage: $0 news.xml index.html" }
> 
> my $parser = new XML::Parser (ErrorContext => 2, Style => "Tree");
> my $xmlobj = new XML::SimpleObject ($parser->parsefile($ARGV[0]));
> 
> open HTML, ">$ARGV[1]" or die "Could not open $ARGV[1]:$!";
> select HTML;
> 
> print "
> <html>
> <head>
> <title>
> News Articles for " . localtime() .  "
> </title>
> </head>
> <body>
> <table>";
> 
> foreach my $articles ($xmlobj->children) { #get the top tag
>    foreach my $article ($articles->children) { #get all articles
>       my $file = $article->child('PUB')->value . '-' .
>                  $article->child('RUB')->value . '-' .
>                  $article->child('LEV')->value . '-' .
>                  $article->child('DAT')->value;
>       $file =~ s/[^\w.-]//g; #remove anything not alphanumeric, _, -, or .
>       open FH, ">$file" or die "Could not open $file:$!";
>       print FH $article->child('BRO')->value;
>       close FH;
>       print
> "<tr><td>", $article->child('ORD')->value, "</td></tr>\n",
> "<tr><td>", $article->child('LEV')->value, "</td></tr>\n",
> "<tr><td>", $article->child('DAT')->value, "</td></tr>\n",
> "<tr><td>", $article->child('PUB')->value, "</td></tr>\n",
> "<tr><td><a href=\"$file\">", $article->child('RUB')->value,
> "</a></td></tr>\n","<tr><td>", $article->child('INL')->value,
> "</td></tr>\n",
> "<tr><td></td></tr>";
>    }
> }
> 
> print "
> </table>
> </body>
> </html>";
> 
> close HTML;
> </code>
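If the feed marks search hits with inline <HIT> tags, as the INL element in
the sample article further down does, one low-tech option is to drop those
tags from the raw text before it reaches the parser. The sketch below
assumes HIT tags never carry attributes; strip_hit_tags is a name invented
here. A narrow, fixed-tag substitution like this is one of the simple cases
where a regexp is reasonably safe, as opposed to parsing the document
structure with one.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: remove <HIT> and </HIT> markers but keep the
# word between them. Assumes the tags have no attributes.
sub strip_hit_tags {
    my ($text) = @_;
    $text =~ s{</?HIT>}{}g;
    return $text;
}

my $inl = "the word <HIT>anbud</HIT> comes up";
print strip_hit_tags($inl), "\n";   # prints: the word anbud comes up
```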
> 
> On 19 Jun 2001 13:34:03 +0100, Nigel Wetters wrote:
> > I think I can give you some clues. Here's some code out of the Perl
> > Cookbook (6.8 Extracting a Range of Lines), which I've adapted for you.
> > You should be able to nest such structures to get what you want.
> >
> > my $extracted_lines = '';
> > while (<>) {
> >     if (/BEGIN PATTERN/ .. /END PATTERN/) {
> >         # line falls between BEGIN and END in the
> >         # text, inclusive
> >         $extracted_lines .= $_;
> >     } else {
> >         # now, we're outside the pattern
> >         process($extracted_lines) if $extracted_lines;
> >         $extracted_lines = '';
> >     }
> > }
> > sub process
> > {
> >     # do stuff with the extracted lines
> >     # maybe performing more regex's
> > }
> >
> > >>> Morgan <[EMAIL PROTECTED]> 06/19/01 01:12pm >>>
> > Hi
> >
> > I'm a newbie Perl developer and an XML rookie :(
> >
> > Is there anyone who can give me some hints or help me out with a problem
> > I have?
> >
> > Here is the problem.
> > I will receive news articles three times a day in XML format and I need
> > to automatically publish those articles on a web page. The first page
> > should show only the tags down to the </INL> tag, plus a link to the
> > whole article.
> >
> > Here is a sample of the xml format.
> >
> > <ART>
> > <ORD>anbud</ORD>
> > <LEV>2001-06-14</LEV>
> > <DAT>14-06-01</DAT>
> > <PUB>DAGENS INDUSTRI</PUB>
> > <RUB>Dragkamp om förlusttåg</RUB>
> > <INL>Here is the introduction to the article, and when the word
> > anbud comes up it is enclosed in <HIT>anbud</HIT> tags.
> > This is the word we use as the criterion for the articles we should receive.
> > </INL>
> > <BRO>
> > Here comes the rest of the document, that's the whole article.
> > The article ends with
> > </BRO>
> > </ART>
> >
> >
> > Raven
> >
> >
> >
> >
> --
> Today is Setting Orange, the 24th day of Confusion in the YOLD 3167
> Wibble.
