Thank you very much for this script.

If English were my native language, it would have been perfect.

I need the script to handle my native letters as well. I'm Swedish and
therefore use "åäöÅÄÖ"
in the articles. Is this possible?

Also, some words in the text are wrapped in <HIT></HIT> tags. Is it
possible to remove the HIT tags and show the word between them?

Raven
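
A minimal sketch of both changes, not tested against the real feed. It
assumes the text has already been decoded into Perl characters;
strip_hit and safe_name are hypothetical helper names, not part of the
quoted script. strip_hit would be applied to the raw file contents before
parsing, and safe_name would stand in for the s/[^\w.-]//g line:

```perl
use strict;
use warnings;
use utf8;   # this sketch contains literal Swedish letters in its source

# Hypothetical helper: drop the <HIT> and </HIT> tags, keep the word inside.
sub strip_hit {
    my ($text) = @_;
    $text =~ s{</?HIT>}{}g;
    return $text;
}

# Hypothetical helper: same idea as s/[^\w.-]//g, but åäöÅÄÖ are listed
# explicitly so they survive regardless of how \w treats them.
sub safe_name {
    my ($name) = @_;
    $name =~ s/[^\w.\-åäöÅÄÖ]//g;
    return $name;
}

binmode STDOUT, ':encoding(UTF-8)';
print strip_hit('the word <HIT>anbud</HIT> comes up'), "\n";
print safe_name('DAGENS INDUSTRI-Dragkamp om förlusttåg-2001-06-14'), "\n";
```

Stripping only the HIT tags is one of the simple cases where a regexp is
safe; the rest of the parsing would still go through the XML modules.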

Chas Owens wrote:
> 
> Please, please, please, do not try to parse XML with regexps.  They only
> work in the simplest cases.  There are perfectly good XML modules
> designed to parse XML for you and they are not that hard to use.
> 
> The following code parses an XML file similar to the one you described,
> but has an additional tag (<ARTICLES></ARTICLES>) since XML must have
> one and only one root tag.  I added this tag because I assumed you have
> more than one article per file.  If this is true then the XML you
> described is not well formed.  However it would be a simple process to
> wrap this tag around the file before attempting to parse it.  If there
> is in fact only one article per file then remove the outer foreach and
> replace $articles->children with $xmlobj->children.
> 
> <code>
> #!/usr/bin/perl -w
> 
> use strict;
> use XML::Parser;       #parse XML into an internal format
> use XML::SimpleObject; #easy to use frontend to XML::Parser
> 
> if (@ARGV != 2) { die "Usage: $0 news.xml index.html" }
> 
> my $parser = new XML::Parser (ErrorContext => 2, Style => "Tree");
> my $xmlobj = new XML::SimpleObject ($parser->parsefile($ARGV[0]));
> 
> open HTML, ">$ARGV[1]" or die "Could not open $ARGV[1]:$!";
> select HTML;
> 
> print "
> <html>
> <head>
> <title>
> News Articles for " . localtime() .  "
> </title>
> </head>
> <body>
> <table>";
> 
> foreach my $articles ($xmlobj->children) { #get the top tag
>    foreach my $article ($articles->children) { #get all articles
>       my $file = $article->child('PUB')->value . '-' .
>                  $article->child('RUB')->value . '-' .
>                  $article->child('LEV')->value . '-' .
>                  $article->child('DAT')->value;
>       $file =~ s/[^\w.-]//g; #remove anything not alphanumeric, _, -, or .
>       open FH, ">$file" or die "Could not open $file:$!";
>       print FH $article->child('BRO')->value;
>       close FH;
>       print
> "<tr><td>", $article->child('ORD')->value, "</td></tr>\n",
> "<tr><td>", $article->child('LEV')->value, "</td></tr>\n",
> "<tr><td>", $article->child('DAT')->value, "</td></tr>\n",
> "<tr><td>", $article->child('PUB')->value, "</td></tr>\n",
> "<tr><td><a href=\"$file\">", $article->child('RUB')->value,
> "</a></td></tr>\n","<tr><td>", $article->child('INL')->value,
> "</td></tr>\n",
> "<tr><td></td></tr>";
>    }
> }
> 
> print "
> </table>
> </body>
> </html>";
> 
> close HTML;
> </code>
> 
> On 19 Jun 2001 13:34:03 +0100, Nigel Wetters wrote:
> > I think I can give you some clues. Here's some code out of the Perl
> > Cookbook (6.8 Extracting a Range of Lines), which I've adapted for you.
> > You should be able to nest such structures to get what you want.
> >
> > my $extracted_lines = '';
> > while (<>) {
> >     if (/BEGIN PATTERN/ .. /END PATTERN/) {
> >         # line falls between BEGIN and END in the
> >         # text, inclusive
> >         $extracted_lines .= $_;
> >     } else {
> >         # now, we're outside the pattern
> >         process($extracted_lines) if $extracted_lines;
> >         $extracted_lines = '';
> >     }
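> > }
> > # flush a final range that runs to the end of the input
> > process($extracted_lines) if $extracted_lines;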
> > sub process
> > {
> >     # do stuff with the extracted lines
> >     # maybe performing more regexes
> > }
> >
> > >>> Morgan <[EMAIL PROTECTED]> 06/19/01 01:12pm >>>
> > Hi
> >
> > I'm a newbie Perl developer and a rookie at XML :(
> >
> > Is there anyone who can give me some hints or help me out with a problem
> > I have?
> >
> > Here is the problem.
> > I will receive news articles three times a day in XML format and I need to
> > automatically publish those articles on a web page. The first page
> > should show only the tags down to the </INL>
> > tag and a link to the whole page.
> >
> > Here is a sample of the XML format.
> >
> > <ART>
> > <ORD>anbud</ORD>
> > <LEV>2001-06-14</LEV>
> > <DAT>14-06-01</DAT>
> > <PUB>DAGENS INDUSTRI</PUB>
> > <RUB>Dragkamp om förlusttåg</RUB>
> > <INL>Here is the introduction about the article and when the word
> > anbud comes up it is enclosed in <HIT>anbud</HIT> tags.
> > This is the word we use as criteria on the articles we should receive.
> > </INL>
> > <BRO>
> > Here comes the rest of the document, that's the whole article.
> > The article ends with
> > </BRO>
> > </ART>
> >
> >
> > Raven
> >
> --
> Today is Setting Orange, the 24th day of Confusion in the YOLD 3167
> Wibble.
