Thank you very much for this script.
If English had been my native language, it would have been perfect.
I need the script to handle my native letters as well. I'm Swedish and
therefore use "åäöÅÄÖ"
in the articles. Is this possible?
Also, some words in the text are wrapped in <HIT></HIT> tags. Is it
possible to remove the HIT tags and show the word between them?
Raven
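
To make the HIT question concrete, here is a minimal sketch of what I mean. It assumes the HIT tags never carry attributes and are never nested, and it runs on the raw text before the XML parser sees it, because I have not checked how XML::SimpleObject's value method behaves when a tag contains both text and child tags. The name strip_hit_tags is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: drop the <HIT> and </HIT> markers but keep
# the word between them. Assumes plain, attribute-free HIT tags.
sub strip_hit_tags {
    my ($text) = @_;
    $text =~ s{</?HIT>}{}g;   # matches both the opening and closing tag
    return $text;
}

print strip_hit_tags("the word <HIT>anbud</HIT> comes up"), "\n";
# prints: the word anbud comes up
```

This only removes one fixed, known tag; the rest of the document would still go through the XML modules.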
Chas Owens wrote:
>
> Please, please, please, do not try to parse XML with regexps. They only
> work in the simplest cases. There are perfectly good XML modules
> designed to parse XML for you and they are not that hard to use.
>
> The following code parses an XML file similar to the one you described,
> but has an additional tag (<ARTICLES></ARTICLES>) since XML must have
> one and only one root tag. I added this tag because I thought you have
> more than one article per file. If this is true then the XML you
> described is not well formed. However it would be a simple process to
> wrap this tag around the file before attempting to parse it. If there
> is in fact only one article per file then remove the outer foreach and
> replace $articles->children with $xmlobj->children.
>
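
Wrapping the file as described above could look roughly like the sketch below. It also prepends an XML declaration, on the assumption that the feed is ISO-8859-1 (Latin-1), which would let the parser accept the Swedish letters. The helper name wrap_articles and the Latin-1 assumption are mine, not something the feed confirms.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: make a feed of bare <ART> blocks well formed.
# Assumes the feed is ISO-8859-1; any existing XML declaration is replaced.
sub wrap_articles {
    my ($raw) = @_;
    $raw =~ s/^\s*<\?xml[^>]*\?>\s*//;   # drop an existing declaration, if any
    return qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n}
         . "<ARTICLES>\n$raw\n</ARTICLES>\n";
}

print wrap_articles("<ART><ORD>anbud</ORD></ART>");
```

The wrapped string could then be handed to $parser->parse($wrapped) instead of calling parsefile on the original file. XML::Parser returns text as UTF-8, so the generated HTML would need to declare that charset or be converted back.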
> <code>
> #!/usr/bin/perl -w
>
> use strict;
> use XML::Parser; #parse XML into an internal format
> use XML::SimpleObject; #easy to use frontend to XML::Parser
>
> if (@ARGV != 2) { die "Usage: $0 news.xml index.html" }
>
> my $parser = new XML::Parser (ErrorContext => 2, Style => "Tree");
> my $xmlobj = new XML::SimpleObject ($parser->parsefile($ARGV[0]));
>
> open HTML, ">$ARGV[1]" or die "Could not open $ARGV[1]:$!";
> select HTML;
>
> print "
> <html>
> <head>
> <title>
> News Articles for " . localtime() . "
> </title>
> </head>
> <body>
> <table>";
>
> foreach my $articles ($xmlobj->children) { #get the top tag
>     foreach my $article ($articles->children) { #get all articles
>         my $file = $article->child('PUB')->value . '-' .
>                    $article->child('RUB')->value . '-' .
>                    $article->child('LEV')->value . '-' .
>                    $article->child('DAT')->value;
>         $file =~ s/[^\w.-]//g; #remove anything not alphanumeric, _, -, or .
>         open FH, ">$file" or die "Could not open $file:$!";
>         print FH $article->child('BRO')->value;
>         close FH;
>         print
>             "<tr><td>", $article->child('ORD')->value, "</td></tr>\n",
>             "<tr><td>", $article->child('LEV')->value, "</td></tr>\n",
>             "<tr><td>", $article->child('DAT')->value, "</td></tr>\n",
>             "<tr><td>", $article->child('PUB')->value, "</td></tr>\n",
>             "<tr><td><a href=\"$file\">", $article->child('RUB')->value,
>             "</a></td></tr>\n","<tr><td>", $article->child('INL')->value,
>             "</td></tr>\n",
>             "<tr><td></td></tr>";
>     }
> }
>
> print "
> </table>
> </body>
> </html>";
>
> close HTML;
> </code>
>
> On 19 Jun 2001 13:34:03 +0100, Nigel Wetters wrote:
> > I think I can give you some clues. Here's some code out of the Perl Cookbook (6.8
> > Extracting a Range of Lines), which I've adapted for you. You should be able to nest
> > such structures to get what you want.
> >
> > my $extracted_lines = '';
> > while (<>) {
> >     if (/BEGIN PATTERN/ .. /END PATTERN/) {
> >         # line falls between BEGIN and END in the
> >         # text, inclusive
> >         $extracted_lines .= $_;
> >     } else {
> >         # now, we're outside the pattern
> >         process($extracted_lines) if $extracted_lines;
> >         $extracted_lines = '';
> >     }
> > }
> > sub process
> > {
> >     # do stuff with the extracted lines
> >     # maybe performing more regexes
> > }
> >
> > >>> Morgan <[EMAIL PROTECTED]> 06/19/01 01:12pm >>>
> > Hi
> >
> > I'm a newbie Perl developer and a rookie at XML :(
> >
> > Is there anyone who can give me some hints or help me out with a problem
> > I have?
> >
> > Here is the problem.
> > I will receive news articles three times a day in XML format and I need to
> > automatically publish those articles on a web page. The first page
> > should only show the tags down to the </INL>
> > tag and a link to the whole page.
> >
> > Here is a sample of the xml format.
> >
> > <ART>
> > <ORD>anbud</ORD>
> > <LEV>2001-06-14</LEV>
> > <DAT>14-06-01</DAT>
> > <PUB>DAGENS INDUSTRI</PUB>
> > <RUB>Dragkamp om förlusttåg</RUB>
> > <INL>Here is the introduction to the article, and when the word
> > anbud comes up it is enclosed in <HIT>anbud</HIT> tags.
> > This is the word we use as the criterion for the articles we should receive.
> > </INL>
> > <BRO>
> > Here comes the rest of the document, that's the whole article.
> > The article ends with
> > </BRO>
> > </ART>
> >
> >
> > Raven
> >
> >
> >
> >
> --
> Today is Setting Orange, the 24th day of Confusion in the YOLD 3167
> Wibble.