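A quick suggestion is to put the headlines in a hash and check whether each
one is already there before processing it, i.e. something like the following:

if ( not exists $headlinelist{$headline} ) {
     $headlinelist{$headline} = 1;   # remember it so later copies are skipped
     do_your_stuff();
}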
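
To flesh that out for the first-two-words case, here is a minimal sketch
rather than a drop-in replacement. The names headline_key and dedupe_records
and the sample records are made up for illustration, and it assumes one
pipe-delimited record per line, as in the data file below. It also sidesteps
what looks like the cause of the lost first record in the script below: $found
is still undefined when the first line is tested, so ($found ne "") is false
for that line and it is never printed.

```perl
use strict;
use warnings;

# Hypothetical helper: build a case-insensitive key from the first two
# words of a headline (or fewer, if the headline is shorter).
sub headline_key {
    my ($headline) = @_;
    my @words = split ' ', $headline;
    my $last  = $#words < 1 ? $#words : 1;
    return lc join ' ', @words[0 .. $last];
}

# Hypothetical helper: keep only the first record for each headline key.
# Input is a list of "section|url|headline|summary" lines.
sub dedupe_records {
    my (@records) = @_;
    my %seen;    # keys already encountered
    my @kept;    # records to write out, in original order
    for my $record (@records) {
        chomp(my $line = $record);
        my ($section, $url, $headline, $summary) = split /\|/, $line, 4;
        next unless defined $headline;    # skip malformed lines
        my $key = headline_key($headline);
        next if $seen{$key}++;            # seen before: drop this copy
        push @kept, $line;
    }
    return @kept;
}

# Made-up sample records, most recent first.
my @sample = (
    "Sports|/rc/a.htm|NASCAR Still in Shock Over Loss|first copy\n",
    "Sports|/rc/b.htm|NASCAR Still in Shock Over Loss|second copy\n",
    "Sports|/rc/c.htm|Record Scoring Trend in Golf|golf story\n",
    "Sports|/rc/d.htm|Nascar still reeling|updated story\n",
);
print "$_\n" for dedupe_records(@sample);
```

With that sample, only the /rc/a.htm and /rc/c.htm records are printed. A hash
lookup also avoids looping over every earlier headline for each new line, and
because the key is compared as a plain string, punctuation in a headline is
never interpreted as part of a regular expression.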




                                                                                       
                                       
Gary Nielson <[EMAIL PROTECTED]>
Sent by: [EMAIL PROTECTED]
02/19/01 04:01 PM

To:      [EMAIL PROTECTED]
Subject: Help getting rid of duplicate entires in a datafile


I think I have written myself into a box (again!). I thought the logic of
the following made sense, but it isn't doing what I want it to do.

As some of you may know from reading this list, I have been writing a
script to process headlines by section. Thanks to the help of many of you,
the program is working great. But I found a new twist: duplicate
headlines. Sometimes the same headline comes over more than once with
DIFFERENT urls, or an updated story comes over that basically says the same
thing. I want to keep the first instance of a headline (the file is ordered
most recent first, so the one nearest the top is the newest) and delete all
later headlines whose first two words are the same as that first
instance. (Explained below.)

Well, I wrote a way to get rid of duplicates. My problem is the program is
getting rid of ALL instances of the headline rather than just the 2nd and
later instances of it. I don't know why.

Here is the script, in its entirety, with comments interspersed, and an
example data file. (I hesitate to post such a long email, but have been
told it's much more helpful to post the entire script):

#!/usr/bin/perl -w

use strict;
use diagnostics;
use vars qw(@A);
use FileHandle;

# edit config here

my $location_of_templates = "./tmp";
my $location_of_input_file = "./tmp";
my $location_of_output_file = "./tmp";
my $site_location = "/tmp";
my @sections = ('Nation','Business','Life','Sports');

# end edit config

# variables initialized

my $section;
my $outputfile;
my $field;
my @fields;
my $url;
my $headline;
my $summary;
my $filename;
my $old;
my $new;
my $bak;
my @array;
my $check_for;
my @allheadlines;
my $pattern;
my $found;

# begin program

foreach $section (@sections)
{
        # we're going to write the file in place, thanks to an example in
        # Perl Cookbook
        $old = "$location_of_output_file\/$section.output.txt";
        $new = "$location_of_output_file\/$section.output.txt.tmp.$$";
        $bak = "$location_of_output_file\/$section.output.txt.orig";
        open(OLD, "< $old")         or die "can't open $old: $!";
        open(NEW, "> $new")         or die "can't open $new: $!";
        @fields = <OLD>;
        foreach $field (@fields)
        {
                # split each line
                ($section,$url,$headline,$summary)=split('\|',$field);
                # i need to create another variable for headline because i
                # want to get just the first two words. sometimes
                # headlines come over with changes toward the end of the
                # headline, but the first two or three words are the
                # same. It's basically the same story.

                $pattern = $headline;
                # split the headline pattern up by spaces. I tried to
                # write a regular expression to do this, had trouble and
                # this at least works
                @array = split(' ',$pattern);
                # grab those first two words and put in a check for variable
                $check_for = "$array[0] $array[1]";
                # my understanding is the only way to check through an
                # array for a match is to loop through. This foreach loop
                # runs through an array called @allheadlines that I do not
                # push into until the end because I don't want to
                # initialize with the first headline and check for that
                # and then delete it. That would not be good to delete
                # first headlines in the lists.
                foreach (@allheadlines)
                {
                        # if there's a match on a line, set $found to
                        # empty
                        if ($_ =~ m/$check_for/i) {  print "Dup: $_\n"; $found = ""; }
                }
                # if headline not empty, print to the new file the fields
                if ($found ne "" ) { chomp($summary);    print NEW join(' |',$section,$url,$headline,$summary),"\n"; }
                $found = $headline;
                chomp($headline);
                # this is where I put the headline in the array for checking for
                # duplicate first two words
                push(@allheadlines, $headline);
        }
        close(OLD)                  or die "can't close $old: $!";
        close(NEW)                  or die "can't close $new: $!";
        rename($old, $bak)          or die "can't rename $old to $bak: $!";
        rename($new, $old)          or die "can't rename $new to $old: $!";

}

Given a file like the following (note that I purposely put several copies of
the same headline at the top and further down the list, too), I want the
first instance at the top of the file to be kept and all other instances
purged. What is happening is that ALL instances are being purged:

Sports|/rc/sports/docs/07201628.htm|NASCAR Still in Shock Over Loss of
Legendary Earnhardt|DAYTONA, Fla. (Reuters) - On the day after arguably
NASCAR's greatest driver was killed at the sport's premier event, Dale
Earnhardt's friends and colleagues continued to react to Sunday's tragic
event.
Sports|/rc/sports/docs/08201628.htm|NASCAR Still in Shock Over Loss of
Legendary Earnhardt|DAYTONA, Fla. (Reuters) - On the day after arguably
NASCAR's greatest driver was killed at the sport's premier event, Dale
Earnhardt's friends and colleagues continued to react to Sunday's tragic
event.
Sports|/rc/sports/docs/09201628.htm|NASCAR Still in Shock Over Loss of
Legendary Earnhardt|DAYTONA, Fla. (Reuters) - On the day after arguably
NASCAR's greatest driver was killed at the sport's premier event, Dale
Earnhardt's friends and colleagues continued to react to Sunday's tragic
event.
Sports|/rc/sports/docs/07201585.htm|Record Scoring Trend in Golf Traced to
Tiger|LA QUINTA, Calif. (Reuters) - Scoring records are falling left and
right on the U.S. PGA Tour during the early season, and some players trace
the phenomenon to the phenomenal Tiger Woods.
Sports|/rc/sports/docs/07201171.htm|Magic's Garrity Needs Knee
Surgery|ORLANDO, Fla. (Reuters) - Orlando Magic forward Pat Garrity is
expected to miss two to three weeks after knee surgery to be performed on
Tuesday.
Sports|/rc/sports/docs/07200113.htm|Cubs, Kerry Wood Agree on One-Year
Contract|MESA, Ariz. (Reuters) - The Chicago Cubs, hoping that Kerry Wood
will return to his rookie form of 1998, avoided arbitration with the prized
right-hander Monday by agreeing to terms on a one-year contract.
Sports|/rc/sports/docs/07200088.htm|Daytona 500 Posts Record Rating|NEW
YORK (AP) - Fox's coverage of Sunday's Daytona 500 race earned an 8.4
rating and 19 share, the highest overnight rating for the NASCAR race since
1986.
Sports|/rc/sports/docs/07199607.htm|Veerpalu Wins 30K at Nordic
Skiing|LAHTI, Finland (AP) - Estonia's Andrus Veerpalu edged Norway's Frode
Estil by two-tenths of a second Monday in a 30-kilometer race - one of the
closest finishes in the history of the Nordic world championships.
Sports|/rc/sports/docs/07199195.htm|Olympic Inspectors Visit China|BEIJING
(AP) - Marring Beijing's efforts to present its best face for Olympic
inspectors, police kept watch on jailed dissidents' families Monday and
environmentalists criticized officials for dyeing lawns green.
Sports|/rc/sports/docs/07199190.htm|Monday's Ski Report|LEBANON, N.H. (AP)
- Latest skiing conditions as supplied by SnoCountry Mountain Reports.
Skiing conditions are subject to change due to weather, skier traffic and
other factors. Be aware of changing conditions:<
Sports|/rc/sports/docs/07199003.htm|Veerpalu Wins 30K at Nordic
Skiing|LAHTI, Finland (AP) - Estonia's Andrus Veerpalu edged Norway's Frode
Estil by two-tenths of a second Monday in a 30-kilometer race - one of the
closest finishes in the history of the Nordic world championships.
Sports|/rc/sports/docs/07197082.htm|Miller Shoots Down Lakers As Pacers
Win|INDIANAPOLIS (Reuters) - Kobe Bryant and Shaquille O'Neal may be the
most powerful one-two punch in the NBA, but silky smooth Reggie Miller
remains the best clutch shooter and proved it Sunday in the Pacers'
exciting win over the Lakers.
Sports|/rc/sports/docs/09201628.htm|NASCAR Still in Shock Over Loss of
Legendary Earnhardt|DAYTONA, Fla. (Reuters) - On the day after arguably
NASCAR's greatest driver was killed at the sport's premier event, Dale
Earnhardt's friends and colleagues continued to react to Sunday's tragic
event.
Sports|/rc/sports/docs/07197077.htm|Miller Shoots Down Lakers As Pacers
Win|INDIANAPOLIS (Reuters) - Kobe Bryant and Shaquille O'Neal may be the
most powerful one-two punch in the NBA, but silky smooth Reggie Miller
remains the best clutch shooter and proved it Sunday in the Pacers'
exciting win over the Lakers.
Sports|/rc/sports/docs/07197072.htm|Holmstrom Lifts Red Wings Past
Stars|DALLAS (Reuters) - Tomas Holmstrom set up one power-play goal and
scored another as the Detroit Red Wings extended their unbeaten streak to
seven games with a 2-1 victory over the Dallas Stars.
Sports|/rc/sports/docs/07197002.htm|Szabo Breaks World Running
Record|BIRMINGHAM, England (AP) - Gabriela Szabo took almost a second off a
12-year-old world record and warned her rivals there's more to come.
Sports|/rc/sports/docs/07194908.htm|'The Intimidator' Dies Behind Wheel at
Daytona|DAYTONA BEACH, Fla. (Reuters) - Dale Earnhardt Sr., killed in a
crash on the last lap of the Daytona 500 on Sunday, dressed in black, drove
a black race car and was known as ``The Intimidator.''
Sports|/rc/sports/docs/07194750.htm|Earnhardt's Death Mars Michael
Waltrip's Daytona 500 Win|DAYTONA BEACH (Reuters) - On a day that saw
NASCAR suffer its biggest loss, Michael Waltrip collected his biggest win,
taking Sunday's 43rd Daytona 500.
Sports|/rc/sports/docs/07194489.htm|Lovellon Wins $200G at Santa
Anita|ARCADIA, Calif. (AP) - Lovellon rallied to the lead on the last turn
and beat Feverish in a neck-to-neck race to win Sunday's $200,000 Santa
Maria handicap at Santa Anita Park.
Sports|/rc/sports/docs/07194374.htm|Chicago Blanks Los Angeles, 3|CHICAGO
(Reuters) - Jocelyn Thibault made 19 saves for his fifth shutout of the
season as the Chicago Blackhawks posted a rare win over the Los Angeles
Kings, 3-0.
Sports|/rc/sports/docs/07194361.htm|Gilder Claims His First Senior
Tournament Win|LUTZ, Fla. (Reuters) - Bob Gilder, making just his third
official Senior Tour start, shot a four-under-par 67 Sunday to win the
Verizon Classic by three strokes at the TPC of Tampa Bay.
Sports|/rc/sports/docs/07194355.htm|Sugiyama Upset by Kandarr in Women's
Event|OKLAHOMA CITY (Reuters) - Sixth-seeded Ai Sugiyama of Japan was upset
by Jana Kandarr of Germany, 1-6, 7-5, 7-5, Sunday in the opening day of the
$170,000 IGA U.S. Indoor Championship.
Sports|/rc/sports/docs/01201628.htm|NASCAR Still in Shock Over Loss of
Legendary Earnhardt|DAYTONA, Fla. (Reuters) - On the day after arguably
NASCAR's greatest driver was killed at the sport's premier event, Dale
Earnhardt's friends and colleagues continued to react to Sunday's tragic
event.
Sports|/rc/sports/docs/07194141.htm|Clemson Upsets Top-Ranked UNC,
75-65|CLEMSON (Reuters) - The only thing harder to believe than North
Carolina losing to Clemson is the Tar Heels being done in by a Chapel Hill
native.
Sports|/rc/sports/docs/07193985.htm|Durant Sets PGA Tour Scoring Mark in
Hope Romp|LA QUINTA, Calif. (Reuters) - Joe Durant capped off a
record-breaking week of scoring in style Sunday by shooting a
seven-under-par 65 to win the 90-hole Bob Hope Classic with a PGA Tour
record total of 36-under-par 324.
Sports|/rc/sports/docs/07193992.htm|Stanford Returns to No. 1 in Latest
USA/ESPN Hoops Poll|ARLINGTON (Reuters) - Stanford returned to No. 1 in the
latest USA Today/ESPN college basketball poll, which was released Sunday
night.
Sports|/rc/sports/docs/07193988.htm|Clemson Upsets Top-Ranked UNC,
75-65|CLEMSON (Reuters) - The only thing harder to believe than North
Carolina losing to Clemson is the Tar Heels being done in by a Chapel Hill
native.
Sports|/rc/sports/docs/07193977.htm|Durant Breaks 90-Hole Scoring Record,
Wins Hope Classic|LA QUINTA (Reuters) - Joe Durant set tournament records
throughout the Bob Hope Classic and ultimately set a PGA Tour record
Sunday.
Sports|/rc/sports/docs/07193982.htm|Indiana Edges Los Angeles,
110-109|INDIANAPOLIS (Reuters) - There are many who believe Kobe Bryant is
the NBA's best player. However, Reggie Miller remains the best clutch
shooter.



--
Gary Nielson
[EMAIL PROTECTED]




_______________________________________________
Perl-Win32-Users mailing list
[EMAIL PROTECTED]
http://listserv.ActiveState.com/mailman/listinfo/perl-win32-users



