On Friday, 27 May 2005 at 13:56, Jack Daniels (Butch) wrote:
> Wow, I'm really confused. I'm trying to remove duplicate lines from a
> marc21 text file.  I have spent countless hours searching for scripts etc.
>
> What I find frustrating while trying to learn Perl, is that most solutions
> assume you know what to do.  For example, someone gives the code to find
> and replace, and that's it. In other words, if the complete script was
> there, I think I could learn much faster. I have no idea of how to put the
> code into a script.
>
> I did manage to find a few perl one liners but it removed the blank lines
> between the records, which must be retained in order to convert the file
> back to actual marc format before downloading into the database.
>
> It also removed non sequential lines if they were the same in another
> record.  They must also be kept as they are an important part of the file.
>
> Any help would be more than appreciated. Below is part of a very large
> file. Approx. 100,000 records need to be processed. For now, I just want to
> remove adjacent duplicate fields.
>
> =LDR  01548cam  2200397La 45{92}0
> =001  ocm42328427\
> =003  OCoLC
> =005  20010526091201.0
> =006  m\\\\\\\\u\\\\\\\\
> =007  cr\cn-
> =008  831108s1984\\\\inua\\\\sb\\\\001\0\eng\d
> =010  \\$z   83048636
> =035  \\1234 (sirsi)
> =035  \\1234 (sirsi)
> =040  \\$aN{dollar}T$cN{dollar}T$dOCL
> =020  \\$a0585000905 (electronic bk.)
> =020  \\$z0253366062
> =020  \\$z0253203252
> =050  14$aNX180.F4$bL38 1984eb
> =082  04$a700/.88042$219
> =049  [EMAIL PROTECTED]
> =100  1\$aLauter, Estella,$d1940-
> =245  10$aWomen as mythmakers$h[computer file] :$bpoetry and visual art by twentieth-century women /$cEstella Lauter.
> =260  \\$aBloomington :$bIndiana University Press,$cc1984.
> =300  \\$axvii, 267 p. :$bill. ;$c24 cm.
> =504  \\$aBibliography: p. 247-260.
> =500  \\$aIncludes index.
> =533  \\$aElectronic reproduction.$bBoulder, Colo. :$cNetLibrary,$d1999.$nAvailable via the World Wide Web.$nAvailable in multiple electronic file formats.$nAccess may be limited to NetLibrary affiliated libraries.
> =SUBJ  \0$aFeminism and the arts.
> =SUBJ  \0$aWomen artists.
> =SUBJ  \0$aWomen poets.
> =SUBJ  \0$aArt and mythology.
> =SUBJ  \0$aArts, Modern$y20th century.
> =655  \7$aElectronic books.$2local
> =710  2\$aNetLibrary, Inc.
> =776  1\$cOriginal$w(DLC)   83048636$w(OCoLC)10162146
> =856  4\$3Bibliographic record display$uhttp://www.netlibrary.com/urlapi.asp?action=summary&v=1&bookid=652$zAn electronic book accessible through the World Wide Web; click for information
> =994  \\$a92$bM7@
>
> =LDR  01470cam  2200349La 45{92}0
> =001  ocm42328450\
> =003  OCoLC
> =005  20010526091202.0
> =006  m\\\\\\\\u\\\\\\\\
> =007  cr\cn-
> =008  980609s1998\\\\couab\\\sbf\\\001\0\eng\d
> =010  \\$z   98026266
> =035  \\1234 (sirsi)
> =035  \\1234 (sirsi)
> =040  \\$aN{dollar}T$cN{dollar}T$dOCL
> =020  \\$a0585001413 (electronic bk.)
> =020  \\$z1555662307
> =050  14$aQB581$b.L66 1998eb
> =082  04$a523.3$221
> =049  [EMAIL PROTECTED]
> =100  1\$aLong, Kim.
> =245  14$aThe moon book$h[computer file] :$bfascinating facts about the magnificent, mysterious moon /$cKim Long ; science advisor, Larry Sessions.
> =250  \\$aRev. and expanded.
> =260  \\$aBoulder, Colo. :$bJohnson Books,$cc1998.
> =300  \\$a149 p. :$bill., maps ;$c22 cm.
> =500  \\$aIncludes 1 errata sheet.
> =504  \\$aIncludes bibliographical references (p. 132-133) and index.
> =533  \\$aElectronic reproduction.$bBoulder, Colo. :$cNetLibrary,$d1999.$nAvailable via the World Wide Web.$nAvailable in multiple electronic file formats.$nAccess may be limited to NetLibrary affiliated libraries.
> =651  \0$aMoon$vHandbooks, manuals, etc.
> =655  \7$aElectronic books.$2local
> =710  2\$aNetLibrary, Inc.
> =776  1\$cOriginal$w(DLC)   98026266$w(OCoLC)39299241
> =856  4\$3Bibliographic record display$uhttp://www.netlibrary.com/urlapi.asp?action=summary&v=1&bookid=140$zAn electronic book accessible through the World Wide Web; click for information
> =994  \\$a92$bM7@
> =994  \\$a92$bM7@

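OK, the script below does the following: whenever a line is identical to the
line directly before it, only the first copy is printed to stdout. Blank
lines between records are kept, and identical lines that are not adjacent
(for example the same field in two different records) are kept as well.

To run the script on an input file and write the result to an output file:
$ ./test10.pl inputfile > outputfile

If the script is not executable, either make it so with "chmod +x test10.pl"
or run it as "perl test10.pl inputfile > outputfile".

The script (save it as test10.pl or under any other name):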

[BEGIN] # not part of the script
#!/usr/bin/perl
use strict;
use warnings;

my $last; # the previous line (undefined before the first line is read)

while (<>) { # read the input file line by line until end of file
        # print the line unless it is identical to the previous line
        print unless defined $last && $_ eq $last;
        # remember the current line for the
        # comparison with the next line
        $last = $_;
}
[END] # not part of the script
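
One caveat: if the file ever contains two or more blank lines in a row, the
script above collapses those to one as well. Here is a minimal sketch of a
variant that only ever drops field lines, assuming (from your sample) that
every field line starts with "=". The sub name dedupe_fields and the sample
data are made up for illustration, so treat this as a starting point rather
than a finished tool:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Copy lines from $in to $out, skipping a field line (assumed to
# start with "=") when it is identical to the line just before it.
# dedupe_fields is a hypothetical name, not taken from any module.
sub dedupe_fields {
    my ($in, $out) = @_;
    my $last = '';    # previous line; empty before the first line
    while (my $line = <$in>) {
        # only field lines are ever dropped; blank lines always pass
        print {$out} $line unless $line =~ /^=/ && $line eq $last;
        $last = $line;
    }
}

# made-up sample: one repeated field and two blank lines in a row
my $sample = "=035  \\\\1234 (sirsi)\n=035  \\\\1234 (sirsi)\n\n\n=LDR  next\n";
my $result = '';
open my $in,  '<', \$sample or die "cannot read sample: $!";
open my $out, '>', \$result or die "cannot write result: $!";
dedupe_fields($in, $out);
close $out;
print $result;
```

To use it on a real file, call dedupe_fields(\*ARGV, \*STDOUT) instead of
opening the in-memory handles, and run the script the same way as test10.pl.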


I have tested the script; it removes the following lines (the first column
is the line number in the input file):

10: =035  \\1234 (sirsi)
45: =035  \\1234 (sirsi)
66: =994  \\$a92$bM7@
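
A list like this one can be produced by the script itself. The sketch below
assumes you want the removed lines written to a separate handle (STDERR would
be the usual choice in real use). The sub name dedupe_with_report and the
sample data are invented for illustration. Perl's built-in $. variable holds
the current input line number:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same adjacent-duplicate filter, but every dropped line is also
# written to $log, prefixed with its input line number.
# dedupe_with_report is a hypothetical name.
sub dedupe_with_report {
    my ($in, $out, $log) = @_;
    my $last;    # previous line; undefined before the first line
    while (my $line = <$in>) {
        if (defined $last && $line eq $last) {
            print {$log} $., ": ", $line;    # $. = input line number
        }
        else {
            print {$out} $line;
        }
        $last = $line;
    }
}

# made-up sample with two adjacent duplicates
my $sample = "=001  a\n=035  x\n=035  x\n=994  y\n=994  y\n";
my ($kept, $report) = ('', '');
open my $in,  '<', \$sample or die "cannot read sample: $!";
open my $out, '>', \$kept   or die "cannot write output: $!";
open my $log, '>', \$report or die "cannot write report: $!";
dedupe_with_report($in, $out, $log);
close $out;
close $log;
print $kept, "--- removed ---\n", $report;
```

For a real file, dedupe_with_report(\*ARGV, \*STDOUT, \*STDERR) keeps the
cleaned records on stdout and sends the report to stderr, so the two can be
redirected separately.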


I hope this helps; feel free to ask if you have further questions.

joe
