On Friday, 27 May 2005 13:56, Jack Daniels (Butch) wrote:
> Wow, I'm really confused. I'm trying to remove duplicate lines from a
> marc21 text file. I have spent countless hours searching for scripts etc.
>
> What I find frustrating while trying to learn Perl, is that most solutions
> assume you know what to do. For example, someone gives the code to find
> and replace, and that's it. In other words, if the complete script was
> there, I think I could learn much faster. I have no idea of how to put the
> code into a script.
>
> I did manage to find a few perl one liners but it removed the blank lines
> between the records, which must be retained in order to convert the file
> back to actual marc format before downloading into the database.
>
> It also removed non sequential lines if they were the same in another
> record. They must also be kept as they are an important part of the file.
>
> Any help would be more than appreciated. Below is part of a very large
> file. Approx 100,000 records need to be processed. For now, I just want to
> remove adjacent duplicate fields.
>
> =LDR 01548cam 2200397La 45{92}0
> =001 ocm42328427\
> =003 OCoLC
> =005 20010526091201.0
> =006 m\\\\\\\\u\\\\\\\\
> =007 cr\cn-
> =008 831108s1984\\\\inua\\\\sb\\\\001\0\eng\d
> =010 \\$z 83048636
> =035 \\1234 (sirsi)
> =035 \\1234 (sirsi)
> =040 \\$aN{dollar}T$cN{dollar}T$dOCL
> =020 \\$a0585000905 (electronic bk.)
> =020 \\$z0253366062
> =020 \\$z0253203252
> =050 14$aNX180.F4$bL38 1984eb
> =082 04$a700/.88042$219
> =049 [EMAIL PROTECTED]
> =100 1\$aLauter, Estella,$d1940-
> =245 10$aWomen as mythmakers$h[computer file] :$bpoetry and visual art by
> twentieth-century women /$cEstella Lauter.
> =260 \\$aBloomington :$bIndiana
> University Press,$cc1984.
> =300 \\$axvii, 267 p. :$bill. ;$c24 cm.
> =504 \\$aBibliography: p. 247-260.
> =500 \\$aIncludes index.
> =533 \\$aElectronic reproduction.$bBoulder, Colo.
> :$cNetLibrary,$d1999.$nAvailable via the World Wide Web.$nAvailable in
> multiple electronic file formats.$nAccess may be limited to NetLibrary
> affiliated libraries.
> =SUBJ \0$aFeminism and the arts.
> =SUBJ \0$aWomen artists.
> =SUBJ \0$aWomen poets.
> =SUBJ \0$aArt and mythology.
> =SUBJ \0$aArts, Modern$y20th century.
> =655 \7$aElectronic books.$2local
> =710 2\$aNetLibrary, Inc.
> =776 1\$cOriginal$w(DLC) 83048636$w(OCoLC)10162146
> =856 4\$3Bibliographic record
> display$uhttp://www.netlibrary.com/urlapi.asp?action=summary&v=1&bookid=652
> $zAn electronic book accessible through the World Wide Web; click for
> information
> =994 \\$a92$bM7@
>
> =LDR 01470cam 2200349La 45{92}0
> =001 ocm42328450\
> =003 OCoLC
> =005 20010526091202.0
> =006 m\\\\\\\\u\\\\\\\\
> =007 cr\cn-
> =008 980609s1998\\\\couab\\\sbf\\\001\0\eng\d
> =010 \\$z 98026266
> =035 \\1234 (sirsi)
> =035 \\1234 (sirsi)
> =040 \\$aN{dollar}T$cN{dollar}T$dOCL
> =020 \\$a0585001413 (electronic bk.)
> =020 \\$z1555662307
> =050 14$aQB581$b.L66 1998eb
> =082 04$a523.3$221
> =049 [EMAIL PROTECTED]
> =100 1\$aLong, Kim.
> =245 14$aThe moon book$h[computer file] :$bfascinating facts about the
> magnificent, mysterious moon /$cKim Long ; science advisor, Larry Sessions.
> =250 \\$aRev. and expanded.
> =260 \\$aBoulder, Colo. :$bJohnson Books,$cc1998.
> =300 \\$a149 p. :$bill., maps ;$c22 cm.
> =500 \\$aIncludes 1 errata sheet.
> =504 \\$aIncludes bibliographical references (p. 132-133) and index.
> =533 \\$aElectronic reproduction.$bBoulder, Colo.
> :$cNetLibrary,$d1999.$nAvailable via the World Wide Web.$nAvailable in
> multiple electronic file formats.$nAccess may be limited to NetLibrary
> affiliated libraries.
> =651 \0$aMoon$vHandbooks, manuals, etc.
> =655 \7$aElectronic books.$2local
> =710 2\$aNetLibrary, Inc.
> =776 1\$cOriginal$w(DLC) 98026266$w(OCoLC)39299241
> =856 4\$3Bibliographic record
> display$uhttp://www.netlibrary.com/urlapi.asp?action=summary&v=1&bookid=140
> $zAn electronic book accessible through the World Wide Web; click for
> information
> =994 \\$a92$bM7@
> =994 \\$a92$bM7@

OK, the following script does this: whenever the same line occurs several
times in a row, only one copy is printed to stdout. A single blank line
between records is kept, and identical lines that are not adjacent (for
example the same field in two different records) are kept as well.

To call the script with an input file and write the result to an output file:

$ ./test10.pl inputfile > outputfile

(If the script is not executable yet, either run "chmod +x test10.pl" first
or call it as "perl test10.pl inputfile > outputfile".)

The script (save it as test10.pl or whatever):

[BEGIN] # not part of the script
#!/usr/bin/perl
use strict;
use warnings;

my $last;    # the previous line (undefined before the first line)

while (<>) {    # read the input file line by line until end of file
    # print the line unless it is identical to the previous one
    print unless defined $last && $_ eq $last;

    # remember the current line for comparison with the next line
    $last = $_;
}
[END] # not part of the script

I have tested the script and it removes the following lines (line number
of the input in the first column):

10: =035 \\1234 (sirsi)
45: =035 \\1234 (sirsi)
66: =994 \\$a92$bM7@

HTH, and ask if you have further questions.

joe
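
P.S. If you want to see the comparison at work before running anything on the real file, here is a minimal sketch of the same idea applied to a few lines embedded in the program itself. The subroutine name remove_adjacent_duplicates and the two shortened sample records are made up for illustration; they are not part of the script above or of your data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch): takes a list of
# lines and returns it with every run of identical adjacent lines
# reduced to a single copy.
sub remove_adjacent_duplicates {
    my @lines = @_;
    my @kept;
    my $last;    # previous line; undefined before the first line

    for my $line (@lines) {
        push @kept, $line unless defined $last && $line eq $last;
        $last = $line;
    }
    return @kept;
}

# Made-up sample: two shortened records separated by one blank line.
# The quoted 'END' keeps backslashes, $ and @ exactly as typed.
my @input = split /^/, <<'END';
=001 rec1
=035 \\1234 (sirsi)
=035 \\1234 (sirsi)
=994 \\$a92$bM7@

=001 rec2
=035 \\1234 (sirsi)
=994 \\$a92$bM7@
=994 \\$a92$bM7@
END

my @output = remove_adjacent_duplicates(@input);
print @output;
```

Of the nine sample lines, seven are printed: the second =035 of the first record and the second =994 of the second record are dropped, while the blank line and the =035 of the second record stay, because neither is identical to the line directly before it.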