Re: appending unique entries to a text file in perl

Jeff Pinyan Fri, 04 May 2001 09:39:06 -0700
On May 4, Jason Cruces said:

>1. Open two files, one to be read from (call it file1)
>and another to be written to (file2)

Ok, you've got this down fine.

>2. Both files will contain a list of entries (some
>will be duplicates). If file1 contains an entry that
>is not in file2, append it to file2. If the entry does
>exist in file2, move on and check the next entry (in
>file1).

Whenever you hear "duplicate" or "unique", think of a hash.  With a hash,
you can check whether you've already seen a key with a single lookup,
instead of scanning an array element by element.
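
To make the difference concrete, here is a minimal sketch.  The data and the
names @entries, %seen, and in_array are made up for illustration; they are
not from your script.

```perl
use strict;
use warnings;

my @entries = qw(alpha beta gamma);

# Array: finding an entry means walking the list until it turns up.
sub in_array {
  my ($want, @list) = @_;
  for my $item (@list) {
    return 1 if $item eq $want;
  }
  return 0;
}

# Hash: one keyed lookup, however many entries there are.
my %seen = map { $_ => 1 } @entries;

print "beta is in the array\n"     if in_array("beta", @entries);
print "beta is in the hash\n"      if exists $seen{beta};
print "delta is not in the hash\n" unless exists $seen{delta};
```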

There's a Perl FAQ about removing duplicates (perldoc -q duplicate), but
it has to be modified for your purposes.

>open (FILE, ">>file2") || die "cannot open file2\n";
>open (OIDFILE, "file1") || die "cannot open file1\n";
>while (<OIDFILE>) {
>       $new_oid = $_;
>               while (<FILE>) {
>                       if (!/$new_oid/) {

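You can't read from a filehandle if you've opened it only for appending.
You'd need to open it with "+>> ..." instead of ">> ...", and then seek back
to the start of the file before reading, since an append handle is typically
positioned at the end.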
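
As a rough sketch of what read-and-append on one handle looks like (this
uses a throwaway temp file with made-up contents in place of file2, and
lexical filehandles rather than the barewords in your script):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Throwaway file standing in for file2 (hypothetical contents).
my ($tmp, $path) = tempfile(UNLINK => 1);
print $tmp "1.3.6.1\n";
close $tmp or die "can't close $path: $!";

# "+>>" opens the file for both reading and appending.
open my $fh, '+>>', $path or die "can't open $path: $!";

seek $fh, 0, 0 or die "can't rewind $path: $!";  # go to the start to read
my @existing = <$fh>;

seek $fh, 0, 2 or die "can't seek $path: $!";    # reposition before writing
print $fh "1.3.6.2\n";                           # lands at the end of the file
close $fh or die "can't close $path: $!";

# Read the file back to see the result.
open my $check, '<', $path or die "can't read $path: $!";
my @after = <$check>;
close $check;

print "before: ", scalar(@existing), " line(s), after: ", scalar(@after), "\n";
```

Note that the inner loop in your script would still hit end-of-file after
its first pass, so every later entry from file1 would be compared against
nothing unless you seek back each time.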

And here, you'd need to use /\Q$new_oid/, in case $new_oid has any special
regex characters.  But this isn't even good enough.  You'd need a regex
like

  /\A\Q$new_oid\E\z/

which matches EXACTLY the string in $new_oid and nothing else.  But that's
not what you should use a regex for.

A plain string comparison is the right tool here:

  if ($_ ne $new_oid) { ... }
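
Here is a small sketch of how the three regexes and the plain comparison
differ.  The value "1.3.6" and the strings it is tested against are invented
for the example.

```perl
use strict;
use warnings;

my $new_oid = "1.3.6";

# Unescaped: each "." matches any character, so this matches.
my $loose    = ("1x3y6"  =~ /$new_oid/)         ? 1 : 0;
# Escaped but unanchored: still matches as a substring.
my $unanch   = ("1.3.61" =~ /\Q$new_oid/)       ? 1 : 0;
# Escaped and anchored: only the exact string matches.
my $anchored = ("1.3.61" =~ /\A\Q$new_oid\E\z/) ? 1 : 0;
# Plain comparison: same answer as the anchored regex, without the regex.
my $equal    = ("1.3.61" eq $new_oid)           ? 1 : 0;

print "loose=$loose unanchored=$unanch anchored=$anchored eq=$equal\n";
```

This prints "loose=1 unanchored=1 anchored=0 eq=0": the first two regexes
report a match for strings that are not the entry you were looking for.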


>                               print FILE "$new_oid\n";
>                       }
>               }
>}
>close (FILE) || die "cannot close file2";
>close (OIDFILE) || die "cannot close file1";

But even with those fixes, the logic will STILL fail.

Here's why:

  @list = (1, 2, 3, 4, 5);
  @new = (6, 2);

  for $there (@list) {
    for $to_add (@new) {
      if ($to_add != $there) {
        # then $to_add must not be in @list, right?
        # wrong!
      }
    }
  }

You see, just because a SPECIFIC element isn't the one you're looking for,
that doesn't mean that none are.  To make sure something is NOT in a set,
you have to look through the entire set, or until you find it in there
already.
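
For comparison, a sketch of the same two lists searched correctly, with the
loops swapped so that each candidate is checked against the whole list
before a decision is made.  The $found flag and the @missing array are names
I've added for the example.

```perl
use strict;
use warnings;

my @list = (1, 2, 3, 4, 5);
my @new  = (6, 2);
my @missing;

for my $to_add (@new) {
  my $found = 0;
  for my $there (@list) {
    if ($to_add == $there) {
      $found = 1;
      last;          # found it, no need to look further
    }
  }
  # Only after the inner loop has finished do we know the answer.
  push @missing, $to_add unless $found;
}

print "not in \@list: @missing\n";
```

This prints "not in @list: 6".  It works, but it rescans @list for every
candidate, which is exactly the cost a hash avoids.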

SO...

You have two files.  You want to add entries FROM one file TO the other
file if they aren't already in that destination file.

I think the easiest way to go about this is to build up a hash of the
destination file.

  my %seen;

  open DEST, "< dest.txt" or die "can't read dest.txt: $!";
  while (<DEST>) { $seen{$_} = 1 }  # build the hash of seen lines
  close DEST;

Now, we open the file for appending, and open the source file.

  open DEST, ">> dest.txt" or die "can't append to dest.txt: $!";
  open SRC, "< source.txt" or die "can't read source.txt: $!";

Now, we go through SRC line by line, and append to DEST if the line from
SRC is *not* in the hash.  And, just in case SRC has duplicates in itself,
we'll add these to the hash as we find them.

  while (<SRC>) {
    if (not $seen{$_}) {
      $seen{$_} = 1;
      print DEST;  # that defaults to printing $_
    }
  }

Now we're done.  Close up shop.

  close SRC;
  close DEST;
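
If you'd like the steps above in one piece, here is a sketch that wraps them
in a sub.  The name append_unique is hypothetical, and I've used lexical
filehandles and three-argument open, which is a stylistic swap rather than a
change to the logic.

```perl
use strict;
use warnings;

# append_unique is a hypothetical name; it takes the source and
# destination file names and returns how many lines it appended.
sub append_unique {
  my ($src, $dest) = @_;
  my %seen;

  # Pass 1: remember every line already in the destination.
  open my $in_dest, '<', $dest or die "can't read $dest: $!";
  while (my $line = <$in_dest>) { $seen{$line} = 1 }
  close $in_dest;

  open my $out, '>>', $dest or die "can't append to $dest: $!";
  open my $in,  '<',  $src  or die "can't read $src: $!";

  # Pass 2: append source lines that have not been seen yet.
  my $added = 0;
  while (my $line = <$in>) {
    next if $seen{$line}++;   # already in dest, or seen earlier in src
    print $out $line;
    $added++;
  }

  close $in;
  close $out or die "can't close $dest: $!";
  return $added;
}
```

You would call it as append_unique("source.txt", "dest.txt").  Like the code
above, it compares whole lines including the newline, so a final line with
no trailing newline will not match an otherwise identical line that has one.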

If you're interested in seeing how you might write this code later in your
Perl career, as you learn idioms and such, I offer:

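  my %seen;

  # $in and $out hold the source and destination file names
  open IN, "< $in" or die "can't read $in: $!";
  open OUT, "+< $out" or die "can't read/write $out: $!";

  $seen{$_} = 1 while <OUT>;
  seek OUT, 0, 2;  # some systems need a seek between reading and writing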

  while (<IN>) { print OUT if not $seen{$_}++ }

  close OUT;
  close IN;

-- 
Jeff "japhy" Pinyan      [EMAIL PROTECTED]      http://www.pobox.com/~japhy/
Are you a Monk?  http://www.perlmonks.com/     http://forums.perlguru.com/
Perl Programmer at RiskMetrics Group, Inc.     http://www.riskmetrics.com/
Acacia Fraternity, Rensselaer Chapter.         Brother #734