If you have ample RAM, you can try using grep, e.g.

open(SMALL, "<smallfile") or die "smallfile: $!";
open(LARGE, "<largefile") or die "largefile: $!";
open(UNIQ,  ">newfile")   or die "newfile: $!";
while (<LARGE>) {
        $lin = $_;
        # keep the line only if it matches nothing in the small file
        print UNIQ $lin if !grep { $_ eq $lin } <SMALL>;
        seek(SMALL, 0, 0);   # rewind the small file for the next pass
        }

>From: "Will W" <[EMAIL PROTECTED]>
>Reply-To: "Will W" <[EMAIL PROTECTED]>
>To: "Steve Whittle" <[EMAIL PROTECTED]>, 
><[EMAIL PROTECTED]>
>Subject: Re: Comparing two files
>Date: Sat, 9 Jun 2001 07:35:05 -0700
>
>If your system's memory is large enough to hold the smaller dataset,
>then as others have said, working with hashes is the way to go:
>
>     read all of small dataset into hash
>     while another record in large dataset
>         if key for record exists in hash
>             delete hash{key}       << result is thus an XOR of keys
>         else
>              add record to hash
>     write hash as the output file
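
The pseudocode above can be sketched in Perl roughly as follows; this is a minimal illustration using in-memory lists in place of the two files (the data and variable names are hypothetical, not from the original posts):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-ins for the small and large datasets.
my @small = qw(black red);
my @large = qw(red green blue black grey);

# Read all of the small dataset into a hash.
my %seen;
$seen{$_} = 1 for @small;

# For each record in the large dataset: delete the key if it is
# already present, otherwise add it -- leaving an XOR of the keys.
for my $key (@large) {
    if (exists $seen{$key}) {
        delete $seen{$key};
    } else {
        $seen{$key} = 1;
    }
}

# "Write hash as the output file" -- printed here for illustration.
print "$_\n" for sort keys %seen;
```

Because each record is a single hash lookup, this makes one pass over each file instead of re-scanning one file per record.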
>
>If the amount of satellite data is too great for this approach to work,
>I would use a two-stage approach: read only the key fields, using
>"file1" and "file2" as the appropriate hash values, then build the
>output file by looking up each record's key in the hash. There may be
>more efficient ways to do this, but I like to keep my thoughts simple.
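
One way the two-stage idea might look in Perl (a sketch under stated assumptions: comma-delimited records with the key first, and in-memory arrays standing in for the two files; all names are hypothetical):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-ins for the two files; key is the first comma-separated field.
my @file1 = ("red,10", "green,20", "blue,30", "black,40", "grey,50");
my @file2 = ("black,99", "red,88");

# Stage 1: hash only the keys, tagging each with its source file;
# a key seen in both files is marked "both".
my %source;
for my $rec (@file1) {
    my ($key) = split /,/, $rec;
    $source{$key} = exists $source{$key} ? "both" : "file1";
}
for my $rec (@file2) {
    my ($key) = split /,/, $rec;
    $source{$key} = exists $source{$key} ? "both" : "file2";
}

# Stage 2: re-read the records and keep the full record (with its
# satellite data) only when its key appeared in exactly one file.
my @out;
for my $rec (@file1, @file2) {
    my ($key) = split /,/, $rec;
    push @out, $rec if $source{$key} ne "both";
}
print "$_\n" for @out;
```

Only the keys are held in memory during stage 1, so the satellite data never has to fit in RAM at once.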
>
>Will
>
>
>
>
>----- Original Message -----
>From: Steve Whittle <[EMAIL PROTECTED]>
>To: <[EMAIL PROTECTED]>
>Sent: Wednesday, June 06, 2001 8:46 AM
>Subject: Comparing two files
>
>
> > Hi,
> >
> > I'm trying to write a script that removes duplicates between two files
>and
> > writes the unique values to a new file. For example, have one file
>with the
> > following file 1:
> >
> > red
> > green
> > blue
> > black
> > grey
> >
> > and another file 2:
> >
> > black
> > red
> >
> > and I want to create a new file that contains:
> >
> > green
> > blue
> > grey
> >
> > I have written a script that takes each entry in file 1 and then reads
> > through file 2 to see if it exists there, if not, it writes it to a
>new
> > file. If there is a duplicate, nothing is written to the new file. The
>real
> > file 1 I'm dealing with has more than 2 million rows and the real file
>2 has
> > more than 100,000 rows so I don't think my method is very efficient.
>I've
> > looked through the web and perl references and can't find an easier
>way. Am
> > I missing something? Any ideas?
> >
> > Thanks,
> >
> > Steve Whittle
> >
> >
>
