Mohit_jain01 wrote:
>
> > From: Rob Dixon
> >
> > Mohit_jain01 wrote:
> > >
> > > I am facing a problem with text file manipulation with Perl.
> > >
> > > I have a file with over 2 lac lines of data.
> > > I need to find the duplicates(strings) in the file and copy those records into 
> > > another file.
> > >
> > > Is there a function/module  in Perl by which I can read the duplicates in a file 
> > > at one go and print them
> > > on to another file.
> >
> > Before we can help you we need to know a little more of your problem.
> >
> > Are you looking for duplicate lines in the file, or duplicate strings defined
> > in some other way? How big is the file you want to read (how many lines
> > or strings do you want to compare)?
> >
> > There are modules which will help you write your program, but exactly
> > how you go about it depends on the details of your problem.
>
> I have a big file containing about 200000 lines. This file basically contains some 
> records. A sample of the file is as given
> below:
>
> dn: cn=1148734,ou=Employees,dc=jci,dc=com
>
> displayname: Herek, Moriah L
>
> jdirlastfourssn: 2888
>
> dn: cn=1148735,ou=Employees,dc=jci,dc=com
>
> displayname: Pelletier, Michael J
>
> jdirlastfourssn: 8719
>
> uid: cpellem
>
[snip data]
>
> What I need to do is:
>
> 1. Take the first entry and get the value of the display name and jdirlastfourssn 
> attribute.
>
> 2. Check whether there is another record with the same display name attribute value.
>        (There could be multiple records)
>
>  3. If so then extract both record and write them into another file.
>
>  4. Delete these duplicate records from the parent file.
>
> 5. Do that for all records.
>

I'm not clear whether you mean 200K lines or 200K records (which seem to be mostly
6 lines each, except for 'Pelletier, Michael J', which has an additional pair of
lines for a 'uid' value). But even if it were 200K records at about 100 characters
each, this would be 20MB, which is well within the capacity of all but the smallest
computers these days. This problem is far easier with all the data in memory, so
I'll go that way, and if you find it's not working or is too slow we'll think again.

OK, so let me rewrite your algorithm a little.

- Read all records into memory
- While there are records left
-   calculate a 'key' from the display name and serial number
-   find all records in the data with a matching key
-   if there was only one then print it to parent file, else print them to file 2
-   delete them from the list

From the top:


"Read in all of the records"

It looks like all the information starts with a line beginning with 'dn:'. If this
is wrong we'll have to change it.

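  my @records;

  while (<DATA>) {
    push @records, '' if /^dn:\s+/;   # a 'dn:' line starts a new record
    next unless @records;             # skip anything before the first 'dn:'
    $records[-1] .= $_;               # append the line to the current record
  }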
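
If you want to try the reading step on its own before pointing it at the real
file, here is a minimal sketch. It assumes a few hand-typed lines in place of
your data (without the blank lines) and reads them through an in-memory
filehandle, $fh, rather than DATA. It also skips any lines that come before the
first 'dn:'.

```perl
use strict;
use warnings;

# A few sample lines standing in for the real file (typed in for illustration).
my $text = <<'END';
dn: cn=1148734,ou=Employees,dc=jci,dc=com
displayname: Herek, Moriah L
jdirlastfourssn: 2888
dn: cn=1148735,ou=Employees,dc=jci,dc=com
displayname: Pelletier, Michael J
jdirlastfourssn: 8719
uid: cpellem
END

# Read from the string as though it were a file.
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my @records;
while (<$fh>) {
  push @records, '' if /^dn:\s+/;   # a 'dn:' line starts a new record
  next unless @records;             # ignore anything before the first 'dn:'
  $records[-1] .= $_;               # append the line to the current record
}
close $fh;

print scalar(@records), " records read\n";
```

Each element of @records is then one complete multi-line record, 'uid' line
included where there is one.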


Before the loop, how about a subroutine which, given one of the multi-line records,
will return a key value containing the name and serial number. This picks out the
strings and concatenates them with a tab character in between, chosen because it
is unlikely to appear in the data itself.

  sub keyval {
    my $rec = shift;
    my ($name, $sn);
    ($name) = $rec =~ m/^displayname:\s+(.+)/m;
    ($sn) = $rec =~ m/^jdirlastfourssn:\s+(\d+)/m;
    join "\t", $name, $sn;
  }
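
To see what the key looks like, here is a small sketch that runs keyval on the
second record from your sample. The record text is typed in by hand and I have
left out the blank lines, so treat it as an illustration only.

```perl
use strict;
use warnings;

# Same subroutine as above: build a key from the name and the four-digit number.
sub keyval {
  my $rec = shift;
  my ($name) = $rec =~ m/^displayname:\s+(.+)/m;
  my ($sn)   = $rec =~ m/^jdirlastfourssn:\s+(\d+)/m;
  join "\t", $name, $sn;
}

# One multi-line record, copied from the sample data.
my $rec = join '',
  "dn: cn=1148735,ou=Employees,dc=jci,dc=com\n",
  "displayname: Pelletier, Michael J\n",
  "jdirlastfourssn: 8719\n",
  "uid: cpellem\n";

my $key = keyval($rec);
print "$key\n";    # the name and the number, separated by a tab
```

Note that a record missing either line would leave $name or $sn undefined, and
with warnings on the join will complain about it.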


"While there are records left" Here's the loop, including a couple of lines to
remove all the deleted (non-existent) entries from the beginning, which will be
left by the call to 'delete' that you see in a moment.

  while (@records) {
    until (exists $records[0]) {
      shift @records;
    }
    # ... the remaining steps go here ...
  }
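
In case the behaviour of 'delete' on an array is unfamiliar, here is a tiny
sketch with three made-up string elements. Deleting from the front or middle
leaves holes rather than closing the array up, which is why the 'shift' is
needed. (Current Perl documentation discourages 'delete' and 'exists' on array
elements, so take this as a demonstration of the idea rather than a
recommendation.)

```perl
use strict;
use warnings;

my @records = ('first', 'second', 'third');

# Delete the first two elements. The array keeps its length of three,
# with two non-existent slots at the front.
delete @records[0, 1];
my $length_after_delete = @records;

# Discard the non-existent slots at the front.
shift @records until exists $records[0];

print "length after delete: $length_after_delete\n";
print "remaining: @records\n";
```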


"Calculate a 'key' from the display name and serial number"

  my $key = keyval($records[0]);

"Find all records in the data with a matching key" This call to 'map'
returns a list of indices of all the records in the array which have
a matching key value. This is bound to include the index zero as the
first record matches itself. If there are more, then the length of the
array will be more than 1.

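  my $i = 0;
  my @slice = map {
    my @i = $i++;    # index of the current record, as a one-element list
    # '&&' binds tighter than '?:', so a deleted (undefined) record yields ()
    defined $_ && keyval($_) eq $key ? @i : ()
  } @records;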
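
If the 'map' with a counter looks opaque, the same list of indices can also be
had with 'grep' over the index range. A sketch, assuming three cut-down records
(the 'cn' values are made up) with the first and third sharing a key:

```perl
use strict;
use warnings;

sub keyval {
  my $rec = shift;
  my ($name) = $rec =~ m/^displayname:\s+(.+)/m;
  my ($sn)   = $rec =~ m/^jdirlastfourssn:\s+(\d+)/m;
  join "\t", $name, $sn;
}

# Made-up records for illustration only.
my @records = (
  "dn: cn=1\ndisplayname: Herek, Moriah L\njdirlastfourssn: 2888\n",
  "dn: cn=2\ndisplayname: Pelletier, Michael J\njdirlastfourssn: 8719\n",
  "dn: cn=3\ndisplayname: Herek, Moriah L\njdirlastfourssn: 2888\n",
);

my $key = keyval($records[0]);

# Indices of every record that still exists and has the same key as the first.
my @slice = grep {
  defined $records[$_] and keyval($records[$_]) eq $key
} 0 .. $#records;

print "@slice\n";    # prints: 0 2
```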


The next two steps together: "If there was only one then print it to
parent file, else print them to file 2. Delete them from the list"
The 'delete' function usefully returns a list of all the records
it deleted, so we can just print the results of deleting the array
slice.

  if (@slice == 1) {
    print PARENT delete @records[@slice];
  }
  else {
    print FILE2 delete @records[@slice];
  }


And you're done. Clearly you need to open filehandles DATA for read
and PARENT and FILE2 for write, but the program's there otherwise.
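
For what it's worth, here is one way the pieces might fit together. It is only a
sketch: split_duplicates is a name I have made up, it takes lexical filehandles
instead of the bareword DATA, PARENT and FILE2, it finds the indices with 'grep'
rather than 'map', and the demonstration at the bottom uses in-memory handles
and three hand-typed records (the third 'cn' is invented) so that it can be run
without touching any files.

```perl
use strict;
use warnings;

sub keyval {
  my $rec = shift;
  my ($name) = $rec =~ m/^displayname:\s+(.+)/m;
  my ($sn)   = $rec =~ m/^jdirlastfourssn:\s+(\d+)/m;
  join "\t", $name, $sn;
}

# Hypothetical wrapper: reads records from $in, prints records whose key is
# unique to $parent and records whose key occurs more than once to $file2.
sub split_duplicates {
  my ($in, $parent, $file2) = @_;

  # Read all records into memory.
  my @records;
  while (<$in>) {
    push @records, '' if /^dn:\s+/;
    next unless @records;
    $records[-1] .= $_;
  }

  while (@records) {
    # Drop the holes left at the front by earlier deletes.
    shift @records until exists $records[0];

    my $key   = keyval($records[0]);
    my @slice = grep {
      defined $records[$_] and keyval($records[$_]) eq $key
    } 0 .. $#records;

    # 'delete' returns the records it removed, so print those.
    my $out = @slice == 1 ? $parent : $file2;
    print {$out} delete @records[@slice];
  }
}

# Demonstration on in-memory data.
my $input = <<'END';
dn: cn=1148734,ou=Employees,dc=jci,dc=com
displayname: Herek, Moriah L
jdirlastfourssn: 2888
dn: cn=1148735,ou=Employees,dc=jci,dc=com
displayname: Pelletier, Michael J
jdirlastfourssn: 8719
uid: cpellem
dn: cn=9999999,ou=Employees,dc=jci,dc=com
displayname: Herek, Moriah L
jdirlastfourssn: 2888
END

open my $in,     '<', \$input          or die "Cannot read input: $!";
open my $parent, '>', \my $unique_out  or die "Cannot open output: $!";
open my $file2,  '>', \my $dup_out     or die "Cannot open output: $!";

split_duplicates($in, $parent, $file2);
close $_ for $in, $parent, $file2;

print "Unique records:\n$unique_out\nDuplicate records:\n$dup_out";
```

For real use you would open the three handles on files instead, for example
open my $in, '<', 'employees.ldif' or die ..., where the file name is whatever
yours happens to be.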

Let us know if you need help assembling the kit.
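
If rescanning the whole array for every key does turn out to be too slow on the
full file, one alternative (a different technique from the loop above, not a
refinement of it) is to group the records by key in a hash in a single pass. A
minimal sketch, again with made-up cut-down records, and with @unique and @dups
standing in for the two output files:

```perl
use strict;
use warnings;

sub keyval {
  my $rec = shift;
  my ($name) = $rec =~ m/^displayname:\s+(.+)/m;
  my ($sn)   = $rec =~ m/^jdirlastfourssn:\s+(\d+)/m;
  join "\t", $name, $sn;
}

# Made-up records for illustration only.
my @records = (
  "dn: cn=1\ndisplayname: Herek, Moriah L\njdirlastfourssn: 2888\n",
  "dn: cn=2\ndisplayname: Pelletier, Michael J\njdirlastfourssn: 8719\n",
  "dn: cn=3\ndisplayname: Herek, Moriah L\njdirlastfourssn: 2888\n",
);

# First pass: group record indices by key, remembering first-seen key order.
my (%group, @order);
for my $i (0 .. $#records) {
  my $key = keyval($records[$i]);
  push @order, $key unless $group{$key};
  push @{ $group{$key} }, $i;
}

# Second pass: a key seen once is unique, anything else is a set of duplicates.
my (@unique, @dups);
for my $key (@order) {
  my @idx = @{ $group{$key} };
  if (@idx == 1) { push @unique, @records[@idx] }
  else           { push @dups,   @records[@idx] }
}

print "unique: ", scalar(@unique), ", duplicated: ", scalar(@dups), "\n";
```

Each record's key is computed only once this way, at the cost of one extra hash
holding the indices.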

HTH,

Rob



