Re: Help required.....about string/text manipulation

Chinku Simon Sat, 14 Jun 2003 21:47:37 -0700

Hi,

I would like some help assembling the kit.


Thanks in Advance

--- Rob Dixon <[EMAIL PROTECTED]> wrote:
> Mohit_jain01 wrote:
> >
> > > From: Rob Dixon
> > >
> > > Mohit_jain01 wrote:
> > > >
> > > > I am facing a problem with text file manipulation with Perl.
> > > >
> > > > I have a file with over 2 lakh (200,000) lines of data.
> > > > I need to find the duplicate strings in the file and copy those
> > > > records into another file.
> > > >
> > > > Is there a function or module in Perl with which I can read the
> > > > duplicates in a file in one go and print them to another file?
> > >
> > > Before we can help you we need to know a little more of your problem.
> > >
> > > Are you looking for duplicate lines in the file, or duplicate strings defined
> > > in some other way? How big is the file you want to read (how many lines
> > > or strings do you want to compare)?
> > >
> > > There are modules which will help you write your program, but exactly
> > > how you go about it depends on the details of your problem.
> >
> > I have a big file containing about 200000 lines. This file basically
> > contains some records. A sample of the file is as given below:
> >
> > dn: cn=1148734,ou=Employees,dc=jci,dc=com
> >
> > displayname: Herek, Moriah L
> >
> > jdirlastfourssn: 2888
> >
> > dn: cn=1148735,ou=Employees,dc=jci,dc=com
> >
> > displayname: Pelletier, Michael J
> >
> > jdirlastfourssn: 8719
> >
> > uid: cpellem
> >
> [snip data]
> >
> > What I need to do is:
> >
> > 1. Take the first entry and get the values of the display name and
> >    jdirlastfourssn attributes.
> >
> > 2. Check whether there is another record with the same display name
> >    attribute value. (There could be multiple records.)
> >
> > 3. If so, extract both records and write them into another file.
> >
> > 4. Delete these duplicate records from the parent file.
> >
> > 5. Do that for all records.
> >
> 
> I'm not clear whether you mean 200K lines or 200K records (which seem to be mostly
> 6 lines each, except for 'Pelletier, Michael J', which has an additional pair of
> lines for a 'uid' value). But even if it were 200K records at about 100 characters
> each, this would be 20MB, which is well within the capacity of all but the smallest
> computers these days. This problem is far easier with all the data in memory, so
> I'll go that way, and if you find it's not working or is too slow we'll think again.
> 
> OK, so let me rewrite your algorithm a little.
> 
> - Read all records into memory
> - While there are records left
> -   calculate a 'key' from the display name and serial number
> -   find all records in the data with a matching key
> -   if there was only one then print it to parent file, else print them to file 2
> -   delete them from the list
> 
> From the top:
> 
> 
> "Read in all of the records"
> 
> It looks like all the information starts with a line beginning with 'dn:'. If this
> is wrong we'll have to change it.
> 
>   my @records;
> 
>   while (<DATA>) {
>     push @records, '' if /^dn:\s+/;
>     $records[-1] .= $_;
>   }
> 
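A minimal sketch of the reading step that runs on its own. It assumes the sample layout shown earlier in the thread and reads from an in-memory filehandle instead of DATA; the name $sample and the guard for lines before the first 'dn:' are additions for illustration, not part of Rob's code.

```perl
use strict;
use warnings;

# Sample data copied from the message; a real run would open the file instead.
my $sample = <<'END';
dn: cn=1148734,ou=Employees,dc=jci,dc=com

displayname: Herek, Moriah L

jdirlastfourssn: 2888

dn: cn=1148735,ou=Employees,dc=jci,dc=com

displayname: Pelletier, Michael J

jdirlastfourssn: 8719

uid: cpellem

END

# Read from an in-memory filehandle so the sketch needs no files on disk.
open my $in, '<', \$sample or die "Cannot open sample data: $!";

my @records;
while (my $line = <$in>) {
    # A line starting with 'dn:' begins a new record.
    push @records, '' if $line =~ /^dn:\s+/;
    # Ignore anything that appears before the first 'dn:' line.
    next unless @records;
    $records[-1] .= $line;
}
close $in;

printf "%d records read\n", scalar @records;
```

Each element of @records then holds one complete multi-line record, including its blank lines.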
> 
> Before the loop, how about a subroutine which, given one of the multi-line records,
> will return a key value containing the name and serial number. This picks out the
> strings and concatenates them with a tab character in between, chosen because it
> is unlikely to appear in the data itself.
> 
>   sub keyval {
>     my $rec = shift;
>     my ($name, $sn);
>     ($name) = $rec =~ m/^displayname:\s+(.+)/m;
>     ($sn) = $rec =~ m/^jdirlastfourssn:\s+(\d+)/m;
>     join "\t", $name, $sn;
>   }
> 
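As a quick check of what the key looks like, here is the same subroutine run against one record. This is a sketch: the '//' fallbacks (which need Perl 5.10 or later) are an addition so that a record missing either attribute does not trigger undef warnings, and $record and $shown are illustrative names.

```perl
use strict;
use warnings;

sub keyval {
    my ($rec) = @_;
    my ($name) = $rec =~ /^displayname:\s+(.+)/m;
    my ($sn)   = $rec =~ /^jdirlastfourssn:\s+(\d+)/m;
    # Fall back to empty strings when an attribute is missing (an assumption,
    # not part of the original subroutine).
    return join "\t", $name // '', $sn // '';
}

my $record = "dn: cn=1148734,ou=Employees,dc=jci,dc=com\n\n"
           . "displayname: Herek, Moriah L\n\n"
           . "jdirlastfourssn: 2888\n\n";

my $key = keyval($record);

# Make the tab visible for display purposes only.
(my $shown = $key) =~ s/\t/<TAB>/;
print "$shown\n";    # prints "Herek, Moriah L<TAB>2888"
```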
> 
> "While there are records left" Here's the loop, including a couple of lines to
> remove from the beginning of the array any non-existent entries left by the call
> to 'delete' that you see in a moment.
> 
>   while (@records) {
>     until (exists $records[0]) {
>       shift @records;
>     }
>     # ... the steps described below go here ...
>   }
> 
> 
> "Calculate a 'key' from the display name and serial number"
> 
>   my $key = keyval($records[0]);
> 
> "Find all records in the data with a matching key" This call to 'map'
> returns a list of indices of all the records in the array which have
> a matching key value. This is bound to include index zero, as the
> first record matches itself. If there are more matches, the length of
> the list will be more than 1.
> 
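>   my $i = 0;
>   my @slice = map {
>     my @i = $i++;
>     # '&&' rather than 'and': 'and' binds more loosely than '?:', so a
>     # deleted (undefined) element would add a false value to the list
>     defined $_ && keyval($_) eq $key ? @i : ()
>   } @records;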
> 
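The index search can also be written as a 'grep' over the index range, which some may find easier to read than the counter inside 'map'. A sketch with three made-up records (the names and numbers are invented for illustration); keyval is the subroutine from the message with '//' fallbacks added.

```perl
use strict;
use warnings;

sub keyval {
    my ($rec) = @_;
    my ($name) = $rec =~ /^displayname:\s+(.+)/m;
    my ($sn)   = $rec =~ /^jdirlastfourssn:\s+(\d+)/m;
    return join "\t", $name // '', $sn // '';
}

# Three invented records; the first and third share a key.
my @records = (
    "dn: cn=1\ndisplayname: Doe, Jane\njdirlastfourssn: 1111\n",
    "dn: cn=2\ndisplayname: Roe, Rick\njdirlastfourssn: 2222\n",
    "dn: cn=3\ndisplayname: Doe, Jane\njdirlastfourssn: 1111\n",
);

my $key = keyval($records[0]);

# Indices of every record that is still present and has the same key.
my @slice = grep {
    defined $records[$_] && keyval($records[$_]) eq $key
} 0 .. $#records;

print "matching indices: @slice\n";    # prints "matching indices: 0 2"
```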
> 
> The next two steps together: "If there was only one then print it to
> parent file, else print them to file 2. Delete them from the list"
> The 'delete' function usefully returns a list of all the records
> it deleted, so we can just print the results of deleting the array
> slice.
> 
>   if (@slice == 1) {
>     print PARENT delete @records[@slice];
>   }
>   else {
>     print FILE2 delete @records[@slice];
>   }
> 
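With around 200K records, rescanning the whole list once per record repeats a lot of work. The sketch below is a different technique from the one Rob describes: a single pass that groups records by key in a hash, then splits the groups into unique and duplicate sets. The names %group, @order, @unique and @dupes and the sample records are invented for illustration.

```perl
use strict;
use warnings;

sub keyval {
    my ($rec) = @_;
    my ($name) = $rec =~ /^displayname:\s+(.+)/m;
    my ($sn)   = $rec =~ /^jdirlastfourssn:\s+(\d+)/m;
    return join "\t", $name // '', $sn // '';
}

# Invented records; the first and third share a key.
my @records = (
    "dn: cn=1\ndisplayname: Doe, Jane\njdirlastfourssn: 1111\n",
    "dn: cn=2\ndisplayname: Roe, Rick\njdirlastfourssn: 2222\n",
    "dn: cn=3\ndisplayname: Doe, Jane\njdirlastfourssn: 1111\n",
);

# Group records by key, remembering the order in which keys first appear.
my (%group, @order);
for my $rec (@records) {
    my $key = keyval($rec);
    push @order, $key unless exists $group{$key};
    push @{ $group{$key} }, $rec;
}

# A key seen once is unique; a key seen more than once marks duplicates.
my (@unique, @dupes);
for my $key (@order) {
    my $recs = $group{$key};
    if (@$recs == 1) { push @unique, @$recs }
    else             { push @dupes,  @$recs }
}

printf "%d unique, %d duplicate\n", scalar @unique, scalar @dupes;
```

Printing @unique to the parent file and @dupes to the second file would give the same split as the loop above, assuming all the records fit in memory.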
> 
> And you're done. Clearly you need to open filehandles DATA for read
> and PARENT and FILE2 for write, but the program's there otherwise.
> 
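Putting the pieces together, a sketch of the assembled program might look like the following. It assumes lexical filehandles opened on in-memory strings so that it runs without any files; in a real run these would be opened on the input file, the new parent file and the duplicates file. The third record in the sample is invented to produce a duplicate, and the index search uses 'grep' over the index range rather than the 'map' shown above.

```perl
use strict;
use warnings;

# Two records from the thread plus one invented duplicate (cn=9999999).
my $input = <<'END';
dn: cn=1148734,ou=Employees,dc=jci,dc=com
displayname: Herek, Moriah L
jdirlastfourssn: 2888

dn: cn=1148735,ou=Employees,dc=jci,dc=com
displayname: Pelletier, Michael J
jdirlastfourssn: 8719
uid: cpellem

dn: cn=9999999,ou=Employees,dc=example,dc=com
displayname: Herek, Moriah L
jdirlastfourssn: 2888

END

# In-memory handles stand in for the real files.
my ($parent_out, $dupes_out) = ('', '');
open my $data,   '<', \$input      or die "open input: $!";
open my $parent, '>', \$parent_out or die "open parent: $!";
open my $file2,  '>', \$dupes_out  or die "open file2: $!";

sub keyval {
    my ($rec) = @_;
    my ($name) = $rec =~ /^displayname:\s+(.+)/m;
    my ($sn)   = $rec =~ /^jdirlastfourssn:\s+(\d+)/m;
    return join "\t", $name // '', $sn // '';
}

# Read all records into memory; 'dn:' starts a new record.
my @records;
while (my $line = <$data>) {
    push @records, '' if $line =~ /^dn:\s+/;
    next unless @records;
    $records[-1] .= $line;
}

while (@records) {
    # Drop the holes that 'delete' leaves at the front of the array.
    shift @records until exists $records[0];

    my $key   = keyval($records[0]);
    my @slice = grep {
        defined $records[$_] && keyval($records[$_]) eq $key
    } 0 .. $#records;

    # One match means unique; more than one means duplicates.
    my $out = @slice == 1 ? $parent : $file2;
    print {$out} delete @records[@slice];
}

close $_ for $data, $parent, $file2;
print "parent:\n$parent_out\nduplicates:\n$dupes_out";
```

Note that perlfunc discourages calling 'delete' on array elements in current Perls, so 'splice' or the hash-based grouping may be the safer choice for new code.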
> Let us know if you need help assembling the kit.
> 
> HTH,
> 
> Rob
> 
> 
> 
> 
> -- 
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 

