RE: Question on handling files.

Bob Showalter Tue, 30 Sep 2003 07:39:30 -0700

Holeman, Douglas wrote:
> Bob,
> 
> Thank you for the help. If I may ask one more question: how do I
> modify randbits on a Win32 install, or do I have to recompile the
> source to get the number higher? It is 15 right now, so it stalls at
> about 32563. Any help would be appreciated.


Doug, I'm not sure. I've redirected this back to the list in case somebody
else can help.
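
In the meantime, one workaround that should not need a rebuild is to
combine two 15-bit draws into a single 30-bit number. This is only a
minimal sketch, assuming rand(32768) really does give 15 usable bits on
your build. wide_rand is a made-up helper name for illustration, not
something that ships with Perl:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Config;

# Hypothetical helper (not part of Perl): returns an integer in 1..$n
# built from two 15-bit draws, so it does not rely on randbits >= 20.
sub wide_rand {
    my ($n) = @_;
    my $limit = 32768 * 32768;              # 2**30 combined values
    die "range too large\n" if $n > $limit;
    # Reject draws at or above the largest multiple of $n, so that the
    # modulo below does not favour the low end of the range.
    my $max_ok = $limit - ($limit % $n);
    while (1) {
        my $r = int(rand(32768)) * 32768 + int(rand(32768));
        return ($r % $n) + 1 if $r < $max_ok;
    }
}

# Same figure that "perl -V:randbits" reports.
print "randbits: $Config{randbits}\n";
print wide_rand(633_856), "\n" for 1 .. 5;
```

In the script quoted below, a call like wide_rand($x) would take the
place of int(rand($x)) + 1. I have not tried this on a Win32 build.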

> 
> 
> Doug Holeman
> 
> 
> 
> -----Original Message-----
> From: Bob Showalter [mailto:[EMAIL PROTECTED]
> Sent: Monday, September 29, 2003 8:24 PM
> To: Holeman, Douglas; [EMAIL PROTECTED]
> Subject: Re: Question on handling files.
> 
> 
> Douglas Holeman wrote:
> > I have had a request to pull 100,000+ names from a mailing list of
> > 600,000+. I was hoping to figure out a way to get perl to help me
> > do this. 
> > 
> > I stumbled through this code, but the problem is I just wanted to
> > know how to delete a line from the first list so I can be sure I
> > don't pull the same 'random' line twice. 
> > 
> > Any help would be appreciated.  I know this code sucks but I am not
> > a programmer. 
> > 
> > 
> > 
> > 
> > open FEDOUT, ">c:\\perl\\bin\\129out.csv" or die "The Program has
> > failed opening 129Out.csv: $!";
> > 
> > $x=1;
> > $z=633856;
> > print "1";
> > 
> > while ($x <= 129000) {
> > open(FEDUP, "All.csv") or die "Can't find All.csv: $!\n";
> > 
> >  $y = int(rand($z)+1);
> > 
> >  $a=1;
> > 
> > 
> >  while ($a < $y) {
> >   $line = <FEDUP>;
> > #  print $a,"          ",$y,"Z=", $z, "\n";
> >   $a=$a+1;
> > 
> >  }
> > 
> > 
> >  print $x,"...", $y,"...", $z,"...", $a,"...", $line,"...","\n";
> >  print (FEDOUT $line);  $a=1; $y=1;
> > 
> >  $x=$x+1;
> > close FEDUP;
> > 
> > }
> > 
> > print $x;
> > close FEDOUT;
> > close FEDUP;
> 
> This code is going to be *extremely* slow, because you are
> making many, many, many passes over the huge 600K row file.
> 
> Here's a fairly efficient algorithm that requires only one pass
> through the large file (after you know how many lines are in
> it). Pass the name of the large file as the first command
> line argument:
> 
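>   #!/usr/bin/perl
>   use strict;
>   use warnings;
>   my $x = 633_856;        # no. of rows in big file
>   my $y = 129_000;        # no. of rows to select
>   my %h;                  # keys are the line numbers to keep
>   for (1..$y) {
>       # progress goes to STDERR so it stays out of the selected rows
>       print STDERR "$_\n" unless $_ % 1000;
>       my $i = int(rand($x)) + 1;
>       exists $h{$i} and redo;     # already picked, draw again
>       $h{$i}++;
>   }
>   while (<>) {
>       exists $h{$.} and print;    # $. is the current input line number
>   }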
> 
> Here I'm basically creating a hash with 129k entries with keys
> randomly selected from the range 1..633k. Then as we read
> through the file, we just print each line whose number exists
> as a key in the hash.
> 
> Important: make sure your Perl has enough randbits, or you
> won't get a distribution over this range. You'll need at least
> 20 bits and preferably 31 or more. Use "perl -V:randbits" to
> check the figure.

