Douglas Holeman wrote:
> I have had a request to pull 100,000+ names from a mailing list of
> 600,000+. I was hoping to figure out a way to get perl to help me do
> this.
>
> I stumbled through this code but the problem is I just wanted to know
> how to delete a line from the first list so I can be sure I dont pull
> the same 'random' line twice.
>
> Any help would be appreciated.  I know this code sucks but I am not a
> programmer.
>
>
>
>
> open FEDOUT, ">c:\\perl\\bin\\129out.csv" or die "The Program has
> failed opening 129Out.csv: $!";
>
> $x=1;
> $z=633856;
> print "1";
>
> while ($x <= 129000) {
> open(FEDUP, "All.csv") or die "Can't find All.csv: $!\n";
>
>  $y = int(rand($z)+1);
>
>  $a=1;
>
>
>  while ($a < $y) {
>   $line = <FEDUP>;
> #  print $a,"          ",$y,"Z=", $z, "\n";
>   $a=$a+1;
>
>  }
>
>
>  print $x,"...", $y,"...", $z,"...", $a,"...", $line,"...","\n";
>  print (FEDOUT $line);
>  $a=1;
>  $y=1;
>
>  $x=$x+1;
> close FEDUP;
>
> }
>
> print $x;
> close FEDOUT;
> close FEDUP;

This code is going to be *extremely* slow, because you are making many,
many, many passes over the huge 600K row file.

Here's a fairly efficient algorithm that requires only one pass through the
large file (once you know how many lines are in it). Pass the name of the
large file as the first command line argument. The selected lines go to
standard output, so redirect that to your output file:

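  #!/usr/bin/perl
  use strict;
  use warnings;
  my $x = 633_856;        # no. of rows in big file
  my $y = 129_000;        # no. of rows to select
  my %h;                  # keys are the line numbers to print
  for (1..$y) {
      my $i = int(rand($x)) + 1;      # random line number in 1..$x
      exists $h{$i} and redo;         # already picked: draw again
      $h{$i}++;
      # progress goes to STDERR so it stays out of the selected lines
      print STDERR "$_\n" unless $_ % 1000;
  }
  while (<>) {
      exists $h{$.} and print;        # $. is the current line number
  }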

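Here I'm basically creating a hash with 129k entries with keys randomly
selected from the range 1..633k. Hash keys are unique and a number that is
already present is drawn again, so no line can be picked twice. Then as we
read through the file, we just print each line whose number exists as a key
in the hash.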
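If you would rather not hard-code the line count or build the hash, here is a
rough sketch of a different technique, usually called selection sampling
(Knuth's Algorithm S). It is not the hash approach above: it keeps each line
with probability (lines still wanted) / (lines still unread), which also
guarantees no line is picked twice. The sub names count_lines and
sample_lines are made up for this example, and it assumes the file can be
read twice (once to count, once to select).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count the lines in a file with one sequential read.
sub count_lines {
    my ($file) = @_;
    open my $fh, '<', $file or die "Can't open $file: $!";
    my $n = 0;
    while (my $line = <$fh>) { $n++ }
    close $fh;
    return $n;
}

# Copy $want of the $total lines read from $in to $out, each at most once.
# Each line is kept with probability (still wanted) / (still unread).
sub sample_lines {
    my ($in, $out, $total, $want) = @_;
    my $remaining = $total;
    while (my $line = <$in>) {
        last if $want <= 0 || $remaining <= 0;
        if (rand($remaining) < $want) {
            print {$out} $line;
            $want--;
        }
        $remaining--;
    }
    return;
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    my $file  = shift @ARGV;
    my $want  = 129_000;               # no. of rows to select
    my $total = count_lines($file);    # no. of rows in big file
    die "File has only $total lines\n" if $want > $total;
    open my $in, '<', $file or die "Can't open $file: $!";
    sample_lines($in, \*STDOUT, $total, $want);
    close $in;
}
```

Saved as, say, sample.pl, it would be run as "perl sample.pl All.csv >
129out.csv". The selected lines come out in their original file order.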

Important: make sure your Perl has enough randbits, or you won't get an even
distribution over this range. You'll need at least 20 bits (2**20 is
1,048,576, the first power of two above 633,856) and preferably 31 or more.
Use "perl -V:randbits" to check the figure.
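The same check can be done from inside a script. This is only a sketch: it
reads the randbits value from the standard Config module and compares it with
the number of bits needed to cover the row count. The helper name bits_needed
is made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Config;

# Smallest number of bits b such that 2**b >= $n.
sub bits_needed {
    my ($n) = @_;
    my $bits = 0;
    $bits++ while 2**$bits < $n;
    return $bits;
}

my $rows = 633_856;                 # no. of rows in big file
my $need = bits_needed($rows);
my $have = $Config{randbits};       # same figure "perl -V:randbits" reports

printf "need at least %d bits, this perl reports %s\n",
    $need, defined $have ? $have : 'unknown';
warn "rand() is too coarse to reach every line number\n"
    if defined $have && $have < $need;
```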


