Douglas Holeman wrote:
> I have had a request to pull 100,000+ names from a mailing list of
> 600,000+. I was hoping to figure out a way to get perl to help me do
> this.
>
> I stumbled through this code but the problem is I just wanted to know
> how to delete a line from the first list so I can be sure I dont pull
> the same 'random' line twice.
>
> Any help would be appreciated. I know this code sucks but I am not a
> programmer.
>
> open FEDOUT, ">c:\\perl\\bin\\129out.csv" or die "The Program has
> failed opening 129Out.csv: $!";
>
> $x=1;
> $z=633856;
> print "1";
>
> while ($x <= 129000) {
>     open(FEDUP, "All.csv") or die "Can't find All.csv: $!\n";
>
>     $y = int(rand($z)+1);
>
>     $a=1;
>
>     while ($a < $y) {
>         $line = <FEDUP>;
>         # print $a," ",$y,"Z=", $z, "\n";
>         $a=$a+1;
>     }
>
>     print $x,"...", $y,"...", $z,"...", $a,"...", $line,"...","\n";
>     print (FEDOUT $line);
>     $a=1;
>     $y=1;
>
>     $x=$x+1;
>     close FEDUP;
> }
>
> print $x;
> close FEDOUT;
> close FEDUP;

This code is going to be *extremely* slow, because it reopens the huge
600K-row file and reads through it again for every one of the 129,000
names you select.

Here's a fairly efficient algorithm that requires only one pass through
the large file (once you know how many lines are in it). Pass the name
of the large file as the first command line argument; the selected
lines go to standard output:

    #!/usr/bin/perl

    use strict;

    my $x = 633_856;        # no. of rows in big file
    my $y = 129_000;        # no. of rows to select

    my %h;                  # keys are the line numbers to keep
    for (1..$y) {
        my $i = int(rand($x)) + 1;      # random line number in 1..$x
        exists $h{$i} and redo;         # already chosen, so draw again
        $h{$i}++;
        # progress goes to STDERR so it stays out of the selected rows
        print STDERR "$_\n" unless $_ % 1000;
    }

    while (<>) {
        exists $h{$.} and print;        # $. is the current line number
    }

Here I'm basically creating a hash with 129k entries whose keys are
randomly selected from the range 1..633k. Then, as we read through the
file, we just print each line whose number exists as a key in the hash.
Because a line number can only be a hash key once, no line is ever
pulled twice.

Important: make sure your Perl has enough randbits, or you won't get a
distribution over this whole range. With 15 randbits, for example,
rand() can return only 32,768 distinct values, far fewer than the
633,856 line numbers. You'll need at least 20 bits and preferably 31 or
more. Use "perl -V:randbits" to check the figure.
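
If you want to convince yourself that the selection step never repeats a line number before running it on the real list, it can be tried on a small range first. What follows is only a minimal sketch of that same idea, not a replacement for the script in this message: the subroutine name pick_line_numbers and the 3-of-10 demonstration are my own assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original post): choose $want
# distinct line numbers from 1..$total and return them as a hash ref.
sub pick_line_numbers {
    my ($total, $want) = @_;
    die "cannot pick $want of $total lines\n" if $want > $total;

    my %picked;
    while (scalar(keys %picked) < $want) {
        my $n = int(rand($total)) + 1;   # random number in 1..$total
        $picked{$n} = 1;                 # a repeat just overwrites itself
    }
    return \%picked;
}

# Small demonstration: pick 3 of 10 line numbers and list them in order.
my $demo = pick_line_numbers(10, 3);
print join(",", sort { $a <=> $b } keys %$demo), "\n";
```

The loop keeps drawing until the hash holds the requested number of keys, so a duplicate draw costs one extra iteration but can never produce a duplicate line. The printed numbers differ from run to run because rand() is seeded differently each time.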