Holeman, Douglas wrote:
> Bob,
>
> Thank you for the help. If I may ask you one more question, how do I
> modify the randbits on a w32 install or do I have to recompile the
> source to get the number higher. It is 15 right now so it stalls at
> about 32563. Any help would be appreciated.

Doug, I'm not sure. I've redirected this back to the list in case
somebody else can help.

>
> Doug Holeman
>
>
> -----Original Message-----
> From: Bob Showalter [mailto:[EMAIL PROTECTED]
> Sent: Monday, September 29, 2003 8:24 PM
> To: Holeman, Douglas; [EMAIL PROTECTED]
> Subject: Re: Question on handling files.
>
>
> Douglas Holeman wrote:
> > I have had a request to pull 100,000+ names from a mailing list of
> > 600,000+. I was hoping to figure out a way to get perl to help me
> > do this.
> >
> > I stumbled through this code but the problem is I just wanted to
> > know how to delete a line from the first list so I can be sure I
> > don't pull the same 'random' line twice.
> >
> > Any help would be appreciated. I know this code sucks but I am not
> > a programmer.
> >
> >
> > open FEDOUT, ">c:\\perl\\bin\\129out.csv" or die "The Program has
> > failed opening 129Out.csv: $!";
> >
> > $x=1;
> > $z=633856;
> > print "1";
> >
> > while ($x <= 129000) {
> > open(FEDUP, "All.csv") or die "Can't find All.csv: $!\n";
> >
> > $y = int(rand($z)+1);
> >
> > $a=1;
> >
> >
> > while ($a < $y) {
> > $line = <FEDUP>;
> > # print $a," ",$y,"Z=", $z, "\n";
> > $a=$a+1;
> >
> > }
> >
> >
> > print $x,"...", $y,"...", $z,"...", $a,"...", $line,"...","\n";
> > print (FEDOUT $line); $a=1; $y=1;
> >
> > $x=$x+1;
> > close FEDUP;
> >
> > }
> >
> > print $x;
> > close FEDOUT;
> > close FEDUP;
>
> This code is going to be *extremely* slow, because you are
> making many, many, many passes over the huge 600K row file.
>
> Here's a fairly efficient algorithm that requires only one pass
> through the large file (after you know how many lines are in
> it). Pass the name of the large file as the first command
> line argument:
>
> #!/usr/bin/perl
> use strict;
> my $x = 633_856; # no. of rows in big file
> my $y = 129_000; # no. of rows to select
> my %h;
> for (1..$y) {
>     print "$_\n" unless $_ % 1000;
>     my $i = int(rand($x)) + 1;
>     exists $h{$i} and redo;
>     $h{$i}++;
> }
> while(<>) {
>     exists $h{$.} and print;
> }
>
> Here I'm basically creating a hash with 129k entries with keys
> randomly selected from the range 1..633k. Then as we read
> through the file, we just print each line whose number exists
> as a key in the hash.
>
> Important: make sure your Perl has enough randbits, or you
> won't get a distribution over this range. You'll need at least
> 20 bits and preferably 31 or more. Use "perl -V:randbits" to
> check the figure.
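
On the randbits question: with randbits=15, rand() can return only
32,768 distinct values, so the selection loop above can never collect
more than that many distinct keys. It slows to a crawl as it nears that
ceiling, which matches the stall around 32563. Short of rebuilding Perl,
one possible workaround is to glue two 15-bit draws into one 30-bit
number and reject the uneven tail. What follows is only a sketch under
that assumption, not something tested on the w32 build in question. The
names rand30, rand_line_number and pick_lines are made up for
illustration and are not part of Perl or of Bob's script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: uniform integer in 0 .. 2**30 - 1, built from two
# draws that each need only 15 random bits.
sub rand30 {
    return int(rand(32768)) * 32768 + int(rand(32768));
}

# Hypothetical helper: uniform integer in 1 .. $n, for $n <= 2**30.
# Draws that land in the uneven tail are thrown away so that no line
# number is favoured over another.
sub rand_line_number {
    my ($n) = @_;
    my $limit = 2**30 - (2**30 % $n);   # largest multiple of $n <= 2**30
    my $r;
    do { $r = rand30() } while $r >= $limit;
    return $r % $n + 1;
}

# Collect $want distinct line numbers from 1 .. $total in a hash.
sub pick_lines {
    my ($total, $want) = @_;
    die "cannot pick $want distinct lines from $total\n" if $want > $total;
    my %h;
    $h{ rand_line_number($total) } = 1 while scalar(keys %h) < $want;
    return \%h;
}

# Run the single pass over the big file only when a filename is given.
if (@ARGV) {
    my $picked = pick_lines(633_856, 129_000);   # figures from the thread
    while (<>) {
        print if exists $picked->{$.};
    }
}
```

Run it the same way as the script above, with the big file as the first
argument and the output redirected to a new file. The rejection step
costs a few extra draws, but it avoids the slight bias a plain modulus
would introduce.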