RE: Question on handling files.
Holeman, Douglas wrote:
> Bob,
>
> Thank you for the help. If I may ask you one more question, how do I
> modify the randbits on a w32 install or do I have to recompile the
> source to get the number higher. It is 15 right now so it stalls at
> about 32563. Any help would be appreciated.

Doug, I'm not sure. I've redirected this back to the list in case
somebody else can help.

> Doug Holeman
>
> -----Original Message-----
> From: Bob Showalter [mailto:[EMAIL PROTECTED]
> Sent: Monday, September 29, 2003 8:24 PM
> To: Holeman, Douglas; [EMAIL PROTECTED]
> Subject: Re: Question on handling files.

--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
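On the randbits question above: with randbits at 15, rand can return only 32,768 distinct values, so int(rand($x)) + 1 can never produce 129,000 different line numbers. Once the available values are used up the redo loop spins forever, which would explain the stall near 32,000. As far as I know, randbits is fixed when perl is built, so raising it means using a differently built perl. A workaround that avoids rebuilding is to assemble a larger random number from two 15-bit draws. The sketch below assumes that approach; rand_int_30 is a made-up helper name, not something from the thread or from Perl itself, and it only handles ranges up to 2**30.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return a uniform random integer in 1..$n built from two 15-bit draws,
# so the result does not depend on the randbits value perl was built with.
# Hypothetical helper for illustration; assumes 1 <= $n <= 2**30.
sub rand_int_30 {
    my ($n) = @_;
    die "n must be between 1 and 2**30\n" if $n < 1 || $n > 2**30;

    # Largest multiple of $n that fits in 30 bits; draws at or above it
    # are rejected so that the final modulo step stays unbiased.
    my $limit = 2**30 - (2**30 % $n);

    while (1) {
        my $hi = int(rand(32768));     # 15 random bits
        my $lo = int(rand(32768));     # 15 more random bits
        my $r  = $hi * 32768 + $lo;    # 30-bit value, 0 .. 2**30 - 1
        next if $r >= $limit;          # reject and draw again
        return 1 + $r % $n;
    }
}
```

In the one-pass script from this thread, the line `my $i = int(rand($x)) + 1;` would then become `my $i = rand_int_30($x);`.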
Re: Question on handling files.
Bob Showalter wrote:
> for (1..$y) {
>     print "$_\n" unless $_ % 1000;

Whoops! You can leave the line above out. I just stuck it in there so
I could see the progress through the loop.

>     my $i = int(rand($x)) + 1;
>     exists $h{$i} and redo;
>     $h{$i}++;
> }
Re: Question on handling files.
Douglas Holeman wrote:
> I have had a request to pull 100,000+ names from a mailing list of
> 600,000+. I was hoping to figure out a way to get perl to help me do
> this.
>
> I stumbled through this code but the problem is I just wanted to know
> how to delete a line from the first list so I can be sure I don't pull
> the same 'random' line twice.
>
> Any help would be appreciated. I know this code sucks but I am not a
> programmer.
>
> open FEDOUT, ">c:\\perl\\bin\\129out.csv" or die "The Program has
> failed opening 129Out.csv: $!";
>
> $x=1;
> $z=633856;
> print "1";
>
> while ($x <= 129000) {
>     open(FEDUP, "All.csv") or die "Can't find All.csv: $!\n";
>
>     $y = int(rand($z)+1);
>
>     $a=1;
>
>     while ($a < $y) {
>         $line = <FEDUP>;
>         # print $a," ",$y,"Z=", $z, "\n";
>         $a=$a+1;
>     }
>
>     print $x,"...", $y,"...", $z,"...", $a,"...", $line,"...","\n";
>     print (FEDOUT $line);
>     $a=1;
>     $y=1;
>
>     $x=$x+1;
>     close FEDUP;
> }
>
> print $x;
> close FEDOUT;
> close FEDUP;

This code is going to be *extremely* slow, because you are making
many, many, many passes over the huge 600K-row file.

Here's a fairly efficient algorithm that requires only one pass
through the large file (after you know how many lines are in it).
Pass the name of the large file as the first command line argument:

    #!/usr/bin/perl
    use strict;
    my $x = 633_856;    # no. of rows in big file
    my $y = 129_000;    # no. of rows to select
    my %h;
    for (1..$y) {
        print "$_\n" unless $_ % 1000;
        my $i = int(rand($x)) + 1;
        exists $h{$i} and redo;    # already picked, draw again
        $h{$i}++;
    }
    while(<>) {
        exists $h{$.} and print;   # $. is the current line number
    }

Here I'm basically creating a hash with 129k entries with keys
randomly selected from the range 1..633k. Then as we read through
the file, we just print each line whose number exists as a key in
the hash.

Important: make sure your Perl has enough randbits, or you won't get
a distribution over this range. You'll need at least 20 bits and
preferably 31 or more. Use "perl -V:randbits" to check the figure.
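Another way to avoid pulling the same line twice, which is not the method given in the reply above, is selection sampling (Knuth's Algorithm S): read the file once and keep each line with probability (lines still needed) / (lines still unread). No line number is ever drawn, so there is nothing to de-duplicate, and it should be far less sensitive to a low randbits value. A minimal sketch, assuming the line count is known in advance; select_lines is a hypothetical helper name, and the counts are the ones used in this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Selection sampling: copy exactly $want of the $total lines read from
# $in to $out, in a single pass. Hypothetical helper for illustration;
# $in and $out are already-open filehandles.
sub select_lines {
    my ($in, $out, $total, $want) = @_;
    my $seen = 0;    # lines read so far
    my $kept = 0;    # lines selected so far
    while (my $line = <$in>) {
        last if $kept >= $want;
        # Keep this line with probability (still needed) / (still unread).
        # When the two counts are equal the test always succeeds, so the
        # quota is filled by the time $total lines have been read.
        if (rand($total - $seen) < $want - $kept) {
            print {$out} $line;
            $kept++;
        }
        $seen++;
    }
    return $kept;
}

# Example run (file name taken from the thread):
#   perl sample.pl All.csv > 129out.csv
if (@ARGV) {
    open my $in, '<', $ARGV[0] or die "Can't open $ARGV[0]: $!\n";
    select_lines($in, \*STDOUT, 633_856, 129_000);
    close $in;
}
```

The selected lines come out in file order. If the file has fewer lines than $total, fewer than $want lines may be written, so the line count needs to be right.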
Question on handling files.
I have had a request to pull 100,000+ names from a mailing list of
600,000+. I was hoping to figure out a way to get perl to help me do
this.

I stumbled through this code but the problem is I just wanted to know
how to delete a line from the first list so I can be sure I don't pull
the same 'random' line twice.

Any help would be appreciated. I know this code sucks but I am not a
programmer.

    open FEDOUT, ">c:\\perl\\bin\\129out.csv"
        or die "The Program has failed opening 129Out.csv: $!";

    $x=1;
    $z=633856;
    print "1";

    while ($x <= 129000) {
        open(FEDUP, "All.csv") or die "Can't find All.csv: $!\n";

        $y = int(rand($z)+1);

        $a=1;

        while ($a < $y) {
            $line = <FEDUP>;
            # print $a," ",$y,"Z=", $z, "\n";
            $a=$a+1;
        }

        print $x,"...", $y,"...", $z,"...", $a,"...", $line,"...","\n";
        print (FEDOUT $line);
        $a=1;
        $y=1;

        $x=$x+1;
        close FEDUP;
    }

    print $x;
    close FEDOUT;
    close FEDUP;

Douglas Holeman
Prepress Manager