RE: Random sampling in perl

Balint, Jess Fri, 22 Mar 2002 13:01:59 -0800

I just typed the code in the e-mail. It was part of my program. Sorry.

-----Original Message-----
From: Wagner-David [mailto:[EMAIL PROTECTED]]
Sent: Friday, March 22, 2002 4:16 PM
To: 'Balint, Jess'; '[EMAIL PROTECTED]'
Subject: RE: Random sampling in perl



        If that is the code, then change if( $i = $array[$j] ) to if( $i ==
$array[$j] ); otherwise you end up assigning $i the value of $array[$j].

Wags ;)

-----Original Message-----
From: Balint, Jess [mailto:[EMAIL PROTECTED]]
Sent: Friday, March 22, 2002 13:04
To: '[EMAIL PROTECTED]'
Subject: Re: Random sampling in perl


Thanks for the input. The only trouble I would have with that is the file
size. My files are HUGE. I don't think the admins around here would like me
doing that. What I was thinking was to generate an array with a bunch of
random numbers in numerical order. Then run through the file and print only
those lines.

The array would contain as many elements as the number of samples needed,
e.g. scalar(@array) == $sampsize. Then I was thinking something like this:

my ($i, $j) = (0, 0);
select(OUT);
while(<FILE>) {
        $i++;           #current line of file
        if( $i = $array[$j] ) {         # where j is the position in the array
                print;
                $j++;
        }
}
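
A minimal sketch of the one-pass filter described above, with the test written as a numeric comparison (==). The sub name print_selected_lines and its arguments are made up for illustration; it assumes the wanted line numbers are 1-based, sorted ascending, and free of duplicates.

```perl
use strict;
use warnings;

# Hypothetical helper: copy the lines whose 1-based numbers appear in
# @$wanted (sorted ascending, no duplicates) from $in to $out.
sub print_selected_lines {
    my ( $in, $out, $wanted ) = @_;
    my $line_no = 0;    # current line of the input
    my $j       = 0;    # position in @$wanted
    while ( my $line = <$in> ) {
        $line_no++;
        last if $j > $#$wanted;    # every wanted line is written; stop early
        if ( $line_no == $wanted->[$j] ) {    # comparison, not assignment
            print {$out} $line;
            $j++;
        }
    }
    return $j;    # number of lines actually written
}

# Usage with in-memory filehandles, so nothing touches the disk.
my $text = join '', map { "row $_\n" } 1 .. 10;
open my $in, '<', \$text or die "Cannot open in-memory input: $!";
my $picked = '';
open my $out, '>', \$picked or die "Cannot open in-memory output: $!";
print_selected_lines( $in, $out, [ 2, 5, 9 ] );
close $out;
# $picked now holds "row 2\nrow 5\nrow 9\n"
```

Only one line of the file is held in memory at a time, which is the point of this approach for very large files.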

The main trouble I am having is creating the random array. I need to make
sure there are no dups.

my @array;
while( @array < $sampsize ) {
        # $total (assumed name) is the number of lines in the file
        $rnd = 1 + int rand $total;
        # check for dups
        # store in array
}
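
One way to fill in the two commented steps is to use a hash as the duplicate check, since a repeated number simply overwrites its own key. This is only a sketch: pick_line_numbers, $total_lines and $sample_size are names invented here, and it assumes the total line count is already known (for example from an earlier counting pass) and that rand on the Perl in use has enough random bits to reach every line number.

```perl
use strict;
use warnings;

# Hypothetical helper: return $sample_size distinct line numbers in the
# range 1 .. $total_lines, sorted ascending.
sub pick_line_numbers {
    my ( $total_lines, $sample_size ) = @_;
    die "Sample size exceeds number of lines\n"
        if $sample_size > $total_lines;

    my %seen;
    while ( keys(%seen) < $sample_size ) {
        my $n = 1 + int( rand($total_lines) );    # 1 .. $total_lines
        $seen{$n} = 1;    # a duplicate draw just overwrites its own key
    }
    return sort { $a <=> $b } keys %seen;
}

my @array = pick_line_numbers( 3_210_008, 10 );
print "@array\n";
```

The loop slows down as the sample size approaches the total, because most draws are then repeats; for samples that are a small fraction of the file it stays cheap. If the loop never finishes on an older Perl, checking `perl -V:randbits` would show whether rand can produce enough distinct values.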

I think once I have the array created I should be fine. Any ideas? Thanks ...
-J-e-s-s-

Jess Balint wrote:
> 
> Hello all, I have a file of 3,210,008 CSV records. I need to take a random
> sample of this. I tried hacking something together a while ago, but it
> seemed to repeat 65,536 different records. When I need a 5mil sample, this
> creates a problem.
> 
> Here is my old code: I know the logic allows dups, but what would incur the
> limit? I think with 500,000 samples there wouldn't be a problem getting more
> than 65536 diff records, but that number is too ironic for me to deal with.
> Thanks.
> 
> #!/usr/local/bin/perl -w
> 
> open (FILE,"consumer.sample.sasdump.txt");
> open (NEW,">consumer.new");
> 
> @data = <FILE>;
> 
> for ( $jess = 0; $jess < 500000; $jess++ ) {
>         $index = rand @data;
>         print NEW $data[$index];
> }
> 
> close(FILE);
> close(NEW);


This should do what you want:

#!/usr/local/bin/perl -w
use strict;

srand;

open FILE, 'consumer.sample.sasdump.txt'
    or die "Cannot open 'consumer.sample.sasdump.txt': $!";
open NEW,  '> consumer.new' or die "Cannot open 'consumer.new': $!";

my @data = <FILE>;

for ( 1 .. 500_000 ) {
    print NEW splice @data, rand @data, 1;
}

close FILE;
close NEW;
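
The splice in the loop above removes each chosen line from @data, so no record can be drawn twice. The same sampling without replacement can be sketched with a partial Fisher-Yates shuffle, which swaps elements instead of closing the gap in the array on every draw. This is a swapped-in technique, not the code posted above, and the sub name sample_without_replacement is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return $k elements of @$data chosen without
# replacement. Reorders @$data in place (partial Fisher-Yates shuffle).
sub sample_without_replacement {
    my ( $data, $k ) = @_;
    my $n = @$data;
    $k = $n if $k > $n;    # cannot draw more elements than exist
    for my $i ( 0 .. $k - 1 ) {
        my $r = $i + int( rand( $n - $i ) );    # random index in $i .. $n-1
        @$data[ $i, $r ] = @$data[ $r, $i ];    # move the pick into slot $i
    }
    return @$data[ 0 .. $k - 1 ];
}

my @data   = map { "record $_\n" } 1 .. 20;
my @sample = sample_without_replacement( \@data, 5 );
print @sample;
```

Like the splice version, this still needs the whole file in @data, so it does not help with the memory concern raised elsewhere in the thread.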



John
-- 
use Perl;
program
fulfillment

-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
