On Sat, 2006-01-04 at 09:57 -0500, Frank Bax wrote: > I'm not the OP, but I have a script with a similar problem. The script has > some logic that generates many (thousands of billions) of combinations from > a little bit of data and only the best 100 combos are output. For each > combination produced by script, a value is calculated. I maintain an array > of "best" results so far; if size of this array is <100, then a "new" > result is automatically added. If there are already 100 items, I find the > "worst" entry in the array. If the "new" entry is better than the "worst", > the "new" one replaces the "worst". > > I am not sure how your reference to "fail early" and "succeed early" apply > to this situation. > > At the moment, the array is left unsorted. If I use a sorted array, it > needs to resorted every time a "worst" entry is replaced by a "new" > entry. Can I avoid sorting the array every iteration? How else to find > the "worst" entry? If I replace "worst" with a "new" entry, doesn't the > array need to be resorted? How does the cost of searching an unsorted > array compare to resorting on every update and searching a sorted array. > > Tom's idea about growing to 200, then chopping back to 100 also sounds > interesting. > > Frank
Below is the algorithm I was talking about. Let's say you want to keep the k best items of a list of size n, where best means the highest or lowest of them. If n <= k: sort. If k < n <= k * log2( k ): sort, discard excess. If n > k *log2( k ): use below. If k = 1_000 and n = 1_000_000_000, then 999_999_000 items are not going to be in the best list. Therefore, you want your algorithm to discard them as quickly as possible, hopefully, with just one test. This is what I mean by fail early; if most of the items are going to fail, design the algorithm so that failure is detected as early as possible. Similarly, succeed early means to design the algorithm to detect success as early as possible. The script below uses a linear search to insert an item into the best list. You could use a binary search or a heap instead. I would not use a binary tree or a balanced binary tree. These structures work best when you are doing more searches than insertions and in this case we are doing an insertion every time. #!/usr/bin/perl # -------------------------------------- # best_k -- Find the best k items of a set. # -------------------------------------- # Pragmas use strict; use warnings; # -------------------------------------- # Modules use Data::Dumper; use Getopt::Long; use POSIX; # -------------------------------------- # Configuration Parameters my $K = 10; # top ten $Data::Dumper::Sortkeys = 1; $Data::Dumper::Indent = 1; $Data::Dumper::Maxdepth = 0; # -------------------------------------- # Globals Variables my $Lowest = 0; # -------------------------------------- # Subroutines # -------------------------------------- # help(); # Print help via pod2text. # -------------------------------------- sub help { system( "pod2text -t $0" ); exit 0; } # -------------------------------------- # usage(); # Print usage via pod2usage. # -------------------------------------- sub usage { system( "pod2usage $0" ); exit 1; } # -------------------------------------- # init(); # Do things only possible when running. # -------------------------------------- sub init { unless( GetOptions( help => \&help, 'top|highest' => sub { $Lowest = 0; }, 'worst|lowest' => sub { $Lowest = 1; }, 'k=i' => \$K, )){ usage(); } die "bad k $K\n" if $K <= 1; } # -------------------------------------- # Main { my @items = (); my %item = (); init; # Get the best k while( <> ){ /^\D*(\d+)/; my $nbr = $1; %item = ( nbr => $nbr, orig => $_, ); if( @items == 0 ){ push @items, { %item }; next; } if( @items < $K || ( $Lowest && $nbr < $items[-1]{nbr} ) || ( ! $Lowest && $nbr > $items[-1]{nbr} ) ){ my $i = $#items; while( $i >= 0 && ( ( $Lowest && $nbr < $items[$i]{nbr} ) || ( ! $Lowest && $nbr > $items[$i]{nbr} ) ) ){ $i --; } if( $i < 0 ){ unshift @items, { %item }; }else{ splice( @items, $i+1, 0, { %item } ); } } $#items = $K - 1 if $#items >= $K; } # Display the results for my $item ( @items ){ print $item->{orig}; } } __END__ =head1 NAME best_k -- Find the best k items of a set. =head1 SYNOPSIS best_k [--top|highest|worst|lowest] [--k=<k>] [<file>] ... best_k --help =head2 Options =over 4 =item --top|highest|worst|lowest If top or highest, then the highest K items are chosen. If worst or lowest, then the lowest K items are chosen. Default it top. =item --k=<k> Set the number of items to choose. Default is 10. =item --help Print this message. =back =head1 REQUIRES =head1 DESCRIPTION Find the best k items of a set. =head1 FILES =head1 SEE ALSO =head1 AUTHOR Mr. Shawn H. Corey, B.Sc. =head1 HISTORY $Log$ =cut -- __END__ Just my 0.00000002 million dollars worth, --- Shawn "For the things we have to learn before we can do them, we learn by doing them." Aristotle * Perl tutorials at http://perlmonks.org/?node=Tutorials * A searchable perldoc is at http://perldoc.perl.org/ -- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] <http://learn.perl.org/> <http://learn.perl.org/first-response>