On Sat, 2006-01-04 at 09:57 -0500, Frank Bax wrote:
> I'm not the OP, but I have a script with a similar problem.  The script has 
> some logic that generates many (thousands of billions) of combinations from 
> a little bit of data and only the best 100 combos are output. For each 
> combination produced by script, a value is calculated.  I maintain an array 
> of "best" results so far; if size of this array is <100, then a "new" 
> result is automatically added.  If there are already 100 items, I find the 
> "worst" entry in the array.  If the "new" entry is better than the "worst", 
> the "new" one replaces the "worst".
> 
> I am not sure how your reference to "fail early" and "succeed early" apply 
> to this situation.
> 
> At the moment, the array is left unsorted.  If I use a sorted array, it 
> needs to resorted every time a "worst" entry is replaced by a "new" 
> entry.  Can I avoid sorting the array every iteration?  How else to find 
> the "worst" entry?  If I replace "worst" with a "new" entry, doesn't the 
> array need to be resorted?  How does the cost of searching an unsorted 
> array compare to resorting on every update and searching a sorted array.
> 
> Tom's idea about growing to 200, then chopping back to 100 also sounds 
> interesting.
> 
> Frank  

Below is the algorithm I was talking about.

Let's say you want to keep the k best items of a list of size n, where
best means the highest or lowest of them.

If n <= k: sort.

If k < n <= k * log2( k ): sort, discard excess.

If n > k *log2( k ): use below.

If k = 1_000 and n = 1_000_000_000, then 999_999_000 items are not going
to be in the best list. Therefore, you want your algorithm to discard
them as quickly as possible, hopefully, with just one test. This is what
I mean by fail early; if most of the items are going to fail, design the
algorithm so that failure is detected as early as possible.

Similarly, succeed early means to design the algorithm to detect success
as early as possible.

The script below uses a linear search to insert an item into the best
list. You could use a binary search or a heap instead. I would not use a
binary tree or a balanced binary tree. These structures work best when
you are doing more searches than insertions and in this case we are
doing an insertion every time.


#!/usr/bin/perl
# --------------------------------------
# best_k -- Find the best k items of a set.

# --------------------------------------
# Pragmas
use strict;
use warnings;

# --------------------------------------
# Modules
use Data::Dumper;
use Getopt::Long;
use POSIX;

# --------------------------------------
# Configuration Parameters

my $K = 10; # top ten

$Data::Dumper::Sortkeys = 1;
$Data::Dumper::Indent   = 1;
$Data::Dumper::Maxdepth = 0;

# --------------------------------------
# Globals Variables

my $Lowest = 0;

# --------------------------------------
# Subroutines

# --------------------------------------
# help();
#   Print help via pod2text.
# --------------------------------------
sub help {
  system( "pod2text -t $0" );
  exit 0;
}

# --------------------------------------
# usage();
#   Print usage via pod2usage.
# --------------------------------------
sub usage {
  system( "pod2usage $0" );
  exit 1;
}

# --------------------------------------
# init();
#   Do things only possible when running.
# --------------------------------------
sub init {
  unless( GetOptions(
      help => \&help,
      'top|highest' => sub { $Lowest = 0; },
      'worst|lowest' => sub { $Lowest = 1; },
      'k=i' => \$K,
  )){
    usage();
  }

  die "bad k $K\n" if $K <= 1;
}

# --------------------------------------
# Main
{
  my @items = ();
  my %item = ();

  init;

  # Get the best k
  while( <> ){
    /^\D*(\d+)/;
    my $nbr = $1;
    %item = (
      nbr => $nbr,
      orig => $_,
    );

    if( @items == 0 ){
      push @items, { %item };
      next;
    }

    if( @items < $K
    || ( $Lowest && $nbr < $items[-1]{nbr} )
    || ( ! $Lowest && $nbr > $items[-1]{nbr} )
    ){
      my $i = $#items;
      while( $i >= 0
      && ( ( $Lowest && $nbr < $items[$i]{nbr} )
        || ( ! $Lowest && $nbr > $items[$i]{nbr} ) )
      ){
        $i --;
      }
      if( $i < 0 ){
        unshift @items, { %item };
      }else{
        splice( @items, $i+1, 0, { %item } );
      }
    }
    $#items = $K - 1 if $#items >= $K;

  }

  # Display the results
  for my $item ( @items ){
    print $item->{orig};
  }
}

__END__

=head1 NAME

best_k -- Find the best k items of a set.

=head1 SYNOPSIS

  best_k [--top|highest|worst|lowest] [--k=<k>] [<file>] ...
  best_k --help

=head2 Options

=over 4

=item --top|highest|worst|lowest

If top or highest, then the highest K items are chosen.
If worst or lowest, then the lowest K items are chosen.
Default it top.

=item --k=<k>

Set the number of items to choose.
Default is 10.

=item --help

Print this message.

=back

=head1 REQUIRES

=head1 DESCRIPTION

Find the best k items of a set.

=head1 FILES

=head1 SEE ALSO

=head1 AUTHOR

Mr. Shawn H. Corey, B.Sc.

=head1 HISTORY

  $Log$

=cut

-- 
__END__

Just my 0.00000002 million dollars worth,
   --- Shawn

"For the things we have to learn before we can do them,
we learn by doing them."
  Aristotle

* Perl tutorials at http://perlmonks.org/?node=Tutorials
* A searchable perldoc is at http://perldoc.perl.org/




-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
<http://learn.perl.org/> <http://learn.perl.org/first-response>


Reply via email to