Re: how to optimize a string search

2003-11-10 Thread R. Joseph Newton
Ramprasad A Padmanabhan wrote:

 I know this is more of an algorithm question but please bear with me.

 In my program I am checking wether a emailid exists in a list
 I have in the complete_list a string like
 $complete_string=[EMAIL PROTECTED] [EMAIL PROTECTED] [EMAIL PROTECTED] ... 
 [EMAIL PROTECTED];

 #  that is a string of emailids sorted

 Now I want to find out if another ID  [EMAIL PROTECTED] exists in this list

 How is it possible to optimize the search  given that the complete list
 string is sorted in order

Don't optimize--at least not until you are sure your getting the information
right.  It is just the wrong emphasis.

Here are a couple of approaches that have worked for me.  The first, used in
building an index, is predicated on the concept of saving all possible
information.  The second, used in loading a tree structure, takes as a design
imperative that any node should focus only on one prime reference, in this case
the post being replied to.  The wealth of possibilities can be a recipe for
inifnite loops when loading a tree, hence it's better to throw away some
information than to follow potentially circular links.

First approach:

Setup:

The only information available to this routine is the $message_info hash
reference.  This has the form:

$message_info = {
 $sequence_number1 = {
  'Message-ID' = $message_id,
  Subject  = $subject
   ...
  }
  ...
}

The references item is a potentially multi-item string, and this routine
handles this.  $reference_index and $no_references are sent into the
gather_References_items function to load their data structures.  The latter,
no_references, is an afterthought.


sub create_References_index {
  my $message_info = $_[0];
  my $reference_index = {};
  my $no_references = [];

  gather_References_items($message_info, $reference_index, $no_references);

  open OUT, dbm/referenc.ndi or die
   Could not open file to write Reference index: $!;
  foreach my $column_value (sort {uc $a cmp uc $b} keys %$reference_index) {
my $index_row = column_index_line($reference_index, $column_value);
print OUT $index_row\n;
  }
  close OUT or die
   Could not close file after writing Reference index: $!;
  create_no_reference_index($no_references);
}

sub gather_References_items {
  my ($message_info, $reference_index, $no_references) = @_;

  foreach my $msg_key (keys %$message_info) {
my $data_field = $message_info-{$msg_key}-{'References'};
if (!$data_field) {
  push(@$no_references, $msg_key);
}
next if !$data_field or $data_field !~ /\S/;
$data_field =~ s/^(.*)$/$1/;
my @references = split /\s*/, $data_field;
foreach my $reference (@references) {
  $reference_index-{$reference} = []   # create the anonymous target array

   if !defined($reference_index-{$reference});  # if it doesn't exist
  push @{$reference_index-{$reference}}, $msg_key;
}
  }
}


The above very hadily preserves all sequence numbers of posts referencing any
given Message-ID.  The following is more focused on getting just one [hoefully
the best, but more importantly cogent].  It is used in the active process of
loading a threaded mail index.  As I said, the focus here is on keeping the
data lean and focused.  That is why, instead of dealing with the whole list of
References items, I just pop until I find the one that seems best suited, then
get out, leaving extraneous data behind.  I tried using different data
structures to hold different kinds of relationships, and ended up fighting with
the loading routine all day.  This works:


sub establish_parentage {
  my ($message_details, $subjects, $message_ids, $threads) = @_;

  my $no_references = [];
  my $candidate_roots = [sort {$a = $b} keys %$message_details];
  my $parents = {};
  my $temp = [];
  while (my $candidate = shift @$candidate_roots) {
seek_explicit_references($candidate, $message_details,
 $no_references, $parents, $message_ids);
  }
  $candidate_roots = $no_references;
  $no_references = [];
  while (my $candidate = shift @$candidate_roots) {
seek_implicit_references($candidate, $message_details, $no_references,
$parents,
 $subjects);
  }

  return $parents;
}

sub seek_explicit_references {
  my ($candidate, $message_details, $no_references, $parents, $message_ids) =
@_;

  my $details = $message_details-{$candidate};
  return if $details-{'parent'};
  my $references = $details-{'References'};
  $references =~ s/^\s+//;
  $references =~ s/\s+$//;
  my @references = split /\s+/, $references;
  my $found = 0;
  while (my $reference = pop @references) {
if (my $parent_sequence = $message_ids-{$reference}) {
  $details-{'parent'} = $parent_sequence;
  $parents-{$candidate} = $parent_sequence;
  $found = 1;
  last;
}
  }
  push @$no_references, $candidate if not $found;
}

sub seek_implicit_references {
  my ($candidate, $message_details, $no_references, $parents, $subjects) = @_;
#  We won't 

Re: how to optimize a string search

2003-11-06 Thread John W. Krahn
Ramprasad A Padmanabhan wrote:
 
 I know this is more of an algorithm question but please bear with me.
 
 In my program I am checking wether a emailid exists in a list
 I have in the complete_list a string like
 $complete_string=[EMAIL PROTECTED] [EMAIL PROTECTED] [EMAIL PROTECTED] ... 
 [EMAIL PROTECTED];

Why store it in a string and not an array?

 #  that is a string of emailids sorted
 
 Now I want to find out if another ID  [EMAIL PROTECTED] exists in this list
 
 How is it possible to optimize the search  given that the complete list
 string is sorted in order

The fastest string search method IIRC is Boyer-Moore and variants
thereof.  Perl provides the index() function for searching strings which
may be fast enough for your purposes.  If you had stored the sorted list
in an array then you could have used a binary search which is very
efficient.

 The alternative I am considering is
1) Create a hash array with each of the ids as keys and just use the
 exits function

This is also very efficient as long as the keys are unique.

 like
 my %hash=();
 while($complete_list=~/(\.*?\)/g){ $hash{$1}=1 };

Or:

my %hash = map { $_ = 1 } $complete_list =~ /[^]*/g

 
 if(exists($hash{$emailid})) {
 # FOUND id in string
 }
 
 Is this the best way

The hash has faster lookups but uses more memory then a single string.

 Is there no way directly to use the string

If all the entries in the string were the same length (space padded)
then you could use a binary search on the string.


John
-- 
use Perl;
program
fulfillment

-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



how to optimize a string search

2003-10-27 Thread Ramprasad A Padmanabhan
I know this is more of an algorithm question but please bear with me.

In my program I am checking wether a emailid exists in a list
I have in the complete_list a string like
$complete_string=[EMAIL PROTECTED] [EMAIL PROTECTED] [EMAIL PROTECTED] ... [EMAIL 
PROTECTED];
#  that is a string of emailids sorted

Now I want to find out if another ID  [EMAIL PROTECTED] exists in this list

How is it possible to optimize the search  given that the complete list 
string is sorted in order

The alternative I am considering is
  1) Create a hash array with each of the ids as keys and just use the 
exits function

like
my %hash=();
while($complete_list=~/(\.*?\)/g){ $hash{$1}=1 };

if(exists($hash{$emailid})) {
# FOUND id in string
}
Is this the best way

Is there no way directly to use the string
Thanks
Ram
--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: how to optimize a string search

2003-10-27 Thread Rob Dixon
Ramprasad A Padmanabhan wrote:

 I know this is more of an algorithm question but please bear with me.


 In my program I am checking wether a emailid exists in a list
 I have in the complete_list a string like
 $complete_string=[EMAIL PROTECTED] [EMAIL PROTECTED] [EMAIL PROTECTED] ... 
 [EMAIL PROTECTED];

 #  that is a string of emailids sorted

 Now I want to find out if another ID  [EMAIL PROTECTED] exists in this list

 How is it possible to optimize the search  given that the complete list
 string is sorted in order

 The alternative I am considering is
1) Create a hash array with each of the ids as keys and just use the
 exits function

 like
 my %hash=();
 while($complete_list=~/(\.*?\)/g){ $hash{$1}=1 };
 
 if(exists($hash{$emailid})) {
 # FOUND id in string
 }

 Is this the best way

 Is there no way directly to use the string

Hi Ram.

If you want to check many different mail addresses against the list then it
will be best to build a hash. You could assign a slice, like this:

  my %hash;
  @hash{ $complete_string =~ /(\.*?\)/g } = ();

  if (exists $hash{$emailid}) {
:
  }

but beware that the values for each element will be
'undef' if you do this, so that you have to use 'exists'
instead of 'defined'.

If you're just testing a single mail address then you'd probably
be better off doing

  if (index($complete_string, $emailid) = 0) {
:
  }

HTH,

Rob



-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]