Casey West wrote:
> I'm beta-testing a robot that searches Google when new questions are
> posed to the beginners' lists. I have no idea if it will be useful.
> :-)
>
> I'm going to watch it closely and hope it is. I'll remove it if I
> find that it does a bad job.
>
> Casey West
Hi Casey,

I'm getting in on this sorta late, but here's my $0.02 worth: I don't mind getting the bot responses. I guess I may be in a minority on this subject, though. One thing I do see is a fairly broad-spectrum search that sometimes shoots pretty wide of the mark. There are a couple of branches to this, in my view:

1. The search seems to respond to boilerplate with equal or greater weight than to the meat of the question. I see the same problem with the perldoc -q implementation on my computer. I've got some thoughts on approaches to this, but I'll defer them to later, because they are pretty speculative.

2. There may be benefit to using a prioritized search pattern built from the significant content of the search string.

I have been working on an archive manager for my record of this list [actually a generalized mailbox archive manager], and here is the approach I took. I had three search options: precise phrase [case-insensitive], all words, and any words (a quick sketch of the first and last is below, after the calling code). The current search pattern seems to be more of an all-words search. It might help to narrow that down to demand matches on multiple words.

Within my all-words search, I also used a priority queue system for ordering responses by significance. Here I scan the file, keeping a count of total matches found and ensuring that each word was matched at least once. Note that each entry in the hash pointed to by $found_in, loaded by iterative calls to this routine, has a 'count' element.

    # input: $regexes  -- anonymous array of search strings
    #        $file_key -- a single message sequence number (a key into $files)
    #        $files    -- anonymous hash of filenames, keyed by message sequence number
    #        $found_in -- anonymous hash to be loaded with filenames, keys, and counts
    sub seek_all_words_in_file {
        my ($regexes, $file_key, $files, $found_in) = @_;
        my $file = $files->{$file_key};
        open IN, $file or die "Could not open $file: $!";

        my $matchcount = {};
        $matchcount->{$_} = 0 foreach @$regexes;

        # This gets me past a header section of the file I'm scanning
        my $line;
        $line = <IN> until $line and $line eq "\x0A";

        while (defined($line = <IN>)) {
            foreach my $regex (@$regexes) {
                # get match counts per line for each regex
                if (my $line_match_count = () = $line =~ /$regex/gi) {
                    $matchcount->{$regex} += $line_match_count;
                }
            }
        }
        close IN;

        # filters out the file if any words are missing
        my $matched_all = 1;
        for (@$regexes) {
            $matched_all = 0 if not $matchcount->{$_};
        }
        return if not $matched_all;

        my $count = 0;
        $count += $matchcount->{$_} for @$regexes;
        $found_in->{$file_key}->{filename} = $file if not $found_in->{$file_key};
        $found_in->{$file_key}->{count}    = $count;
    }

The calling function uses the above scanning routine thusly:

    ...
    while (my $file_key = shift @$file_keys) {
        seek_all_words_in_file($regexes, $file_key, $message_files, $found_in);
    }
    display_search_results($found_in, $search_dialog);
    ...

handing the results off to display_search_results, shown a bit further down.
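For completeness, the other two search modes are much simpler. Stripped down to the bare idea, they look something like this (an untested sketch; the sample strings here are invented purely for illustration):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Stand-ins for whatever the search dialog actually supplies:
    my $search_string = 'sort a hash by value';
    my $line          = 'How do I sort a hash by its values?';

    # "Precise phrase" [case-insensitive]: the literal string, metacharacters escaped.
    my $phrase_re = qr/\Q$search_string\E/i;
    print "phrase match\n" if $line =~ $phrase_re;

    # "Any words": one alternation of the quotemeta'd words; a single hit is enough.
    my @words  = split ' ', $search_string;
    my $any_re = join '|', map { quotemeta } @words;
    print "any-words match\n" if $line =~ /$any_re/i;

    # "All words" is what seek_all_words_in_file above does, with the added
    # twist that it counts hits per word so the results can be ranked.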
Now, display_search_results. Keep an eye on the hash pointed to by $best_bets, since that is the actual priority queue mechanism:

    sub display_search_results {
        my ($found_in, $search_dialog) = @_;
        our $message_list;

        # Invert $found_in: group file keys by match count, so the
        # highest-scoring files can be listed first.
        my $best_bets = {};
        foreach my $file_key (keys %$found_in) {
            my $file       = $found_in->{$file_key};
            my $line_count = $file->{count};
            $best_bets->{$line_count} = [] if not $best_bets->{$line_count};
            push @{$best_bets->{$line_count}}, $file_key;
        }

        $message_list->delete('all');
        foreach my $priority_level (sort {$b <=> $a} keys %$best_bets) {
            foreach my $file (sort {$b <=> $a} @{$best_bets->{$priority_level}}) {
                my $details = get_message_info($file);
                add_message_to_tree($file, $details, $message_list, $file);
            }
        }
        set_viewer_status('sort', 'none');
    }

Of course this still somewhat lacks subtlety. For one thing, there is no weighting for the balance of search words in the file being searched. It might be better to give extra "points" to files that have all the words in roughly equal quantity.

Between "precise phrase" and "all words" there is another standard that I hadn't really tried to explore: "words in order". Something like this might work best with the record separator set to a period, so that it scans text on a sentence-by-sentence basis, looking for all the words in the same order as the search phrase, even if intermingled with other text. Unlike the above, I haven't built or tested this, but a general algorithm for the regex might be:

    my $regex = quotemeta shift @search_words;
    $regex .= '.*' . quotemeta(shift @search_words) while @search_words;

which should render a regex that will match any string containing all of @search_words, in order.

Oh yeah, the boilerplate, and what to do with it. It definitely seems that you would want to split it from the content being sought. Whether you could put it to use depends on whether you are varying procedures by search type. If you have a specialized "how do I" search routine, for instance, the phrase or significant parts of it might be used to switch to that routine. It seems that a good chunk of this boilerplate could be identified by a small table of words and phrases: "how", "what", "do I", "do you", etc. (a rough sketch is in the P.S. below).

Of course, I'm not sure you were meaning to take all this on for the bot, but intelligent searches do pose some interesting challenges.

Joseph
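P.S. In case it helps, here is the sort of boilerplate filter I had in mind. It's just a sketch; the phrase table is invented on the spot and would need tuning against real questions:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Phrases that carry no search value; strip them before building the regexes.
    my @boilerplate = (
        qr/\bhow\s+(?:do|would|can)\s+(?:i|you|we)\b/i,
        qr/\bwhat\s+is\s+the\s+best\s+way\s+to\b/i,
        qr/\bdoes\s+anyone\s+know\b/i,
    );

    my $question = 'How do I sort a hash by its values?';

    # Peel the boilerplate off, then tidy the leftover whitespace.
    my $meat = $question;
    $meat =~ s/$_//g for @boilerplate;
    $meat =~ s/^\s+|\s+$//g;

    print "search on: $meat\n";   # prints: search on: sort a hash by its values?

    # A match against the first pattern could also be the cue to switch to a
    # specialized "how do I" routine, as mentioned above.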