Casey West wrote:
> I'm beta-testing a robot that searches Google when new questions are
> posed to the beginners' lists. I have no idea if it will be useful.
> :-)
>
> I'm going to watch it closely and hope it is. I'll remove it if I
> find that it does a bad job.
>
> Casey West
Hi Casey,

I'm getting in on this sorta late, but here's my $0.02 worth: I don't mind getting the bot responses. I guess I may be in a minority on this subject, though. One thing I do see is a fairly broad-spectrum search that sometimes shoots pretty wide of the mark. There are a couple of branches to this, in my view:

1. The search seems to respond to boilerplate with equal or greater weight than to the meat of the question. I see the same problem with the perldoc -q implementation on my computer. I've got some thoughts on approaches to this, but I'll defer them to later, because they are pretty speculative.

2. There may be benefit to using a prioritized search pattern built from the significant content of the search string.

I have been working on an archive manager for my record of this list [actually a generalized mailbox archive manager], and here is the approach I took. I had three search options: precise phrase [case-insensitive], all words, and any words (a quick sketch of the first and last is below, after the calling code). The current search pattern seems to be more of an all-words search. It might help to narrow that down to demand matches on multiple words.

Within my all-words search, I also used a priority queue system for ordering responses by significance. Here I scan the file, keeping a count of total matches found and ensuring that each word was matched at least once. Note that each entry in the hash pointed to by $found_in, loaded by iterative calls to this routine, has a 'count' element.

    # input: $regexes  -- anonymous array of search strings
    #        $file_key -- a single message sequence number (a key into $files)
    #        $files    -- anonymous hash of filenames, keyed by message sequence number
    #        $found_in -- anonymous hash to be loaded with filenames, keys, and counts
    sub seek_all_words_in_file {
        my ($regexes, $file_key, $files, $found_in) = @_;
        my $file = $files->{$file_key};
        open IN, $file or die "Could not open $file: $!";

        my $matchcount = {};
        $matchcount->{$_} = 0 foreach @$regexes;

        # This gets me past a header section of the file I'm scanning
        my $line;
        $line = <IN> until $line and $line eq "\x0A";

        while (defined($line = <IN>)) {
            foreach my $regex (@$regexes) {
                # get match counts per line for each regex
                if (my $line_match_count = () = $line =~ /$regex/gi) {
                    $matchcount->{$regex} += $line_match_count;
                }
            }
        }
        close IN;

        # filters out the file if any words are missing
        my $matched_all = 1;
        for (@$regexes) {
            $matched_all = 0 if not $matchcount->{$_};
        }
        return if not $matched_all;

        my $count = 0;
        $count += $matchcount->{$_} for @$regexes;
        $found_in->{$file_key}->{filename} = $file if not $found_in->{$file_key};
        $found_in->{$file_key}->{count}    = $count;
    }

The calling function uses the above scanning routine thusly:

    ...
    while (my $file_key = shift @$file_keys) {
        seek_all_words_in_file($regexes, $file_key, $message_files, $found_in);
    }
    display_search_results($found_in, $search_dialog);
    ...

handing the results off to display_search_results, shown a bit further down.
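For completeness, the other two search modes are much simpler. Stripped down to the bare idea, they look something like this (an untested sketch; the sample strings here are invented purely for illustration):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Stand-ins for whatever the search dialog actually supplies:
    my $search_string = 'sort a hash by value';
    my $line          = 'How do I sort a hash by its values?';

    # "Precise phrase" [case-insensitive]: the literal string, metacharacters escaped.
    my $phrase_re = qr/\Q$search_string\E/i;
    print "phrase match\n" if $line =~ $phrase_re;

    # "Any words": one alternation of the quotemeta'd words; a single hit is enough.
    my @words  = split ' ', $search_string;
    my $any_re = join '|', map { quotemeta } @words;
    print "any-words match\n" if $line =~ /$any_re/i;

    # "All words" is what seek_all_words_in_file above does, with the added
    # twist that it counts hits per word so the results can be ranked.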
Now, display_search_results. Keep an eye on the hash pointed to by $best_bets, since that is the actual priority queue mechanism:

    sub display_search_results {
        my ($found_in, $search_dialog) = @_;
        our $message_list;

        # Invert $found_in: group file keys by match count, so the
        # highest-scoring files can be listed first.
        my $best_bets = {};
        foreach my $file_key (keys %$found_in) {
            my $file       = $found_in->{$file_key};
            my $line_count = $file->{count};
            $best_bets->{$line_count} = [] if not $best_bets->{$line_count};
            push @{$best_bets->{$line_count}}, $file_key;
        }

        $message_list->delete('all');
        foreach my $priority_level (sort {$b <=> $a} keys %$best_bets) {
            foreach my $file (sort {$b <=> $a} @{$best_bets->{$priority_level}}) {
                my $details = get_message_info($file);
                add_message_to_tree($file, $details, $message_list, $file);
            }
        }
        set_viewer_status('sort', 'none');
    }

Of course this still somewhat lacks subtlety. For one thing, there is no weighting for the balance of search words in the file being searched. It might be better to give extra "points" to files that have all the words in roughly equal quantity.

Between "precise phrase" and "all words" there is another standard that I hadn't really tried to explore: "words in order". Something like this might work best with the record separator set to a period, so that it scans text on a sentence-by-sentence basis, looking for all the words in the same order as the search phrase, even if intermingled with other text. Unlike the above, I haven't built or tested this, but a general algorithm for the regex might be:

    my $regex = quotemeta shift @search_words;
    $regex .= '.*' . quotemeta(shift @search_words) while @search_words;

which should render a regex that will match any string containing all of @search_words, in order.

Oh yeah, the boilerplate, and what to do with it. It definitely seems that you would want to split it from the content being sought. Whether you could put it to use depends on whether you are varying procedures by search type. If you have a specialized "how do I" search routine, for instance, the phrase or significant parts of it might be used to switch to that routine. It seems that a good chunk of this boilerplate could be identified by a small table of words and phrases: "how", "what", "do I", "do you", etc. (a rough sketch is in the P.S. below).

Of course, I'm not sure you were meaning to take all this on for the bot, but intelligent searches do pose some interesting challenges.

Joseph
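P.S. In case it helps, here is the sort of boilerplate filter I had in mind. It's just a sketch; the phrase table is invented on the spot and would need tuning against real questions:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Phrases that carry no search value; strip them before building the regexes.
    my @boilerplate = (
        qr/\bhow\s+(?:do|would|can)\s+(?:i|you|we)\b/i,
        qr/\bwhat\s+is\s+the\s+best\s+way\s+to\b/i,
        qr/\bdoes\s+anyone\s+know\b/i,
    );

    my $question = 'How do I sort a hash by its values?';

    # Peel the boilerplate off, then tidy the leftover whitespace.
    my $meat = $question;
    $meat =~ s/$_//g for @boilerplate;
    $meat =~ s/^\s+|\s+$//g;

    print "search on: $meat\n";   # prints: search on: sort a hash by its values?

    # A match against the first pattern could also be the cue to switch to a
    # specialized "how do I" routine, as mentioned above.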