Re: Searching perl

Stas Bekman 7 Mar 2002 04:05:41 -0000

Bill Moseley wrote:

Hi,

First, a note on the Javascript error.  I did a File->Save to disk, and
then read that into NS4.0 and the Javascript worked.  The difference is the
style sheet is not loaded when reading the page off disk.  I wonder if it's
just the CSS or if the stylesheet is making it hard to find the correct
form name.

I've been testing out using swish-e's "buzzword" feature to see if that can
help search perl code.  Normally, something like $| is not indexed, but if
that's defined as a buzzword then swish will index that.  Of course, swish
has to find that string, and how it does that is by tokenizing the text
into white-space separated words, then removing leading and trailing
characters.  These are defined with the IgnoreFirstChar and IgnoreLastChar
settings.

For example:

Buzzwords mod_perl c++ $_ $+ $| $_ @$ @_ $ENV $SIG
IgnoreFirstChar ({[]}):,
IgnoreLastChar ({[]}):,.

So it will find $| (or any of the other ones defined) if they are separated
by white space (and optionally those characters).
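A minimal sketch of that tokenizing, as described above and not taken from the swish-e source. The sub name find_buzzwords and the exact-match hash lookup are assumptions made for illustration; the buzzword list and the ignore characters come from the example settings.

```perl
use strict;
use warnings;

# Stand-ins for the Buzzwords / IgnoreFirstChar / IgnoreLastChar settings.
my %buzzwords    = map { $_ => 1 } qw( mod_perl c++ $_ $+ $| @$ @_ $ENV $SIG );
my $ignore_first = quotemeta '({[]}):,';
my $ignore_last  = quotemeta '({[]}):,.';

# Hypothetical helper: return the buzzwords found in a piece of text.
sub find_buzzwords {
    my ($text) = @_;
    my @found;
    for my $token ( split ' ', $text ) {      # white-space separated words
        $token =~ s/^[$ignore_first]+//;      # strip leading characters
        $token =~ s/[$ignore_last]+$//;       # strip trailing characters
        push @found, $token if $buzzwords{$token};
    }
    return @found;
}

print join( ' ', find_buzzwords('Set $| to 1, then check ($ENV).') ), "\n";
```

With these settings a token such as print($_); is not found, because "print" and ";" are not stripped. That is the kind of case it won't catch.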

It's not perfect, but it gets us closer.  Clearly, it won't catch all
situations.  You can try it at the URL I posted earlier. It will be
interesting to use grep to compare -- and that might help to improve indexing.

Stas, you recommended creating two indexes.  But I've been having a hard
time finding a set of characters that could be used to separate the parts
of the perl code that someone might want to search.  Plus that will
complicate searching.  So I think using these "buzzwords" will probably be
our best bet.

Normally (without buzzwords), swish has a list of characters called
"WordCharacters" which defines the allowable characters in a word.

A more typical setup might be to define WordCharacters as A-Z and, say, the
underscore.  And then set IgnoreFirstChar and IgnoreLastChar also to the
underscore.  Then swish indexes: "mod_perl", "__mod_perl", and "mod_perl__"
all as "mod_perl".
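Roughly, and only as a sketch under those assumed settings (letters plus underscore as WordCharacters, underscore as the ignored first and last character), the normalization looks like this. The sub name index_terms is made up for the example.

```perl
use strict;
use warnings;

my $wc     = 'A-Za-z_';   # assumed WordCharacters
my $ignore = '_';         # assumed IgnoreFirstChar and IgnoreLastChar

# Hypothetical helper: return the terms that would end up in the index.
sub index_terms {
    my ($text) = @_;
    my @terms;
    for my $word ( $text =~ /([$wc]+)/g ) {   # runs of word characters
        $word =~ s/^[$ignore]+//;             # drop leading underscores
        $word =~ s/[$ignore]+$//;             # drop trailing underscores
        push @terms, $word if length $word;
    }
    return @terms;
}

print join( ' ', index_terms('mod_perl __mod_perl mod_perl__') ), "\n";
```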

This makes it "somewhat" easy to do term highlighting since you can build a
regular expression:

  @words = split /^$ignoref([$wc]+?)$ignorel$/i, $text;

and end up with an array of all the text, separated into words that swish
indexes and all the other text.  That makes processing easy.
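A simplified sketch of that idea, not the exact expression above: split on a capturing group of word characters, so the words land at the odd positions of the array and everything else at the even ones. The sub name highlight, the <b> markup and the lower-cased comparison are assumptions for the example.

```perl
use strict;
use warnings;

my $wc = 'A-Za-z_';   # assumed WordCharacters

# Hypothetical helper: wrap every occurrence of a wanted term in <b>...</b>.
sub highlight {
    my ( $text, %wanted ) = @_;
    # The capturing group keeps the words in the result list.
    my @parts = split /([$wc]+)/, $text;
    for my $i ( grep { $_ % 2 } 0 .. $#parts ) {          # odd slots are words
        ( my $term = lc $parts[$i] ) =~ s/^_+|_+$//g;     # ignore first/last "_"
        $parts[$i] = "<b>$parts[$i]</b>" if $wanted{$term};
    }
    return join '', @parts;
}

print highlight( 'Use mod_perl, or __mod_perl__!', mod_perl => 1 ), "\n";
```

Joining the array back together gives the original text with only the matched words changed.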

But throw in buzzwords and it's not that simple, since the text has to be
split by whitespace, stripped of the IgnoreFirstChar and IgnoreLastChar
characters, checked for buzzwords, then split by wordcharacters.
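Those four steps could be sketched like this. It is an illustration of the order of operations only, not the highlighting code itself; the sub name segments, the shortened buzzword list and the [ text, is_word ] pairs are assumptions.

```perl
use strict;
use warnings;

my %buzzwords = map { $_ => 1 } qw( mod_perl c++ $_ $| $ENV );
my $first     = quotemeta '({[]}):,';    # IgnoreFirstChar
my $last      = quotemeta '({[]}):,.';   # IgnoreLastChar
my $wc        = 'A-Za-z_';               # assumed WordCharacters

# Hypothetical helper: return [ $text, $is_word ] pairs that rejoin
# to exactly the original text.
sub segments {
    my ($text) = @_;
    my @out;
    for my $chunk ( split /(\s+)/, $text ) {             # 1. split by whitespace
        if ( $chunk =~ /^\s*$/ ) {
            push @out, [ $chunk, 0 ] if length $chunk;
            next;
        }
        # 2. peel off the ignored leading and trailing characters
        my ( $lead, $core, $trail ) =
            $chunk =~ /^([$first]*)(.*?)([$last]*)$/s;
        push @out, [ $lead, 0 ] if length $lead;
        if ( $buzzwords{$core} ) {                       # 3. buzzword check
            push @out, [ $core, 1 ];
        }
        else {                                           # 4. split by wordcharacters
            my @parts = split /([$wc]+)/, $core;
            for my $i ( 0 .. $#parts ) {
                push @out, [ $parts[$i], $i % 2 ] if length $parts[$i];
            }
        }
        push @out, [ $trail, 0 ] if length $trail;
    }
    return @out;
}

my @words = map { $_->[0] } grep { $_->[1] } segments('($ENV), see mod_perl.');
print "@words\n";
```

Every chunk goes through several regex matches, which gives some idea of why the word-by-word approach is slower than a single search and replace.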

Why am I telling you this?  The first reason is that the term highlighting
(which I spent most of the day writing) is really slow.  It has to process
each word one by one, and the code is kind of ugly.  It's not a simple regex
search and replace, which would be a lot faster, but less accurate and
would highlight phrases.  I kind of think it's important that the
highlighting is reasonably accurate.

The second reason is that maybe, if I'm lucky, someone will tell me how to
make it faster!  I've asked on Perlmonks, but I've not had great luck there
(I think it's a more complicated problem than can be answered in a short
lookover by someone).  Someday I'll write it in C, but I'm not sure that
would be that much faster, either.

Anyway, think that will get us by for a while for searching perl code?

Hmm, you mean the thread we had before and its outcome were useless? I thought we had agreed on putting most of the chars into the WordCharacters variable. I'm absolutely against trying to come up with a list of buzzwords, since it's going to take so much work and you still won't cover all the required "special" words. If we can search for Apache::Registry and $! because they are "words", that's all we need. This is what I've suggested adding to WordCharacters: >-$%@:*[]{}|&

Actually it would be cool if swish-e was able to accept char sequences as valid word segments. For example we don't want > or - to be counted as parts of the word, but we do want -> (think $r->no_cache)
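To illustrate what I mean, here is a sketch of such tokenizing in Perl. It is only an idea of the behaviour, not something swish-e offers as far as this thread shows; the character set, the sequence list and the sub name words are all assumptions.

```perl
use strict;
use warnings;

# Single characters allowed in a word, plus multi-character sequences
# that count as part of a word only when they appear whole.
my $char      = qr/[A-Za-z0-9_\$]/;
my @sequences = ( '->', '::' );
my $seq       = join '|', map { quotemeta($_) } @sequences;
my $word      = qr/(?:$char|$seq)+/;

# Hypothetical helper: return the words found in a line of code.
sub words {
    my ($text) = @_;
    my @found = $text =~ /($word)/g;
    return @found;
}

print join( ' | ', words('$r->no_cache(1) if $a > $b - 1;') ), "\n";
```

Here $r->no_cache stays in one piece, while the lone > and - are dropped.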

Also you were talking about highlighting. Are you talking about highlighting for the snippets presented on the hits page? Or are you working on highlighting in the real 'full' text?

_____________________________________________________________________
Stas Bekman             JAm_pH      --   Just Another mod_perl Hacker
http://stason.org/      mod_perl Guide   http://perl.apache.org/guide
mailto:[EMAIL PROTECTED]  http://ticketmaster.com http://apacheweek.com
http://singlesheaven.com http://perl.apache.org http://perlmonth.com/

