Hi,

First, a note on the JavaScript error.  I did a File->Save to disk, and
then read that page into NS4.0, and the JavaScript worked.  The difference
is that the style sheet is not loaded when reading the page off disk.  I
wonder if it's just the CSS, or if the stylesheet is making it hard to
find the correct form name.

I've been testing out using swish-e's "buzzword" feature to see if that can
help search perl code.  Normally, something like $| is not indexed, but if
that's defined as a buzzword then swish will index that.  Of course, swish
has to find that string, and how it does that is by tokenizing the text
into white-space separated words, then removing leading and trailing
characters.  These are defined with the IgnoreFirstChar and IgnoreLastChar
settings.

For example:

Buzzwords mod_perl c++ $_ $+ $| $_ @$ @_ $ENV $SIG
IgnoreFirstChar ({[]}):,
IgnoreLastChar ({[]}):,.

So it will find $| (or any of the other ones defined) if they are separated
by white space (and optionally those characters).
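As a rough sketch of that tokenize-then-strip step (the buzzword list and
character classes below mirror the example config above; the sub and
variable names are mine, not swish-e's):

```perl
use strict;
use warnings;

# Hypothetical sketch of the buzzword pass described above: split on
# whitespace, strip the IgnoreFirstChar/IgnoreLastChar characters, then
# check the stripped token against the buzzword list.
my %buzzwords = map { $_ => 1 }
    qw( mod_perl c++ $_ $+ $| @$ @_ $ENV $SIG );

sub find_buzzwords {
    my ($text) = @_;
    my @found;
    for my $token ( split ' ', $text ) {
        $token =~ s/^[(){}\[\]:,]+//;    # IgnoreFirstChar ({[]}):,
        $token =~ s/[(){}\[\]:,.]+$//;   # IgnoreLastChar  ({[]}):,.
        push @found, $token if $buzzwords{$token};
    }
    return @found;
}
```

So a line like "print ($|) then $ENV:" yields the buzzwords $| and $ENV,
since the parens and colon are stripped before the lookup.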

It's not perfect, but it gets us closer.  Clearly, it won't catch all
situations.  You can try it at the URL I posted earlier. It will be
interesting to use grep to compare -- and that might help to improve indexing.

Stas, you recommended creating two indexes.  But I've been having a hard
time finding a set of characters that could be used to separate the parts
of the perl code that someone might want to search.  Plus that will
complicate searching.  So I think using these "buzzwords" will probably
be our best bet.

Normally, (without buzzwords) swish has a list of characters called
"WordCharacters" which defines the allowable characters in a word.

A more typical setup might be to define WordCharacters as A-Z and, say, the
underscore.  And then set IgnoreFirstChar and IgnoreLastChar also to the
underscore.  Then swish indexes: "mod_perl", "__mod_perl", and "mod_perl__"
all as "mod_perl".
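That normalization could be sketched like this (assuming that setup:
WordCharacters of a-z plus underscore, underscore also as IgnoreFirstChar
and IgnoreLastChar -- the sub name is mine):

```perl
use strict;
use warnings;

# Sketch: with underscore as both IgnoreFirstChar and IgnoreLastChar,
# leading and trailing underscores are stripped before a word is indexed.
sub normalize_word {
    my ($word) = @_;
    $word =~ s/^_+//;    # IgnoreFirstChar: underscore
    $word =~ s/_+$//;    # IgnoreLastChar:  underscore
    return lc $word;
}
```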

This makes it "somewhat" easy to do term highlighting since you can build a
regular expression:

  @words = split /$ignoref([$wc]+?)$ignorel/i, $text;

and end up with an array of all the text, separated into words that swish
indexes and all the other text.  That makes processing easy.
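Here's a sketch of what I mean, using the "typical" setup above
(WordCharacters of a-z plus underscore, underscore ignored at either end;
the sub name and term-passing convention are mine):

```perl
use strict;
use warnings;

# Hypothetical sketch of split-based highlighting.  Capturing the
# ignored characters as well keeps every byte of the original text in
# the list, so the output can be reassembled exactly.
my $wc      = 'a-z_';
my $ignoref = '_*';    # IgnoreFirstChar: underscore
my $ignorel = '_*';    # IgnoreLastChar:  underscore

sub highlight {
    my ( $text, %terms ) = @_;

    # limit of -1 keeps trailing empty fields so the groups stay aligned
    my @parts = split /($ignoref)([$wc]+)($ignorel)/i, $text, -1;

    my $out = '';
    while (@parts) {
        $out .= shift @parts;            # text between indexed words
        last unless @parts;
        my ( $first, $word, $last ) = splice @parts, 0, 3;
        $out .= $first
              . ( $terms{ lc $word } ? "<b>$word</b>" : $word )
              . $last;
    }
    return $out;
}
```

For example, highlighting "mod_perl" in "print __mod_perl here" marks up
just the word, leaving the stripped underscores untouched.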

But throw in buzzwords and it's not that simple, since the text has to be
split by whitespace, stripped of the IgnoreFirstChar and IgnoreLastChar
characters, checked for buzzwords, then split by WordCharacters.
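Per token, that pipeline looks something like this (a sketch with a cut-down
buzzword list; the sub name is mine):

```perl
use strict;
use warnings;

# Hypothetical sketch of the buzzword-aware pass: each whitespace token
# is stripped and checked against the buzzword list first; only
# non-buzzwords fall through to the WordCharacters split.
my %buzzwords = map { $_ => 1 } qw( mod_perl $| $ENV );

sub words_for_token {
    my ($token) = @_;
    $token =~ s/^[(){}\[\]:,]+//;    # IgnoreFirstChar
    $token =~ s/[(){}\[\]:,.]+$//;   # IgnoreLastChar
    return ($token) if $buzzwords{$token};   # buzzword: keep it whole
    return $token =~ /([a-z_]+)/gi;          # otherwise, WordCharacters
}
```

So "($|)" comes back whole as the buzzword $|, while a non-buzzword token
like "Apache::Registry," breaks into "Apache" and "Registry".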

Why am I telling you this?  The first reason is that the term highlighting
(which I spent most of the day writing) is really slow.  It has to process
each word one by one, and the code is kind of ugly.  It's not a simple
regex search and replace, which would be a lot faster, but less accurate
and would highlight phrases.  I kind of think it's important that the
highlighting is reasonably accurate.

The second reason is that maybe, if I'm lucky, someone will tell me how
to make it faster!  I've asked on PerlMonks, but I've not had great luck
there (I think it's a more complicated problem than can be answered in a
short look-over by someone).  Someday I'll write it in C, but I'm not sure
that would be that much faster, either.

Anyway, think that will get us by for a while for searching perl code?


-- 
Bill Moseley
mailto:[EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
