Hi,
First, a note on the JavaScript error. I did a File->Save to disk, then loaded the saved page into NS4.0, and the JavaScript worked. The difference is that the style sheet is not loaded when reading the page off disk. I wonder if it's just the CSS, or if the stylesheet is making it hard to find the correct form name.
I've been testing out using swish-e's "buzzword" feature to see if that can help search perl code. Normally, something like $| is not indexed, but if that's defined as a buzzword then swish will index that. Of course, swish has to find that string, and how it does that is by tokenizing the text into white-space separated words, then removing leading and trailing characters. These are defined with the IgnoreFirstChar and IgnoreLastChar settings.
For example:
Buzzwords mod_perl c++ $_ $+ $| $_ @$ @_ $ENV $SIG
IgnoreFirstChar ({[]}):,
IgnoreLastChar ({[]}):,.
So it will find $| (or any of the other ones defined) if they are separated by white space (and optionally those characters).
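To make that concrete, here is a rough Perl sketch of the matching step as described: split on white space, strip the leading and trailing ignore characters, then look the result up in the buzzword list. It is only an illustration of the description, not swish-e's actual code; the names %buzzword and find_buzzwords are made up for this example, and exact (case-sensitive) matching is an assumption.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the Buzzwords / IgnoreFirstChar / IgnoreLastChar
# settings shown above (not read from a real swish-e config file).
my %buzzword     = map { $_ => 1 } qw(mod_perl c++ $_ $+ $| @$ @_ $ENV $SIG);
my $ignore_first = quotemeta '({[]}):,';
my $ignore_last  = quotemeta '({[]}):,.';

# Return the buzzwords found in $text, in order of appearance.
sub find_buzzwords {
    my ($text) = @_;
    my @found;
    for my $token (split ' ', $text) {        # white-space separated tokens
        $token =~ s/^[$ignore_first]+//;      # drop leading ignore chars
        $token =~ s/[$ignore_last]+$//;       # drop trailing ignore chars
        push @found, $token if $buzzword{$token};
    }
    return @found;
}

my @hits = find_buzzwords('local $| = 1; foreach (@_) { print $ENV{PATH}; }');
print "@hits\n";    # $ENV{PATH}; is one token, so $ENV is not found here
```

Note how $ENV is missed in the sample line because it is glued to {PATH}; with no white space in between. That is the kind of case this approach won't catch.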
It's not perfect, but it gets us closer. Clearly, it won't catch all situations. You can try it at the URL I posted earlier. It will be interesting to use grep to compare -- and that might help to improve indexing.
Stas, you recommended creating two indexes. But I've been having a hard time finding a set of characters that could be used to separate the parts of the perl code that someone might want to search. Plus that will complicate searching. So I think using these "buzzwords" will probably be our best bet.
Normally, (without buzzwords) swish has a list of characters called "WordCharacters" which defines the allowable characters in a word.
A more typical setup might be to define WordCharacters as A-Z and, say, the underscore. And then set IgnoreFirstChar and IgnoreLastChar also to the underscore. Then swish indexes: "mod_perl", "__mod_perl", and "mod_perl__" all as "mod_perl".
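A small Perl sketch of that typical setup, assuming letters plus underscore as the word characters and underscore as the only ignore character. The variable names and the index_terms function are invented for the example, and lowercasing the terms is my assumption, not something taken from swish-e.

```perl
use strict;
use warnings;

# Hypothetical values mirroring the setup described above.
my $word_chars   = 'a-z_';   # WordCharacters (matched case-insensitively)
my $ignore_first = '_';      # IgnoreFirstChar
my $ignore_last  = '_';      # IgnoreLastChar

# Return the terms that would be indexed for $text under these settings.
sub index_terms {
    my ($text) = @_;
    my @terms;
    while ($text =~ /([$word_chars]+)/gi) {
        my $word = lc $1;                  # assumption: terms are lowercased
        $word =~ s/^[$ignore_first]+//;    # strip leading underscores
        $word =~ s/[$ignore_last]+$//;     # strip trailing underscores
        push @terms, $word if length $word;
    }
    return @terms;
}

print join(',', index_terms('mod_perl __mod_perl mod_perl__')), "\n";
```

All three spellings come out as the same term, which is why a search for mod_perl finds each of them.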
This makes it "somewhat" easy to do term highlighting since you can build a regular expression:
@words = split /^$ignoref([$wc]+?)$ignorel$/i, $text;
and end up with an array of all the text, separated into words that swish indexes and all the other text. That makes processing easy.
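A simplified, runnable variant of the same idea, leaving out the IgnoreFirstChar/IgnoreLastChar handling: when the split pattern has a capturing group, Perl's split returns the captured words along with the text between them, so the list alternates between "other text" and "word". The $wc value and the split_words name are assumptions for the example.

```perl
use strict;
use warnings;

my $wc = 'a-z_';    # hypothetical WordCharacters set

# Split $text into alternating pieces: even indexes hold the text between
# words (possibly empty), odd indexes hold the words themselves.
sub split_words {
    my ($text) = @_;
    return split /([$wc]+)/i, $text;
}

my @parts = split_words('use mod_perl; # fast');
print join('|', @parts), "\n";    # prints: |use| |mod_perl|; # |fast
```

Joining the pieces back together with join '' gives the original text, so highlighting only has to wrap the odd-indexed elements that match a query term.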
But throw in buzzwords and it's not that simple, since the text has to be split by whitespace, stripped of the IgnoreFirstChar and IgnoreLastChar characters, checked for buzzwords, and then split by WordCharacters.
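Roughly, the word-by-word processing looks like the sketch below. This is not the real highlighting code, just a minimal illustration of the four steps under some assumptions: the buzzword list and character sets are hypothetical, <b> tags stand in for whatever markup is used, query terms are matched case-insensitively, and no HTML escaping is done.

```perl
use strict;
use warnings;

# Hypothetical settings for the sketch.
my %buzzword = map { $_ => 1 } qw(mod_perl c++ $| @_ $ENV);
my $first    = quotemeta '({[]}):,';     # IgnoreFirstChar
my $last     = quotemeta '({[]}):,.';    # IgnoreLastChar
my $wc       = 'a-z0-9_';                # WordCharacters

# Wrap every occurrence of a query term in <b>...</b>.
sub highlight {
    my ($text, @query) = @_;
    my %want = map { (lc($_) => 1) } @query;
    my $out  = '';
    for my $chunk (split /(\s+)/, $text) {        # keep the white space
        if ($chunk =~ /^\s*$/) { $out .= $chunk; next; }
        # Peel off leading and trailing ignore characters.
        my ($lead, $core, $trail) = $chunk =~ /^([$first]*)(.*?)([$last]*)$/s;
        if ($buzzword{$core}) {
            # Buzzwords are treated as one indivisible term.
            $core = "<b>$core</b>" if $want{ lc $core };
        }
        else {
            # Everything else is split by word characters.
            $core = join '',
                map { $want{ lc $_ } ? "<b>$_</b>" : $_ }
                split /([$wc]+)/i, $core;
        }
        $out .= $lead . $core . $trail;
    }
    return $out;
}

print highlight('local $| = 1; (mod_perl) rocks.', '$|', 'mod_perl'), "\n";
```

Every chunk goes through a regex match, a hash lookup, and possibly another split, which gives an idea of why this is slower than a single search and replace over the whole text.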
Why am I telling you this? The first reason is that the term highlighting (which I spent most of the day writing) is really slow. It has to process each word one by one, and the code is kind of ugly. It's not a simple regex search and replace, which would be a lot faster, but less accurate and would highlight phrases. I kind of think it's important that the highlighting is reasonably accurate.
The second reason is that maybe, if I'm lucky, someone will tell me how to make it faster! I've asked on Perlmonks, but I've not had great luck there (I think it's a more complicated problem than can be answered in a short lookover by someone). Someday I'll write it in C, but I'm not sure that would be that much faster, either.
Anyway, think that will get us by for a while for searching perl code?
Hmm, you mean the thread we had before and the outcome were useless?
I thought we had agreed on putting most of the chars into the WordCharacters variable. I'm absolutely against trying to come up with a list of buzzwords, since it's going to take so much work and you still won't cover all the required "special" words. If we can search for Apache::Registry and $! because they are "words", that's all we need.
This is what I've suggested to add to WordCharacters: >-$%@:*[]{}|&
Actually it would be cool if swish-e was able to accept char sequences as valid word segments. For example we don't want > or - to be counted as parts of the word, but we do want -> (think $r->no_cache)
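To illustrate what I mean (in plain Perl, since as far as I know swish-e has no such setting), here is a hypothetical tokenizer where a word is built from single word characters plus the two-character sequences -> and ::, while a lone > or - still acts as a separator. The $segment pattern and the tokens function are made up for this sketch.

```perl
use strict;
use warnings;

# One "segment" of a word: a single word character or sigil,
# or one of the allowed multi-character sequences.
my $segment = qr/(?:[A-Za-z0-9_\$\@%]|->|::)/;

# Return every maximal run of segments found in $text.
sub tokens {
    my ($text) = @_;
    return $text =~ /($segment+)/g;
}

print join(' | ', tokens('$r->no_cache(1); Apache::Registry->handler; a > b - c')), "\n";
```

Here $r->no_cache and Apache::Registry->handler stay whole, while the > and - in "a > b - c" are dropped.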
Also you were talking about highlighting. Are you talking about highlighting for the snippets presented on the hits page? Or are you working on highlighting in the real 'full' text?
_____________________________________________________________________ Stas Bekman JAm_pH -- Just Another mod_perl Hacker http://stason.org/ mod_perl Guide http://perl.apache.org/guide mailto:[EMAIL PROTECTED] http://ticketmaster.com http://apacheweek.com http://singlesheaven.com http://perl.apache.org http://perlmonth.com/
