Hi Stas,

I just updated the search site for Apache.org with a newer version of
swish.  The context highlighting is a bit silly, but that can be fixed.
I'm only caching the first 15K of text from each page for context
highlighting.

http://search.apache.org

It seems reasonably fast (it's not running under mod_perl currently, but
could -- if mod_perl was in that server ;).

It takes about eight or nine minutes to reindex ~35,000 docs on *.apache.org,
so the mod_perl list (and others) shouldn't be too much trouble, I'd think,
with smaller numbers and smaller content.

It doesn't do incremental indexing at this point, which is a drawback, but
indexing is so fast it normally doesn't matter (and there's an easy
work-around for something like a mailing list to pick up new messages as
they come in during the day).

Swish-e can also call a perl program which feeds docs to swish.  That makes
it easy to parse the email into fields for something like:

  http://swish-e.org/Discussion/search/swish.cgi

which looks a lot like the Apache search site...
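
To make the "program feeds docs to swish" idea concrete, here is a rough
sketch of such a feeder. It is written in Python rather than perl purely for
illustration. The record layout (a Path-Name header, a Content-Length header
giving the byte count, a blank line, then the document) is how I remember
swish-e's external-program input, so treat it as an assumption and check it
against the docs for your version. The function name, the meta tag names,
and the idea of wrapping each message in a small HTML page are all made up
for this example.

```python
import email
import html
import sys
from email import policy


def to_swish_record(path, raw_message):
    """Turn one raw email (bytes) into a header-plus-document record (bytes)."""
    msg = email.message_from_bytes(raw_message, policy=policy.default)

    # Pick the plain-text body if there is one; otherwise index no body text.
    part = msg.get_body(preferencelist=("plain",))
    body = part.get_content() if part is not None else ""

    # Wrap the fields in a tiny HTML page so the indexer can see them as
    # title, meta fields, and body. The meta names are hypothetical and
    # would have to match whatever the index configuration declares.
    doc = (
        "<html><head><title>{subject}</title>"
        '<meta name="from" content="{sender}">'
        '<meta name="date" content="{date}">'
        "</head><body><pre>{body}</pre></body></html>"
    ).format(
        subject=html.escape(str(msg["Subject"] or "")),
        sender=html.escape(str(msg["From"] or "")),
        date=html.escape(str(msg["Date"] or "")),
        body=html.escape(body),
    )
    data = doc.encode("utf-8")

    # Assumed record layout: headers, blank line, then exactly
    # Content-Length bytes of document.
    header = "Path-Name: {}\nContent-Length: {}\n\n".format(path, len(data))
    return header.encode("utf-8") + data


if __name__ == "__main__":
    # Each command-line argument is a file holding one raw message.
    for name in sys.argv[1:]:
        with open(name, "rb") as fh:
            sys.stdout.buffer.write(to_swish_record(name, fh.read()))
```

Run with a list of message files as arguments, it writes one record per
message to stdout, which is where the indexer would read them from.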

But what would be needed is a good threaded mail archiver, of which there
are many to pick from, I'd expect.

>Some 
>archives are browsable, but their search engines simply suck. e.g. 
>marc.theaimsgroup.com I think is the only one that archives 
>[EMAIL PROTECTED], but if you try to search for a perl string like 
>APR::Table::FETCH it won't find anything. If you search for
>get_dir_config it will split it into 'get', 'dir', 'config' and give you 
>a zillion matches when you know that there are just a few.

On swish you could say ":" and "_" are part of words and those would index
as full words.  Or simply search for the phrase "get_dir_config": swish
would split it and search for the phrase "get dir config", which would
probably find what you want.

Maybe : and _ are ok in words, but you have to think carefully about
others.  It's more flexible to split the words and use phrases in many cases.
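
Here is a toy illustration of that trade-off. It is plain Python, not
swish's actual tokenizer or configuration; tokenize() and phrase_match()
are names invented for the example.

```python
import re


def tokenize(text, extra_word_chars=""):
    """Split text into lowercase words.

    Letters and digits always count as word characters; anything listed in
    extra_word_chars is kept inside words as well.
    """
    pattern = "[A-Za-z0-9" + re.escape(extra_word_chars) + "]+"
    return [word.lower() for word in re.findall(pattern, text)]


def phrase_match(doc_words, phrase_words):
    """Return True if phrase_words occur consecutively in doc_words."""
    n = len(phrase_words)
    return any(
        doc_words[i:i + n] == phrase_words
        for i in range(len(doc_words) - n + 1)
    )


# Option 1: make ":" and "_" word characters, so the name is one word.
print(tokenize("APR::Table::FETCH", ":_"))   # ['apr::table::fetch']

# Option 2: split on them, and treat the query as a phrase.
doc = tokenize("call get_dir_config here")   # ['call', 'get', 'dir', 'config', 'here']
query = tokenize("get_dir_config")           # ['get', 'dir', 'config']
print(phrase_match(doc, query))              # True
print(phrase_match(tokenize("get the dir from config"), query))  # False

# Why extra word characters need care: ordinary punctuation sticks too.
print(tokenize("see: the docs", ":"))        # ['see:', 'the', 'docs']
```

The last line is the catch: once ":" is a word character, "see:" no longer
matches a search for "see".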



Bill Moseley
mailto:[EMAIL PROTECTED]
