Re: Guide search engine (was Re: multiple copies of a module)

2000-05-22 Thread Stas Bekman

On Wed, 17 May 2000, Jeremy Howard wrote:

 ...the perl.apache.org search facility
   *  Where is it? (doing a Find on the front page doesn't show it)
  
  At the bottom of all guide pages.
  
 How funny--I'd never even noticed it!
 
 I see that it's using 'Swish-E' http://sunsite.berkeley.edu/SWISH-E/. Stas--did 
you get that up and running? Can we tailor it for our needs?
 
 Here's an attempt at listing what I think we've decided we should aim for:
 - Allow restriction of search to just the guide
 - Allow searching of other documents through a popup selection (probably make the 
guide the default?)
 - Highlight found words
 - Try and index in a way that suits programmers, not English writers. e.g. include 
@, %, $, ::, in indexed words.
 
 Have I missed anything? (I'm ignoring the docbook issue for the moment since it's 
not directly related, and I guess it's really Stas' call anyhow.)

So far these are the engines that we are going to deploy:

1st search: Randy Kobes: swish engine + perl filters
http://theoryx5.uwinnipeg.ca/cgi-bin/guide-search

2nd search: Vivek Khera: nextrieve engine
http://thingy.kcilink.com/cgi-bin/modperlguide.cgi

Both more or less cover the demands from your wishlist and mine.

I'll link to these from the Guide. You are welcome to present other search
engines if you think you can build a better one. I promise to link to all of
them, provided you take responsibility for keeping up with updates. I'll
delete references to any search engine that doesn't refresh its indexed
version, as I did before with one quite good engine that fell half a year
behind, leaving users with a *very* outdated guide as a result.

 I would have thought the best bet would be to put it on the footer of every 
perl.apache.org page. A popup which allows selecting a subset of the site might 
default to either 'whole site' or 'mod_perl Guide', or maybe it changes its default 
to whatever part of the site is currently being viewed...
 
 The outstanding issues, I believe, are:
 - Who looks after the perl.apache.org search facility? Are they happy to expand its 
functionality as described?
 - What tool? Potential options so far are Swish-e, htdig, or custom Perl (perhaps 
based on Matt's engine). Any of these could be piped through a word-hilighting filter
 - What's the best 1st step? i.e. How can we get a simple search going quickly, while 
providing the foundation for a more complete system down the track?
 - Who's going to do the actual work? As I've mentioned, if a machine is required, 
I'm happy to provide it. However, I don't have the experience in this area to lead 
the work--although of course I'll contribute where I can! It would be nice to get a 
private mailing list going to avoid filling up this list too much more.
 
 Anyone who's interested in getting involved, email me, and I'll ensure that I add 
your name to the list. You don't have to be a programming guru, of course... there's 
always plenty of ways to get involved in these things.

Well, things are just happening. Vivek and Randy have already created their
versions, presented them on the list, received feedback, made
corrections and got the engines working. As I've mentioned above, you are
very welcome to beat their achievement and build an even better engine :)

P.S. Asking "who is going to do that" is a bad idea on this list... I'm
not bitching, just stating a fact. If you want something done,
either ask for help or do it yourself.


_
Stas Bekman  JAm_pH --   Just Another mod_perl Hacker
http://stason.org/   mod_perl Guide  http://perl.apache.org/guide 
mailto:[EMAIL PROTECTED]   http://perl.org http://stason.org/TULARC
http://singlesheaven.com http://perlmonth.com http://sourcegarden.org




Re: Guide search engine (was Re: multiple copies of a module)

2000-05-17 Thread Matt Sergeant

BTW: Your email client is broken and not wrapping words.

On Wed, 17 May 2000, Jeremy Howard wrote:

 Stas Bekman wrote:
  Hold on, at this very moment a few mod_perl fellas are working on having a
  good search engine for the guide. Just give it some more time, I'm trying
  to bring the best so it'll take a while...

 I'm glad you brought this up again. Since I mentioned I'd be happy to
 host such a thing, and asked for suggestions, I've got a total of one
 (from Stas--thanks!). That suggestion was to use ht://dig
 http://www.htdig.org/.

While htdig is a reasonable engine, Stas's idea is that this needs to be
"guide specific". What that means I'm not sure, but I assume it means
picking out only certain words to index...

 Has anyone got a search engine up and running that they're happy with?

I just wrote a very simple SQL based engine - so I would say I'm happy
with that. It's fast and it's all in perl. I could very simply rip out the
search parts of the code for someone to play with if they wanted to.

 Stas has made the good point that it needs to be able to hilight found
 words, since the pages are quite large. If anyone has a chance to do a
 bit of research about (free) search engines, I'd really appreciate it
 if you could let me know what you find out. It'd be nice publicity if
 it was mod_perl based, I guess, but it doesn't really matter.

I think word highlighting is overrated. It's only necessary in this case
because the guide is so damn huge now. The size problem could be
eliminated by making the guide split itself up into smaller sections. My
proposal would be to do that by converting the guide to docbookXML and use
AxKit to display the resulting docbook pages. The AxKit docbook
stylesheets are nice and friendly, and written in Perl, not some obscure
XML stylesheet language. And after all that, it would make converting the
guide to a format O'Reilly likes to publish (i.e. docbook), trivial.

 My only concern is that it seems a little odd to keep this just to the
 Guide. Wouldn't it be useful for the rest of perl.apache.org? I
 wouldn't have thought it's much extra work to add a drop-down box to
 search specific areas of the site (the Guide being one)...

perl.apache.org already has a search engine.

 If there's a good reason to have the Guide's search engine separate to
 the rest of perl.apache.org, should it have a separate domain
 (modperlguide.org?, guide.perl.apache.org?)?

guide.modperl.org ?

-- 
Matt/

Fastnet Software Ltd. High Performance Web Specialists
Providing mod_perl, XML, Sybase and Oracle solutions
Email for training and consultancy availability.
http://sergeant.org http://xml.sergeant.org




Re: Guide search engine (was Re: multiple copies of a module)

2000-05-17 Thread Jay J

Jeremy Howard wrote:

 I'm glad you brought this up again. Since I mentioned I'd be happy to
 host such a thing, and asked for suggestions, I've got
 a total of one (from Stas--thanks!). That suggestion was to use
 ht://dig http://www.htdig.org/.

 Has anyone got a search engine up and running that they're happy with?
 Stas has made the good point that it needs to be able
 to hilight found words, since the pages are quite large. If anyone has
 a chance to do a bit of research about (free) search
 engines, I'd really appreciate it if you could let me know what you
 find out. It'd be nice publicity if it was mod_perl based,
 I guess, but it doesn't really matter.

I'm happy with ht://dig, I use it mainly for looking up docs I've
squirreled away in /manual. (instead of grep)

It's been a while since I've been to htdig.org but I did grab a tarball
recently, so I'm fairly confident there *isn't* an existing mod_perl
wrapper -- but maybe there should be.

There are a number of perl scripts in the distribution, and I *thought*
there was a plain Perl wrapper, but I could be mistaken.

I think a mod_perl frontend/wrapper could work well, that is, htsearch
is about 900K+ and takes a moment to fire up (on my box anyway) -- how
much worse could it be?

OTOH, one *could* (conceivably) get crazy and access the DBs directly
and maybe XS any needed portions of htsearch (ambitious :-). However,
this still leaves htdig, htfuzzy, htmerge, etc .. to handle the
indexing.

As far as highlighting, I have a piece of code I'm using -- we could use
it as a starting point. Downside is it uses $` $' (it can probably be
tweaked to avoid this), but it handles the critical stuff like skipping
keywords within hrefs/tags, etc.
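[Editorial sketch] The code mentioned above was not posted to the list, so
what follows is only an illustration of the idea of highlighting found words
while skipping anything inside tags. It is written in Python rather than
Perl, and the function name and the choice of <b> as the highlight markup
are assumptions. It also assumes plain alphanumeric search words.

```python
import re

def highlight(html, words):
    """Wrap each search word in <b>...</b>, leaving tags (and hrefs) alone."""
    if not words:
        return html
    word_re = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )
    # Splitting on a captured group keeps the tags in the result list:
    # even indexes are text between tags, odd indexes are the tags themselves.
    parts = re.split(r"(<[^>]*>)", html)
    out = []
    for i, part in enumerate(parts):
        if i % 2 == 1:
            out.append(part)  # a tag: copy through untouched
        else:
            out.append(word_re.sub(r"<b>\1</b>", part))
    return "".join(out)
```

Because only the text between tags is rewritten, a search word that also
appears in a link target is left intact there.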

RE: Matt Sergeant -- Perhaps highlighting is overrated, but it usually
doesn't hurt. I too have a proprietary search facility, and an inverted
indexing prototype (stores packed doc-id integers in MySQL, for example)
-- but a great deal of work has gone into ht://dig ..
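[Editorial sketch] The prototype mentioned above was not posted, so this is
only an illustration of what an inverted index with packed doc-id integers
can look like. It is in Python, keeps the index in a dict instead of MySQL,
and splits on whitespace; all of those are assumptions. In a database the
packed bytes would presumably live in a BLOB column keyed by word.

```python
import struct
from collections import defaultdict

def build_index(docs):
    """docs maps an integer doc id to its text.
    Returns a dict mapping each word to its sorted doc ids,
    packed as little-endian unsigned 32-bit integers."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            postings[word].add(doc_id)
    return {
        word: struct.pack(f"<{len(ids)}I", *sorted(ids))
        for word, ids in postings.items()
    }

def lookup(index, word):
    """Unpack the doc ids stored for a word (empty list if absent)."""
    blob = index.get(word.lower(), b"")
    return list(struct.unpack(f"<{len(blob) // 4}I", blob))
```

Each posting costs four bytes, so a word found in two documents is stored
as an eight-byte string.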


 My only concern is that it seems a little odd to keep this just to the
 Guide. Wouldn't it be useful for the rest of
 perl.apache.org? I wouldn't have thought it's much extra work to add a
 drop-down box to search specific areas of the site
 (the Guide being one)...

I'd have to agree there.


 If there's a good reason to have the Guide's search engine separate to
 the rest of perl.apache.org, should it have a
 separate domain (modperlguide.org?, guide.perl.apache.org?)?
 
 --
 Jeremy Howard
 [EMAIL PROTECTED]

ht://dig allows for the param 'restrict' = /to_this_directory .. which
might be useful for separating things.

Count me in, whatever we choose.

-Jay J

# use Text::Wrapper;



Re: Guide search engine (was Re: multiple copies of a module)

2000-05-17 Thread Keith G. Murphy

Jeremy Howard wrote:
 
 I'm glad you brought this up again. Since I mentioned I'd be happy to host such a 
thing, and asked for suggestions, I've got a total of one (from Stas--thanks!). That 
suggestion was to use ht://dig http://www.htdig.org/.
 
 Has anyone got a search engine up and running that they're happy with? Stas has made 
the good point that it needs to be able to hilight found words, since the pages are 
quite large. If anyone has a chance to do a bit of research about (free) search 
engines, I'd really appreciate it if you could let me know what you find out. It'd be 
nice publicity if it was mod_perl based, I guess, but it doesn't really matter.
 
 
I know this is absolute anathema, considering you guys are developers,
but...

Have you looked at www.atomz.com, at least as a temporary solution?  (A
free service for sites with fewer than 500 pages).  Basically, the
search brings up their page, but you can customize it to look just like
one of yours.  It truly is fast (as hell) and flexible, and it does
highlight found words.  Even does soundalikes in the absence of other
matches.  The result page will show their logo, though it's rather
unobtrusive.

(The biggest drawback, as a long-term solution, is that if you change
the look of your pages, you have one more maintenance chore to do, in
that you have to go over to atomz.com and change your result page there
as well).

O'Reilly uses it, if that helps!  :-)

Try this:

http://search.atomz.com/search/?sp-a=0002078e-spsp-q=cgisp-k=Books

(Looks for O'Reilly books pages containing 'cgi').

Yeah, I know, I'd rather roll my own, too, given time...



Re: Guide search engine (was Re: multiple copies of a module)

2000-05-17 Thread Jeremy Howard

 BTW: Your email client is broken and not wrapping words.

I know--sorry. I'm fixing that this week. I'm just going through the RFCs to see 
exactly how to implement this right... (The email client is a web-based thing I've 
written in mod_perl--of course ;-)
 
 I just wrote a very simple SQL based engine - so I would say I'm happy
 with that. It's fast and it's all in perl. I could very simply rip out the
 search parts of the code for someone to play with if they wanted to.

Sounds good. Personally, I'd rather a simple engine we can fiddle with ourselves than 
a big system written in C. Does your engine generate a database from flat files? Is 
there some basic parameterisation (a 'stop list' for common words, definable 'keyword' 
characters, ...)?
 
 I think word highlighting is overrated. It's only necessary in this case
 because the guide is so damn huge now. The size problem could be
 eliminated by making the guide split itself up into smaller sections. My
 proposal would be to do that by converting the guide to docbookXML and use
 AxKit to display the resulting docbook pages. The AxKit docbook
 stylesheets are nice and friendly, and written in Perl, not some obscure
 XML stylesheet language. And after all that, it would make converting the
 guide to a format O'Reilly likes to publish (i.e. docbook), trivial.
 
Your word highlighting statement is, I suspect, controversial. On the other hand, 
converting to docbook is unlikely to meet much resistance from users--as long as Stas 
doesn't mind maintaining it!... To get the best of both worlds, why not simply chain 
the search engine result through a filter that does the highlighting? I bet someone's 
written such a filter already--anyone?

  My only concern is that it seems a little odd to keep this just to the
  Guide. Wouldn't it be useful for the rest of perl.apache.org? I
  wouldn't have thought it's much extra work to add a drop-down box to
  search specific areas of the site (the Guide being one)...
 
 perl.apache.org already has a search engine.
 
So I've heard, but:
*  Where is it? (doing a Find on the front page doesn't show it)
*  Does it do highlighting?
*  Can you select a subset of the site? (e.g. just the Guide)

  If there's a good reason to have the Guide's search engine separate to
  the rest of perl.apache.org, should it have a separate domain
  (modperlguide.org?, guide.perl.apache.org?)?
 
 guide.modperl.org ?
 
Looks like modperl.org is taken:

   Domain Name: MODPERL.ORG
   Registrar: NETWORK SOLUTIONS, INC.
   Whois Server: whois.networksolutions.com
   Referral URL: www.networksolutions.com
   Name Server: DNS2.BASCOM.COM
   Name Server: DNS.THAKKAR.NET
   Updated Date: 24-nov-1999

They're not using it though--maybe they would transfer it? Probably better to stick 
with the perl.apache.org domain though.

BTW, thanks to everyone who's already responded privately to my renewed request. Keep 
it up!


-- 
  Jeremy Howard
  [EMAIL PROTECTED]



Re: Guide search engine (was Re: multiple copies of a module)

2000-05-17 Thread Matt Sergeant

On Wed, 17 May 2000, Jeremy Howard wrote:

  I just wrote a very simple SQL based engine - so I would say I'm happy
  with that. It's fast and it's all in perl. I could very simply rip out the
  search parts of the code for someone to play with if they wanted to.

 Sounds good. Personally, I'd rather a simple engine we can fiddle with
 ourselves than a big system written in C. Does your engine generate a
 database from flat files? Is there some basic parameterisation (a
 'stop list' for common words, definable 'keyword' characters, ...)?

Well it's just perl, so there's a separate word tokenizer, a separate db
inserter and a separate searcher (which is split into query parser and SQL
builder). The db inserter is aware of "ignore words" which are stored in
the DB.
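[Editorial sketch] The engine's code was not posted, so the following only
illustrates the split described above: a tokenizer, a db inserter that
honours "ignore words" stored in the DB, and a searcher divided into a query
parser and an SQL builder. It uses Python and SQLite instead of Perl, and
the table names, column names and AND-only query semantics are assumptions.

```python
import re
import sqlite3

def tokenize(text):
    """Word tokenizer: lowercased runs of word characters."""
    return re.findall(r"\w+", text.lower())

def make_db(ignore_words):
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE ignore_words (word TEXT PRIMARY KEY)")
    db.execute("CREATE TABLE postings (word TEXT, doc TEXT)")
    db.executemany("INSERT INTO ignore_words VALUES (?)",
                   [(w,) for w in ignore_words])
    return db

def ignored_words(db):
    return {row[0] for row in db.execute("SELECT word FROM ignore_words")}

def insert_doc(db, doc, text):
    """DB inserter: index every distinct word except the ignore words."""
    words = set(tokenize(text)) - ignored_words(db)
    db.executemany("INSERT INTO postings VALUES (?, ?)",
                   [(w, doc) for w in sorted(words)])

def parse_query(query, ignored):
    """Query parser: distinct, non-ignored terms; all of them are required."""
    terms = []
    for word in tokenize(query):
        if word not in ignored and word not in terms:
            terms.append(word)
    return terms

def build_sql(terms):
    """SQL builder: documents that contain every term."""
    placeholders = ", ".join("?" for _ in terms)
    sql = (f"SELECT doc FROM postings WHERE word IN ({placeholders}) "
           "GROUP BY doc HAVING COUNT(DISTINCT word) = ? ORDER BY doc")
    return sql, terms + [len(terms)]

def search(db, query):
    terms = parse_query(query, ignored_words(db))
    if not terms:
        return []
    sql, params = build_sql(terms)
    return [row[0] for row in db.execute(sql, params)]
```

Keeping the query parser apart from the SQL builder means a richer query
syntax could be added later without touching the SQL.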

  I think word highlighting is overrated. It's only necessary in this case
  because the guide is so damn huge now. The size problem could be
  eliminated by making the guide split itself up into smaller sections. My
  proposal would be to do that by converting the guide to docbookXML and use
  AxKit to display the resulting docbook pages. The AxKit docbook
  stylesheets are nice and friendly, and written in Perl, not some obscure
  XML stylesheet language. And after all that, it would make converting the
  guide to a format O'Reilly likes to publish (i.e. docbook), trivial.

 Your word highlighting statement is, I suspect, controversial. On the
 other hand, converting to docbook is unlikely to meet much resistance
 from users--as long as Stas doesn't mind maintaining it!... To get the
 best of both worlds, why not simply chain the search engine result
 through a filter that does the highlighting. I bet someone's written
 such a filter already--anyone?

   My only concern is that it seems a little odd to keep this just to the
   Guide. Wouldn't it be useful for the rest of perl.apache.org? I
   wouldn't have thought it's much extra work to add a drop-down box to
   search specific areas of the site (the Guide being one)...
  
  perl.apache.org already has a search engine.
 
 So I've heard, but:
 *  Where is it? (doing a Find on the front page doesn't show it)

At the bottom of all guide pages.

 *  Does it do highlighting?

No.

 *  Can you select a subset of the site? (e.g. just the Guide)

No.

-- 
Matt/

Fastnet Software Ltd. High Performance Web Specialists
Providing mod_perl, XML, Sybase and Oracle solutions
Email for training and consultancy availability.
http://sergeant.org http://xml.sergeant.org




Re: Guide search engine (was Re: multiple copies of a module)

2000-05-17 Thread Robin Berjon

At 11:19 17/05/2000 -0500, Jeremy Howard wrote:
 Your word highlighting statement is, I suspect, controversial. On the other 
 hand, converting to docbook is unlikely to meet much resistance from 
 users--as long as Stas doesn't mind maintaining it!... To get the best of 
 both worlds, why not simply chain the search engine result through a filter 
 that does the highlighting. I bet someone's written such a filter 
 already--anyone?

I haven't played with it, but getting docbook out of the guide should be as
easy as using Pod::DocBook. Fwiw, there's also been some work done on
coming up with an xpod dtd, but I don't know how far it's advanced.



.Robin
To err is human, to purr feline.




Re: Guide search engine (was Re: multiple copies of a module)

2000-05-17 Thread Matt Sergeant

On Wed, 17 May 2000, Robin Berjon wrote:

 At 11:19 17/05/2000 -0500, Jeremy Howard wrote:
 Your word highlighting statement is, I suspect, controversial. On the other 
 hand, converting to docbook is unlikely to meet much resistance from 
 users--as long as Stas doesn't mind maintaining it!... To get the best of 
 both worlds, why not simply chain the search engine result through a filter 
 that does the highlighting. I bet someone's written such a filter 
 already--anyone?
 
 I haven't played with it, but getting docbook out of the guide should be as
 easy as using Pod::DocBook. Fwiw, there's also been some work done on
 coming up with an xpod dtd, but I don't know how far it's advanced.

I've played with Pod::DocBook, and it's a good start, but uses the DocBook
SGML DTD, so you can't process it with XML tools. It also doesn't support
=over =item =back, which is a pretty major limitation, IMHO. However
patching it to support that shouldn't be too hard.

-- 
Matt/

Fastnet Software Ltd. High Performance Web Specialists
Providing mod_perl, XML, Sybase and Oracle solutions
Email for training and consultancy availability.
http://sergeant.org http://xml.sergeant.org




Re: Guide search engine (was Re: multiple copies of a module)

2000-05-17 Thread Perrin Harkins

I know I'm late to this party, but I thought I'd point out a couple of
options:

- The Search::InvertedIndex module on CPAN (uses dbm files, I think).
- The DBIx::TextIndex module on CPAN (uses MySQL).
- The WAIT module on CPAN (uses dbm files).
- Glimpse: http://webglimpse.org/.
- Swish++: http://www.best.com/~pjl/software/swish/ (no, it's not the same
one apache.org is using).

I've also had great success with htdig.  Maybe I'll try spidering the
guide with it and see how it does.

- Perrin




Re: Guide search engine (was Re: multiple copies of a module)

2000-05-17 Thread Jeremy Howard

...the perl.apache.org search facility
  *  Where is it? (doing a Find on the front page doesn't show it)
 
 At the bottom of all guide pages.
 
How funny--I'd never even noticed it!

I see that it's using 'Swish-E' http://sunsite.berkeley.edu/SWISH-E/. Stas--did you 
get that up and running? Can we tailor it for our needs?

Here's an attempt at listing what I think we've decided we should aim for:
- Allow restriction of search to just the guide
- Allow searching of other documents through a popup selection (probably make the 
guide the default?)
- Highlight found words
- Try and index in a way that suits programmers, not English writers. e.g. include @, 
%, $, ::, in indexed words.
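[Editorial sketch] To illustrate the last wish, here is one way a tokenizer
could keep Perl sigils and '::' inside indexed words. It is written in
Python, and the pattern is an assumption made for illustration; it is not
how Swish-E or any engine named in this thread decides word characters.

```python
import re

# Assumed rule: a word may start with one Perl sigil ($, @ or %) and may
# contain '::' package separators, e.g. $r, %ENV, @INC, Apache::Registry.
TOKEN_RE = re.compile(r"[$@%]?[A-Za-z_]\w*(?:::\w+)*")

def tokenize(text):
    """Return programmer-style tokens in order, with case preserved."""
    return TOKEN_RE.findall(text)
```

Case is kept because Perl names are case-sensitive, so %ENV and %env would
be indexed as different words.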

Have I missed anything? (I'm ignoring the docbook issue for the moment since it's not 
directly related, and I guess it's really Stas' call anyhow.)

I would have thought the best bet would be to put it on the footer of every 
perl.apache.org page. A popup which allows selecting a subset of the site might 
default to either 'whole site' or 'mod_perl Guide', or maybe it changes its default 
to whatever part of the site is currently being viewed...

The outstanding issues, I believe, are:
- Who looks after the perl.apache.org search facility? Are they happy to expand its 
functionality as described?
- What tool? Potential options so far are Swish-e, htdig, or custom Perl (perhaps 
based on Matt's engine). Any of these could be piped through a word-hilighting filter
- What's the best 1st step? i.e. How can we get a simple search going quickly, while 
providing the foundation for a more complete system down the track?
- Who's going to do the actual work? As I've mentioned, if a machine is required, I'm 
happy to provide it. However, I don't have the experience in this area to lead the 
work--although of course I'll contribute where I can! It would be nice to get a 
private mailing list going to avoid filling up this list too much more.

Anyone who's interested in getting involved, email me, and I'll ensure that I add 
your name to the list. You don't have to be a programming guru, of course... there's 
always plenty of ways to get involved in these things.

-- 
  Jeremy Howard
  [EMAIL PROTECTED]