Re: Guide search engine (was Re: multiple copies of a module)
On Wed, 17 May 2000, Jeremy Howard wrote:

> > > ...the perl.apache.org search facility
> > > * Where is it? (doing a Find on the front page doesn't show it)
> >
> > At the bottom of all guide pages.
>
> How funny--I'd never even noticed it! I see that it's using 'Swish-E' http://sunsite.berkeley.edu/SWISH-E/. Stas--did you get that up and running? Can we tailor it for our needs?
>
> Here's an attempt at listing what I think we've decided we should aim for:
> - Allow restriction of search to just the guide
> - Allow searching of other documents through a popup selection (probably make the guide the default?)
> - Highlight found words
> - Try and index in a way that suits programmers, not English writers, e.g. include @, %, $, ::, in indexed words.
>
> Have I missed anything? (I'm ignoring the docbook issue for the moment since it's not directly related, and I guess it's really Stas' call anyhow.)

So far these are the engines that we are going to deploy:

1st search: Randy Kobes: swish engine + perl filters
http://theoryx5.uwinnipeg.ca/cgi-bin/guide-search

2nd search: Vivek Khera: nextrieve engine
http://thingy.kcilink.com/cgi-bin/modperlguide.cgi

Both more or less cover the demands from your wishlist and mine. I'll link to these from the Guide.

You are welcome to present other search engines if you think you can get a better one. I promise to link to all of them, assuming that you will take the responsibility to keep up with updates. I'll delete all references to search engines that do not update their indexed version, as I did before with a quite good search engine that didn't keep up with updates and had a half-year-old version, so users were using a *very* outdated guide as a result.

> I would have thought the best bet would be to put it on the footer of every perl.apache.org page. A popup which allows selecting a subset of the site might default to either 'whole site' or 'mod_perl Guide', or maybe it changes its default to whatever part of the site is currently being viewed...
>
> The outstanding issues, I believe, are:
> - Who looks after the perl.apache.org search facility? Are they happy to expand its functionality as described?
> - What tool? Potential options so far are Swish-e, htdig, or custom Perl (perhaps based on Matt's engine). Any of these could be piped through a word-hilighting filter
> - What's the best 1st step? i.e. How can we get a simple search going quickly, while providing the foundation for a more complete system down the track?
> - Who's going to do the actual work?
>
> As I've mentioned, if a machine is required, I'm happy to provide it. However, I don't have the experience in this area to lead the work--although of course I'll contribute where I can!
>
> It would be nice to get a private mailing list going to avoid filling up this list too much more. Anyone who's interested in getting involved, email me, and I'll ensure that I add your name to the list. You don't have to be a programming guru, of course... there's always plenty of ways to get involved in these things.

Well, things are just happening. Vivek and Randy already created their versions and presented them on the list, received feedback, made corrections and have the engines working. As I've mentioned above you are very welcome to beat their achievement and get an even better engine :)

P.S. Asking "who is going to do that" is a bad idea on this list... I'm not bitching, just stating the fact. If you want something to be done either ask for help or do it yourself.

_
Stas Bekman JAm_pH -- Just Another mod_perl Hacker
http://stason.org/ mod_perl Guide http://perl.apache.org/guide
mailto:[EMAIL PROTECTED] http://perl.org http://stason.org/TULARC
http://singlesheaven.com http://perlmonth.com http://sourcegarden.org
Re: Guide search engine (was Re: multiple copies of a module)
BTW: Your email client is broken and not wrapping words.

On Wed, 17 May 2000, Jeremy Howard wrote:

> Stas Bekman wrote:
> > Hold on, at this very moment a few mod_perl fellas are working on having a good search engine for the guide. Just give it some more time, I'm trying to bring the best so it'll take a while...
>
> I'm glad you brought this up again. Since I mentioned I'd be happy to host such a thing, and asked for suggestions, I've got a total of one (from Stas--thanks!). That suggestion was to use ht://dig http://www.htdig.org/.

While htdig is a reasonable engine, Stas's idea is that this needs to be "guide specific". Meaning what, I'm not sure, but I'm assuming it means to pick out only certain words to index...

> Has anyone got a search engine up and running that they're happy with?

I just wrote a very simple SQL based engine - so I would say I'm happy with that. It's fast and it's all in perl. I could very simply rip out the search parts of the code for someone to play with if they wanted to.

> Stas has made the good point that it needs to be able to hilight found words, since the pages are quite large. If anyone has a chance to do a bit of research about (free) search engines, I'd really appreciate it if you could let me know what you find out. It'd be nice publicity if it was mod_perl based, I guess, but it doesn't really matter.

I think word highlighting is overrated. It's only necessary in this case because the guide is so damn huge now. The size problem could be eliminated by making the guide split itself up into smaller sections. My proposal would be to do that by converting the guide to docbookXML and use AxKit to display the resulting docbook pages. The AxKit docbook stylesheets are nice and friendly, and written in Perl, not some obscure XML stylesheet language. And after all that, it would make converting the guide to a format O'Reilly likes to publish (i.e. docbook), trivial.

> My only concern is that it seems a little odd to keep this just to the Guide. Wouldn't it be useful for the rest of perl.apache.org? I wouldn't have thought it's much extra work to add a drop-down box to search specific areas of the site (the Guide being one)...

perl.apache.org already has a search engine.

> If there's a good reason to have the Guide's search engine separate to the rest of perl.apache.org, should it have a separate domain (modperlguide.org?, guide.perl.apache.org?)?

guide.modperl.org ?

--
Matt/

Fastnet Software Ltd. High Performance Web Specialists
Providing mod_perl, XML, Sybase and Oracle solutions
Email for training and consultancy availability.
http://sergeant.org http://xml.sergeant.org
Re: Guide search engine (was Re: multiple copies of a module)
Jeremy Howard wrote:

> I'm glad you brought this up again. Since I mentioned I'd be happy to host such a thing, and asked for suggestions, I've got a total of one (from Stas--thanks!). That suggestion was to use ht://dig http://www.htdig.org/.
>
> Has anyone got a search engine up and running that they're happy with? Stas has made the good point that it needs to be able to hilight found words, since the pages are quite large. If anyone has a chance to do a bit of research about (free) search engines, I'd really appreciate it if you could let me know what you find out. It'd be nice publicity if it was mod_perl based, I guess, but it doesn't really matter.

I'm happy with ht://dig. I use it mainly for looking up docs I've squirreled away in /manual (instead of grep).

It's been a while since I've been to htdig.org but I did grab a tarball recently, so I'm fairly confident there isn't* an existing mod_perl wrapper -- but maybe there should be. There are a number of perl scripts in the distribution, and I thought* there was a plain Perl wrapper, but I could be mistaken.

I think a mod_perl frontend/wrapper could work well, that is, htsearch is about 900K+ and takes a moment to fire up (on my box anyway) -- how much worse could it be? OTOH, one could* (conceivably) get crazy and access the DB's directly and maybe XS any needed portions of htsearch (ambitious :-). However, this still leaves htdig, htfuzzy, htmerge, etc .. to handle the indexing.

As far as highlighting, I have a piece of code I'm using -- we could use it as a starting point. Downside is it uses $` $' (it can probably be tweaked to avoid this), but it handles the critical stuff like skipping keywords within href's/tags, etc.

RE: Matt Sergeant -- Perhaps highlighting is overrated, but it usually doesn't hurt. I too have a proprietary search facility, and an inverted indexing prototype (stores packed doc-id integers in MySQL, for example) -- but a great deal of work has gone into ht://dig ..

> My only concern is that it seems a little odd to keep this just to the Guide. Wouldn't it be useful for the rest of perl.apache.org? I wouldn't have thought it's much extra work to add a drop-down box to search specific areas of the site (the Guide being one)...

I'd have to agree there.

> If there's a good reason to have the Guide's search engine separate to the rest of perl.apache.org, should it have a separate domain (modperlguide.org?, guide.perl.apache.org?)?
>
> --
> Jeremy Howard
> [EMAIL PROTECTED]

ht://dig allows for the param 'restrict' = /to_this_directory .. which might be useful for separating things.

Count me in, whatever we choose.

-Jay J
# use Text::Wrapper;
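Jay's point about a highlighter that skips keywords inside tags and href's can be sketched roughly as below. This is not Jay's code (which is Perl and is not shown in the thread); it is a minimal Python illustration, and the function name `highlight` and the `<b>` wrapper are assumptions made for the example. It treats anything between `<` and `>` as a tag, which is only an approximation of real HTML parsing.

```python
import re

def highlight(html, words, open_tag="<b>", close_tag="</b>"):
    """Wrap whole-word matches in the text of `html`, leaving tags untouched.

    Hypothetical helper for illustration; plain words only (no sigils).
    """
    if not words:
        return html
    word_re = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )
    # With one capturing group, re.split alternates text, tag, text, tag, ...
    parts = re.split(r"(<[^>]*>)", html)
    out = []
    for i, part in enumerate(parts):
        if i % 2 == 1:
            out.append(part)  # a tag: copy through so href values stay intact
        else:
            out.append(
                word_re.sub(lambda m: open_tag + m.group(0) + close_tag, part)
            )
    return "".join(out)
```

Splitting on tags first means the word pattern never sees attribute values, so a keyword that also appears in a link target is highlighted only in the visible text.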
Re: Guide search engine (was Re: multiple copies of a module)
Jeremy Howard wrote:

> I'm glad you brought this up again. Since I mentioned I'd be happy to host such a thing, and asked for suggestions, I've got a total of one (from Stas--thanks!). That suggestion was to use ht://dig http://www.htdig.org/.
>
> Has anyone got a search engine up and running that they're happy with? Stas has made the good point that it needs to be able to hilight found words, since the pages are quite large. If anyone has a chance to do a bit of research about (free) search engines, I'd really appreciate it if you could let me know what you find out. It'd be nice publicity if it was mod_perl based, I guess, but it doesn't really matter.

I know this is absolute anathema, considering you guys are developers, but...

Have you looked at www.atomz.com, at least as a temporary solution? (A free service for sites with fewer than 500 pages).

Basically, the search brings up their page, but you can customize it to look just like one of yours. It truly is fast (as hell) and flexible, and it does highlight found words. Even does soundalikes in the absence of other matches. The result page will show their logo, though, but it's rather unobtrusive. (The biggest drawback, as a long-term solution, is that if you change the look of your pages, you have one more maintenance chore to do, in that you have to go over to atomz.com and change your result page there as well).

O'Reilly uses it, if that helps! :-) Try this:

http://search.atomz.com/search/?sp-a=0002078e-spsp-q=cgisp-k=Books

(Looks for O'Reilly books pages containing 'cgi').

Yeah, I know, I'd rather roll my own, too, given time...
Re: Guide search engine (was Re: multiple copies of a module)
> BTW: Your email client is broken and not wrapping words.

I know--sorry. I'm fixing that this week. I'm just going through the RFCs to see exactly how to implement this right... (The email client is a web-based thing I've written in mod_perl--of course ;-)

> I just wrote a very simple SQL based engine - so I would say I'm happy with that. It's fast and it's all in perl. I could very simply rip out the search parts of the code for someone to play with if they wanted to.

Sounds good. Personally, I'd rather a simple engine we can fiddle with ourselves than a big system written in C. Does your engine generate a database from flat files? Is there some basic parameterisation (a 'stop list' for common words, definable 'keyword' characters, ...)?

> I think word highlighting is overrated. It's only necessary in this case because the guide is so damn huge now. The size problem could be eliminated by making the guide split itself up into smaller sections. My proposal would be to do that by converting the guide to docbookXML and use AxKit to display the resulting docbook pages. The AxKit docbook stylesheets are nice and friendly, and written in Perl, not some obscure XML stylesheet language. And after all that, it would make converting the guide to a format O'Reilly likes to publish (i.e. docbook), trivial.

Your word highlighting statement is, I suspect, controversial. On the other hand, converting to docbook is unlikely to meet much resistance from users--as long as Stas doesn't mind maintaining it!... To get the best of both worlds, why not simply chain the search engine result through a filter that does the highlighting. I bet someone's written such a filter already--anyone?

> > My only concern is that it seems a little odd to keep this just to the Guide. Wouldn't it be useful for the rest of perl.apache.org? I wouldn't have thought it's much extra work to add a drop-down box to search specific areas of the site (the Guide being one)...
>
> perl.apache.org already has a search engine.

So I've heard, but:
* Where is it? (doing a Find on the front page doesn't show it)
* Does it do highlighting?
* Can you select a subset of the site? (e.g. just the Guide)

> > If there's a good reason to have the Guide's search engine separate to the rest of perl.apache.org, should it have a separate domain (modperlguide.org?, guide.perl.apache.org?)?
>
> guide.modperl.org ?

Looks like modperl.org is taken:

   Domain Name: MODPERL.ORG
   Registrar: NETWORK SOLUTIONS, INC.
   Whois Server: whois.networksolutions.com
   Referral URL: www.networksolutions.com
   Name Server: DNS2.BASCOM.COM
   Name Server: DNS.THAKKAR.NET
   Updated Date: 24-nov-1999

They're not using it though--maybe they would transfer? Probably better to stick in the perl.apache.org domain though.

BTW, thanks to everyone who's already responded privately to my renewed request. Keep it up!

--
Jeremy Howard
[EMAIL PROTECTED]
Re: Guide search engine (was Re: multiple copies of a module)
On Wed, 17 May 2000, Jeremy Howard wrote:

> > I just wrote a very simple SQL based engine - so I would say I'm happy with that. It's fast and it's all in perl. I could very simply rip out the search parts of the code for someone to play with if they wanted to.
>
> Sounds good. Personally, I'd rather a simple engine we can fiddle with ourselves than a big system written in C. Does your engine generate a database from flat files? Is there some basic parameterisation (a 'stop list' for common words, definable 'keyword' characters, ...)?

Well it's just perl, so there's a separate word tokenizer, a separate db inserter and a separate searcher (which is split into query parser and SQL builder). The db inserter is aware of "ignore words" which are stored in the DB.

> > I think word highlighting is overrated. It's only necessary in this case because the guide is so damn huge now. The size problem could be eliminated by making the guide split itself up into smaller sections. My proposal would be to do that by converting the guide to docbookXML and use AxKit to display the resulting docbook pages. The AxKit docbook stylesheets are nice and friendly, and written in Perl, not some obscure XML stylesheet language. And after all that, it would make converting the guide to a format O'Reilly likes to publish (i.e. docbook), trivial.
>
> Your word highlighting statement is, I suspect, controversial. On the other hand, converting to docbook is unlikely to meet much resistance from users--as long as Stas doesn't mind maintaining it!... To get the best of both worlds, why not simply chain the search engine result through a filter that does the highlighting. I bet someone's written such a filter already--anyone?
>
> > > My only concern is that it seems a little odd to keep this just to the Guide. Wouldn't it be useful for the rest of perl.apache.org? I wouldn't have thought it's much extra work to add a drop-down box to search specific areas of the site (the Guide being one)...
> >
> > perl.apache.org already has a search engine.
>
> So I've heard, but:
> * Where is it? (doing a Find on the front page doesn't show it)

At the bottom of all guide pages.

> * Does it do highlighting?

No.

> * Can you select a subset of the site? (e.g. just the Guide)

No.

--
Matt/

Fastnet Software Ltd. High Performance Web Specialists
Providing mod_perl, XML, Sybase and Oracle solutions
Email for training and consultancy availability.
http://sergeant.org http://xml.sergeant.org
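The layout Matt describes (a tokenizer, a db inserter that knows about "ignore words" stored in the DB, and a searcher split into a query parser and an SQL builder) might look something like the sketch below. This is not Matt's engine, which is Perl and is not shown in the thread; it is a minimal Python illustration using SQLite, and the table names, column names and function names are all assumptions made for the example.

```python
import re
import sqlite3

def tokenize(text):
    """Word tokenizer: lowercase runs of letters, digits and underscores."""
    return re.findall(r"[a-z0-9_]+", text.lower())

def open_index():
    """Create an in-memory index (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE ignore_words (word TEXT PRIMARY KEY);
        CREATE TABLE postings (word TEXT, doc TEXT, PRIMARY KEY (word, doc));
    """)
    return db

def ignored_words(db):
    return {row[0] for row in db.execute("SELECT word FROM ignore_words")}

def add_document(db, doc, text):
    """DB inserter: store one (word, doc) row per distinct non-ignored word."""
    ignored = ignored_words(db)
    rows = {(w, doc) for w in tokenize(text) if w not in ignored}
    db.executemany("INSERT OR IGNORE INTO postings VALUES (?, ?)", rows)

def build_query(terms):
    """SQL builder: documents that contain every one of the given terms."""
    placeholders = ",".join("?" * len(terms))
    sql = (
        f"SELECT doc FROM postings WHERE word IN ({placeholders}) "
        "GROUP BY doc HAVING COUNT(DISTINCT word) = ? ORDER BY doc"
    )
    return sql, [*terms, len(terms)]

def search(db, query):
    """Query parser: tokenize, drop ignore words, then run the built SQL."""
    ignored = ignored_words(db)
    terms = sorted({w for w in tokenize(query) if w not in ignored})
    if not terms:
        return []
    sql, params = build_query(terms)
    return [row[0] for row in db.execute(sql, params)]
```

For example, after inserting 'the' into ignore_words and indexing two pages, search(db, "the mod_perl") matches on "mod_perl" alone, because the ignore word is dropped from the query as well as from the index.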
Re: Guide search engine (was Re: multiple copies of a module)
At 11:19 17/05/2000 -0500, Jeremy Howard wrote:

> Your word highlighting statement is, I suspect, controversial. On the other hand, converting to docbook is unlikely to meet much resistance from users--as long as Stas doesn't mind maintaining it!... To get the best of both worlds, why not simply chain the search engine result through a filter that does the highlighting. I bet someone's written such a filter already--anyone?

I haven't played with it, but getting docbook out of the guide should be as easy as using Pod::DocBook. Fwiw, there's also been some work done on coming up with an xpod dtd, but I don't know how far it's advanced.

.Robin
To err is human, to purr feline.
Re: Guide search engine (was Re: multiple copies of a module)
On Wed, 17 May 2000, Robin Berjon wrote:

> At 11:19 17/05/2000 -0500, Jeremy Howard wrote:
> > Your word highlighting statement is, I suspect, controversial. On the other hand, converting to docbook is unlikely to meet much resistance from users--as long as Stas doesn't mind maintaining it!... To get the best of both worlds, why not simply chain the search engine result through a filter that does the highlighting. I bet someone's written such a filter already--anyone?
>
> I haven't played with it, but getting docbook out of the guide should be as easy as using Pod::DocBook. Fwiw, there's also been some work done on coming up with an xpod dtd, but I don't know how far it's advanced.

I've played with Pod::DocBook, and it's a good start, but uses the DocBook SGML DTD, so you can't process it with XML tools. It also doesn't support =over =item =back, which is a pretty major limitation, IMHO. However patching it to support that shouldn't be too hard.

--
Matt/

Fastnet Software Ltd. High Performance Web Specialists
Providing mod_perl, XML, Sybase and Oracle solutions
Email for training and consultancy availability.
http://sergeant.org http://xml.sergeant.org
Re: Guide search engine (was Re: multiple copies of a module)
I know I'm late to this party, but I thought I'd point out a couple of options:

- The Search::InvertedIndex module on CPAN (uses dbm files, I think).
- The DBIx::TextIndex module on CPAN (uses MySQL).
- The WAIT module on CPAN (uses dbm files).
- Glimpse: http://webglimpse.org/.
- Swish++: http://www.best.com/~pjl/software/swish/ (no, it's not the same one apache.org is using).

I've also had great success with htdig. Maybe I'll try spidering the guide with it and see how it does.

- Perrin
Re: Guide search engine (was Re: multiple copies of a module)
> > ...the perl.apache.org search facility
> > * Where is it? (doing a Find on the front page doesn't show it)
>
> At the bottom of all guide pages.

How funny--I'd never even noticed it! I see that it's using 'Swish-E' http://sunsite.berkeley.edu/SWISH-E/. Stas--did you get that up and running? Can we tailor it for our needs?

Here's an attempt at listing what I think we've decided we should aim for:
- Allow restriction of search to just the guide
- Allow searching of other documents through a popup selection (probably make the guide the default?)
- Highlight found words
- Try and index in a way that suits programmers, not English writers, e.g. include @, %, $, ::, in indexed words.

Have I missed anything? (I'm ignoring the docbook issue for the moment since it's not directly related, and I guess it's really Stas' call anyhow.)

I would have thought the best bet would be to put it on the footer of every perl.apache.org page. A popup which allows selecting a subset of the site might default to either 'whole site' or 'mod_perl Guide', or maybe it changes its default to whatever part of the site is currently being viewed...

The outstanding issues, I believe, are:
- Who looks after the perl.apache.org search facility? Are they happy to expand its functionality as described?
- What tool? Potential options so far are Swish-e, htdig, or custom Perl (perhaps based on Matt's engine). Any of these could be piped through a word-hilighting filter
- What's the best 1st step? i.e. How can we get a simple search going quickly, while providing the foundation for a more complete system down the track?
- Who's going to do the actual work?

As I've mentioned, if a machine is required, I'm happy to provide it. However, I don't have the experience in this area to lead the work--although of course I'll contribute where I can!

It would be nice to get a private mailing list going to avoid filling up this list too much more. Anyone who's interested in getting involved, email me, and I'll ensure that I add your name to the list. You don't have to be a programming guru, of course... there's always plenty of ways to get involved in these things.

--
Jeremy Howard
[EMAIL PROTECTED]
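The wishlist item about indexing "in a way that suits programmers" (keeping @, %, $ and :: inside indexed words) could be approached with a tokenizer along these lines. This is only a sketch in Python, not taken from any of the engines discussed; the pattern and the name `tokenize` are assumptions, and it keeps case as written since Perl identifiers are case-sensitive.

```python
import re

# Assumed token shape: an optional Perl sigil, an identifier, and any
# number of ::-separated package parts (e.g. $r, @lines, Apache::Registry).
TOKEN_RE = re.compile(
    r"[$@%]?[A-Za-z_][A-Za-z0-9_]*(?:::[A-Za-z_][A-Za-z0-9_]*)*"
)

def tokenize(text):
    """Return index terms, keeping sigils and '::' as part of each term."""
    return [m.group(0) for m in TOKEN_RE.finditer(text)]
```

With this, a search for "%ENV" or "Apache::Registry" can be matched as a single term rather than being split at the punctuation, which is what a tokenizer tuned for English prose would do.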