On Wed, 28 Feb 2001, Roee Rubin wrote:

> Hello,
> 
> I have to write a search engine for a website which consists of mainly
> static HTML pages.
> 
> There are a number of packages that provide this, such as Text::Search and
> DBIx::FullTextSearch, but I would like to get some input before I begin.
>

I'm currently writing a search engine myself. It sounds like you plan to
search the content of the pages each time a search request is received.
The web sites I plan to use my search engine on have too many pages, and
files that are too large, to search that way in a reasonable amount of
time. My search engine is based on a fully inverted word index, i.e. for
each unique word on the website there is a pointer to each page that
contains that word. To find word matches, I do a binary search on the
word index. It's really fast, since a lookup needs only about log2(N)
comparisons for N unique words.

I've not yet decided whether I will provide the ability to search for a
phrase. If I do, I'll do an AND search on each non-noise word of the
phrase to get a list of the files that _may_ contain the phrase, then
search the content of those pages for the phrase itself. I'll probably
limit the phrase search to .txt and .html pages, since the PDF text
extractor I'm using tends to produce word strings out of order. That is
not a problem when I'm simply extracting all the words from a .pdf
document, but it would be a problem when trying to match the multiple
words of a phrase in order.
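
A minimal sketch of the approach described above, written in Python
rather than Perl purely for illustration. It is not the actual engine:
the function names, the tokenizer, and the NOISE_WORDS list are all
assumptions made up for this example. It builds a sorted word list with
a parallel list of pages, looks words up by binary search, and handles a
phrase by ANDing the non-noise words and then checking the candidate
pages' text for the words in order.

```python
import bisect
import re

# Hypothetical noise-word list; a real engine would use a longer one.
NOISE_WORDS = {"a", "an", "and", "in", "of", "the", "to"}

def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """pages maps a page name to its text.

    Returns (words, postings): words is a sorted list of unique words,
    and postings[i] is the sorted list of pages containing words[i].
    """
    word_pages = {}
    for name, text in pages.items():
        for word in tokenize(text):
            word_pages.setdefault(word, set()).add(name)
    words = sorted(word_pages)
    postings = [sorted(word_pages[w]) for w in words]
    return words, postings

def lookup(index, word):
    """Binary-search the sorted word list; return the set of matching pages."""
    words, postings = index
    i = bisect.bisect_left(words, word)
    if i < len(words) and words[i] == word:
        return set(postings[i])
    return set()

def phrase_search(index, pages, phrase):
    """AND the non-noise words, then verify word order in each candidate."""
    tokens = tokenize(phrase)
    keywords = [t for t in tokens if t not in NOISE_WORDS]
    if not keywords:
        return []
    # Pages that contain every keyword _may_ contain the phrase.
    candidates = set.intersection(*(lookup(index, w) for w in keywords))
    # Pad with spaces so a match cannot start or end inside a word.
    needle = " " + " ".join(tokens) + " "
    hits = []
    for name in sorted(candidates):
        haystack = " " + " ".join(tokenize(pages[name])) + " "
        if needle in haystack:
            hits.append(name)
    return hits
```

The second pass re-reads only the candidate pages, so the cost of the
content scan depends on how many pages survive the AND step rather than
on the size of the whole site. A page whose extracted words come out of
order would pass the AND step but fail the order check, which is the
reason given above for leaving .pdf files out of phrase search.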

**** [EMAIL PROTECTED] <Carl Jolley> 
**** All opinions are my own and not necessarily those of my employer ****

_______________________________________________
Perl-Win32-Web mailing list
[EMAIL PROTECTED]
http://listserv.ActiveState.com/mailman/listinfo/perl-win32-web
