Re: [htdig3-dev] Shared libs

Geoff Hutchison Sat, 17 Jul 1999 18:57:36 -0700


>        . the indexer
>        . the crawler
>        . the document parser logic -> specific parsers

At the moment, these are not really logically separate. The third is
*close* to separate, but right now, the crawler and indexer are very
tightly bound. That will be loosened up a bit when the Transport separates
the *connection* portion of the crawler from the rest of the code.

I'm not entirely certain breaking the crawler and the indexer into
separate pieces is a good thing. Clearly it would be useful to have the
crawler be relatively independent of the connection libraries. This would
enable a wider variety of protocols (WAIS, HTTPS, FTP and NNTP come to
mind).
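
To make the idea of a crawler that is independent of the connection libraries concrete, here is a minimal sketch. Every name in it (Transport as an abstract class with a fetch() method, FakeHttpTransport, TransportRegistry) is hypothetical and chosen for illustration; none of it is taken from the ht://Dig source. The assumption is that the crawler picks a transport by URL scheme and never touches protocol details itself.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical interface: the crawler only ever sees this abstract class.
class Transport {
public:
    virtual ~Transport() = default;
    // Fetch the document at `url` into `body`; returns false on failure.
    virtual bool fetch(const std::string& url, std::string& body) = 0;
};

// Stand-in for a real HTTP implementation (no network access here).
class FakeHttpTransport : public Transport {
public:
    bool fetch(const std::string& url, std::string& body) override {
        body = "<html>stub for " + url + "</html>";
        return true;
    }
};

// Maps a URL scheme ("http", "ftp", ...) to the transport that handles it.
class TransportRegistry {
public:
    void add(const std::string& scheme, std::unique_ptr<Transport> t) {
        transports_[scheme] = std::move(t);
    }

    // Returns nullptr when no transport is registered for the URL's scheme.
    Transport* lookup(const std::string& url) const {
        const std::string::size_type pos = url.find("://");
        if (pos == std::string::npos) return nullptr;
        auto it = transports_.find(url.substr(0, pos));
        return it == transports_.end() ? nullptr : it->second.get();
    }

private:
    std::map<std::string, std::unique_ptr<Transport>> transports_;
};
```

With this shape, supporting a new protocol would mean registering one more Transport subclass; the crawler loop itself would not change.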

However, a fair chunk of research on IR demonstrates the utility of
including information on the link structure as part of the index. I took a
tentative step in this direction with the storage of the link text
(descriptions in DocumentRef.cc) and the controversial backlink_factor.
But these make it much more complicated to separate the crawler and
indexer.
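
A rough sketch of why this couples the two: the record the indexer stores ends up carrying data that only the crawler can observe while it follows links. The struct and functions below are hypothetical (this is not the real DocumentRef), and the weighting formula, the factor times the ratio of incoming to outgoing links, is an assumption made for illustration only.

```cpp
#include <string>
#include <vector>

// Hypothetical per-document record; not ht://Dig's DocumentRef.
struct DocRecord {
    std::string url;
    std::vector<std::string> descriptions;  // link text of links pointing here
    int backlinks = 0;                      // links pointing to this document
    int links = 0;                          // links leaving this document
};

// Crawler side: called once for every link from `source` to `target`.
// Note that it writes into the record of a document other than the one
// currently being parsed.
void recordLink(DocRecord& source, DocRecord& target, const std::string& text) {
    ++source.links;
    ++target.backlinks;
    target.descriptions.push_back(text);
}

// Index/search side: assumed bonus of factor * (backlinks / links).
double backlinkBonus(const DocRecord& d, double backlinkFactor) {
    if (d.links == 0) return 0.0;  // assumption: no outgoing links, no bonus
    return backlinkFactor * static_cast<double>(d.backlinks) / d.links;
}
```

The point of the sketch is that recordLink() runs during crawling but updates exactly the fields that ranking later depends on, so a clean crawler/indexer split needs some way to hand that link information across.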

>        . the query solver

By this, I assume you mean the steps of looking up the words,
performing requisite logic on the lists and forming lists of ranked
results? I might even break this into a few parts--after all, each
individual Fuzzy algorithm is essentially a "plug-in."

Solver (as it stands now, after the query is parsed into words)
 1) Fuzzy transformations (expands wordlist into a weighted wordlist)
 2) Boolean transformations (soon to include phrase)
 3) Sorting (and Filtering)
 4) Document Retrieval

I would say that #1 is very good already. It would be nice if each fuzzy
algorithm could weight its responses as well (e.g. on a scale from 0 for
no match to 1 for an exact match).
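
As a sketch of step #1 with per-algorithm weights, assuming each fuzzy algorithm is a plug-in behind a common interface: the names FuzzyAlgorithm, expand(), ExactFuzzy, PluralFuzzy and expandQuery() are all made up for this example, and the 0.5 weight is arbitrary. When two algorithms propose the same word, the sketch keeps the higher weight, which is one possible policy among several.

```cpp
#include <map>
#include <string>
#include <vector>

// word -> weight in [0, 1]; 1 means an exact match.
using WeightedWords = std::map<std::string, double>;

// Hypothetical plug-in interface for one fuzzy algorithm.
class FuzzyAlgorithm {
public:
    virtual ~FuzzyAlgorithm() = default;
    virtual void expand(const std::string& word, WeightedWords& out) const = 0;
};

// Keep the highest weight when several algorithms propose the same word.
void addWeighted(WeightedWords& out, const std::string& word, double weight) {
    auto it = out.find(word);
    if (it == out.end() || it->second < weight) out[word] = weight;
}

// The query word itself always matches exactly.
class ExactFuzzy : public FuzzyAlgorithm {
public:
    void expand(const std::string& word, WeightedWords& out) const override {
        addWeighted(out, word, 1.0);
    }
};

// Toy suffix stripper: "dogs" also proposes "dog" at an arbitrary weight.
class PluralFuzzy : public FuzzyAlgorithm {
public:
    void expand(const std::string& word, WeightedWords& out) const override {
        if (word.size() > 1 && word.back() == 's')
            addWeighted(out, word.substr(0, word.size() - 1), 0.5);
    }
};

// Step #1: expand a word list into a weighted word list.
WeightedWords expandQuery(const std::vector<std::string>& words,
                          const std::vector<const FuzzyAlgorithm*>& algorithms) {
    WeightedWords out;
    for (const std::string& w : words)
        for (const FuzzyAlgorithm* a : algorithms)
            a->expand(w, out);
    return out;
}
```

The later Boolean and sorting steps could then combine these weights with whatever per-document scores they already compute.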

#2 would be nice to modularize since it could allow for fancy
transformations.

#3 would be a fantastically powerful module since it could offer
restricts, excludes and/or sorting on just about anything and probably in
any combination. :-)
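
A minimal sketch of what such a #3 module might look like, assuming matches are plain structs and that restricts, excludes and the sort order are all passed in as callables. Match, Filter, Order, sortAndFilter(), restrictTo() and exclude() are hypothetical names, not existing htsearch code.

```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <vector>

// Hypothetical search hit.
struct Match {
    std::string url;
    double score;
    long modified;  // e.g. a timestamp
};

using Filter = std::function<bool(const Match&)>;                // true = keep
using Order = std::function<bool(const Match&, const Match&)>;   // "a before b"

// Drop every match rejected by any filter, then sort what is left.
std::vector<Match> sortAndFilter(std::vector<Match> matches,
                                 const std::vector<Filter>& filters,
                                 const Order& order) {
    auto rejected = [&filters](const Match& m) {
        return !std::all_of(filters.begin(), filters.end(),
                            [&m](const Filter& f) { return f(m); });
    };
    matches.erase(std::remove_if(matches.begin(), matches.end(), rejected),
                  matches.end());
    std::stable_sort(matches.begin(), matches.end(), order);
    return matches;
}

// Restrict: keep only URLs starting with `prefix`.
Filter restrictTo(const std::string& prefix) {
    return [prefix](const Match& m) {
        return m.url.compare(0, prefix.size(), prefix) == 0;
    };
}

// Exclude: drop URLs containing `needle`.
Filter exclude(const std::string& needle) {
    return [needle](const Match& m) {
        return m.url.find(needle) == std::string::npos;
    };
}

bool byScoreDescending(const Match& a, const Match& b) {
    return a.score > b.score;
}
```

Because filters and orderings are just callables, a caller could combine any number of restricts and excludes and sort on score, date or anything else stored in Match without the module itself changing.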

>        . the user interface -> HTML, ...

This should almost definitely be separated. Wouldn't it be nice if
htnotify could send messages that were templates? Or if you could query
htsearch/htnotify to find out if a document has been modified?

>        Of course a very important step to fully take advantage of this
> is to define interfaces for each library. But this can be done
> afterwards, it does not prevent us from implementing the mechanism. We
> have good working examples of this in gimp or dia. It will mainly
> require the following:

Last summer when I took over, there was much discussion of new APIs. It
would be very nice to allow more support for external utilities to "plug
in," say as a new filter for search results, or to support a new transport
protocol (HTTPS anyone?).

-Geoff
