> . the indexer
> . the crawler
> . the document parser logic -> specific parsers
At the moment, these are not really logically separate. The third is
*close* to separate, but right now, the crawler and indexer are very
tightly bound. That will be loosened up a bit when the Transport separates
the *connection* portion of the crawler from the rest of the code.
I'm not entirely certain breaking the crawler and the indexer into
separate pieces is a good thing. Clearly it would be useful to have the
crawler be relatively independent of the connection libraries. This would
enable a wider variety of protocols (WAIS, HTTPS, FTP and NNTP come to
mind).
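
To make the idea concrete, here is a rough sketch of what "crawler independent of the connection libraries" could look like. Everything in it is hypothetical (Transport, FetchResult, CannedTransport and TransportRegistry are illustrative names, not the actual code), and it assumes URL schemes arrive already lowercased:

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical result of fetching one URL (illustrative, not existing code).
struct FetchResult {
    bool ok;
    std::string contentType;
    std::string body;
};

// Hypothetical connection layer: one subclass per protocol
// (HTTP, HTTPS, FTP, NNTP, ...). The crawler only sees this interface.
class Transport {
public:
    virtual ~Transport() = default;
    virtual FetchResult fetch(const std::string& url) = 0;
};

// Stand-in transport that serves canned documents, so the sketch
// can be exercised without touching the network.
class CannedTransport : public Transport {
public:
    void add(const std::string& url, const std::string& body) { docs_[url] = body; }

    FetchResult fetch(const std::string& url) override {
        auto it = docs_.find(url);
        if (it == docs_.end()) return FetchResult{false, "", ""};
        return FetchResult{true, "text/html", it->second};
    }

private:
    std::map<std::string, std::string> docs_;
};

// Maps a URL scheme to the transport that handles it.
class TransportRegistry {
public:
    void add(const std::string& scheme, std::unique_ptr<Transport> t) {
        assert(t != nullptr);
        byScheme_[scheme] = std::move(t);
    }

    // Returns nullptr when the URL has no scheme or no transport handles it.
    Transport* forUrl(const std::string& url) const {
        std::string::size_type colon = url.find(':');
        if (colon == std::string::npos) return nullptr;
        auto it = byScheme_.find(url.substr(0, colon));
        return it == byScheme_.end() ? nullptr : it->second.get();
    }

private:
    std::map<std::string, std::unique_ptr<Transport>> byScheme_;
};
```

With something like this, supporting a new protocol would mean registering one more Transport subclass rather than touching the crawler itself.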
However, a fair chunk of research on IR demonstrates the utility of
including information on the link structure as part of the index. I took a
tentative step in this direction with the storage of the link text
(descriptions in DocumentRef.cc) and the controversial backlink_factor.
But these make it much more complicated to separate the crawler and
indexer.
> . the query solver
By this, I assume you mean the steps of looking up the words,
performing requisite logic on the lists and forming lists of ranked
results? I might even break this into a few parts--after all, each
individual Fuzzy algorithm is essentially a "plug-in."
Solver (as it stands now, after the query is parsed into words)
1) Fuzzy transformations (expands wordlist into a weighted wordlist)
2) Boolean transformations (soon to include phrase)
3) Sorting (and Filtering)
4) Document Retrieval
I would say that #1 is very good already. It would be nice if the fuzzy
algorithm could weight its responses as well (e.g. from 0 for no match to
1 for an exact match).
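
A minimal sketch of what weighted fuzzy "plug-ins" might look like, purely as an illustration. The names (FuzzyAlgorithm, WeightedWord, ExactMatch, PluralStrip, expandWord) are made up for this example and are not the current Fuzzy classes, and the 0.5 weight for the toy plural stripper is an arbitrary assumption:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One expanded word with a weight in [0, 1]: 0 = no match, 1 = exact match.
struct WeightedWord {
    std::string word;
    double weight;
};

// Hypothetical plug-in interface: each fuzzy algorithm expands one query word.
class FuzzyAlgorithm {
public:
    virtual ~FuzzyAlgorithm() = default;
    virtual std::vector<WeightedWord> expand(const std::string& word) const = 0;
};

// The word itself always counts as an exact match.
class ExactMatch : public FuzzyAlgorithm {
public:
    std::vector<WeightedWord> expand(const std::string& word) const override {
        return std::vector<WeightedWord>{WeightedWord{word, 1.0}};
    }
};

// Toy algorithm: strips a trailing "s" and reports it at an assumed weight of 0.5.
class PluralStrip : public FuzzyAlgorithm {
public:
    std::vector<WeightedWord> expand(const std::string& word) const override {
        std::vector<WeightedWord> out;
        if (word.size() > 1 && word[word.size() - 1] == 's') {
            out.push_back(WeightedWord{word.substr(0, word.size() - 1), 0.5});
        }
        return out;
    }
};

// Runs every algorithm on the word and keeps the highest weight seen
// for each expanded form.
std::map<std::string, double> expandWord(
        const std::string& word,
        const std::vector<const FuzzyAlgorithm*>& algorithms) {
    std::map<std::string, double> best;
    for (const FuzzyAlgorithm* algorithm : algorithms) {
        for (const WeightedWord& w : algorithm->expand(word)) {
            assert(w.weight >= 0.0 && w.weight <= 1.0);
            double& slot = best[w.word];  // starts at 0.0 for a new word
            if (w.weight > slot) slot = w.weight;
        }
    }
    return best;
}
```

Taking the maximum weight per word is just one possible way to merge results from several algorithms; summing or averaging would be equally easy to plug in.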
#2 would be nice to modularize since it could allow for fancy
transformations.
#3 would be a fantastically powerful module since it could offer
restricts, excludes and/or sorting on just about anything and probably in
any combination. :-)
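
As a sketch of how a sorting/filtering module could combine restricts, excludes and an ordering, something like the following would do. The Match fields and the helper names (refine, restrictTo, exclude, byScore) are assumptions for illustration only, not existing htsearch code:

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical search hit; the fields are illustrative.
struct Match {
    std::string url;
    double score;
    long modified;  // modification time, e.g. seconds since the epoch
};

using Filter = std::function<bool(const Match&)>;                // true = keep
using Order = std::function<bool(const Match&, const Match&)>;   // true = a before b

// Keeps the matches that pass every filter, then sorts them with the given order.
std::vector<Match> refine(const std::vector<Match>& in,
                          const std::vector<Filter>& filters,
                          const Order& order) {
    assert(static_cast<bool>(order));
    std::vector<Match> out;
    for (const Match& m : in) {
        bool keep = true;
        for (const Filter& f : filters) {
            if (!f(m)) { keep = false; break; }
        }
        if (keep) out.push_back(m);
    }
    std::stable_sort(out.begin(), out.end(), order);
    return out;
}

// Restrict: keep only URLs that start with the given prefix.
Filter restrictTo(const std::string& prefix) {
    return [prefix](const Match& m) {
        return m.url.compare(0, prefix.size(), prefix) == 0;
    };
}

// Exclude: drop URLs that contain the given substring.
Filter exclude(const std::string& fragment) {
    return [fragment](const Match& m) {
        return m.url.find(fragment) == std::string::npos;
    };
}

// Highest score first.
bool byScore(const Match& a, const Match& b) { return a.score > b.score; }
```

Since filters and orderings are just callables here, sorting by date or restricting on any other field would only need another small function, in any combination.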
> . the user interface -> HTML, ...
This should almost definitely be separated. Wouldn't it be nice if
htnotify could send messages that were templates? Or if you could query
htsearch/htnotify to find out if a document has been modified?
> Of course a very important step to fully take advantage of this
> is to define interfaces for each library. But this can be done
> afterwards, it does not prevent us from implementing the mechanism. We
> have good working examples of this in gimp or dia. It will mainly
> require the following:
Last summer when I took over, there was much discussion of new APIs. It
would be very nice to allow more support for external utilities to "plug
in," say as a new filter for search results, or to support a new transport
protocol (HTTPS anyone?).
-Geoff