Geoff Hutchison writes:
>
> I'm not entirely certain breaking the crawler and the indexer into
> separate pieces is a good thing. Clearly it would be useful to have the
> crawler be relatively independent of the connection libraries. This would
> enable a wider variety of protocols (WAIS, HTTPS, FTP and NNTP come to
> mind).
Exactly.
> However, a fair chunk of research on IR demonstrates the utility of
> including information on the link structure as part of the index. I took a
> tentative step in this direction with the storage of the link text
> (descriptions in DocumentRef.cc) and the controversial backlink_factor.
> But these make it much more complicated to separate the crawler and
> indexer.
That basically means that the interface between the crawler (or document
feeder, if we want to see it in a more general way) and the indexer has to
allow this kind of information to be passed to the indexer. I don't have a
ready-to-use, smart and clear interface to propose at present, but I guess
one will slowly emerge once we start to separate the indexer and the crawler.
If we cannot imagine an interface between the crawler and the indexer
regarding the link information at present, we can keep the indexer and
crawler tightly bound for that purpose.
I'll be more verbose on the interface after reading the existing code.
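
As a starting point for discussion only, here is a minimal sketch in Python
(not htdig code) of a record that a document feeder could hand to the indexer.
All names (OutLink, FedDocument, ToyIndexer) are hypothetical. The only point
is that link targets and link text travel with the document, so the indexer
can store descriptions and count backlinks without knowing how the document
was fetched.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class OutLink:
    target_url: str   # URL the link points to
    anchor_text: str  # link text, usable as a description of the target


@dataclass
class FedDocument:
    """Hypothetical record passed from the feeder to the indexer."""
    url: str
    body: str
    links: list[OutLink] = field(default_factory=list)


class ToyIndexer:
    """Hypothetical indexer that only sees FedDocument records."""

    def __init__(self) -> None:
        self.bodies: dict[str, str] = {}
        self.descriptions: dict[str, list[str]] = defaultdict(list)
        self.backlinks: dict[str, int] = defaultdict(int)

    def add(self, doc: FedDocument) -> None:
        self.bodies[doc.url] = doc.body
        for link in doc.links:
            # Link information is recorded against the *target* document.
            self.descriptions[link.target_url].append(link.anchor_text)
            self.backlinks[link.target_url] += 1


indexer = ToyIndexer()
indexer.add(FedDocument("http://a/", "home page", [OutLink("http://b/", "the manual")]))
indexer.add(FedDocument("http://c/", "other page", [OutLink("http://b/", "docs")]))
```

With a record like this, the feeder could be a crawler, a local file walker or
an NNTP reader; the indexer would not need to care.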
> > . the query solver
>
> By this, I assume you mean the steps of looking up the words,
> performing requisite logic on the lists and forming lists of ranked
> results?
Yes.
> I might even break this into a few parts--after all, each
> individual Fuzzy algorithm is essentially a "plug-in."
>
> Solver (as it stands now, after the query is parsed into words)
> 1) Fuzzy transformations (expands wordlist into a weighted wordlist)
> 2) Boolean transformations (soon to include phrase)
> 3) Sorting (and Filtering)
> 4) Document Retrieval
>
> I would say that #1 is very good already. It would be nice if the fuzzy
> algorithm could weight its responses as well (e.g. 0 - 1 for no match to
> exact match).
>
> #2 would be nice to modularize since it could allow for fancy
> transformations.
>
> #3 would be a fantastically powerful module since it could offer
> restricts, excludes and/or sorting on just about anything and probably in
> any combination. :-)
Yes. Here is how I see the dependencies between those modules:
[ in : User
  -> Query parser (htdig Boolean, AltaVista simple, AltaVista advanced ...)
  out: normalized syntax tree ]
[ in : normalized syntax tree
  -> Query transform { Fuzzy, Thesaurus } (may be Fuzzy + Thesaurus)
  out: normalized syntax tree ]
[ in : normalized syntax tree
  -> Query solver (relevance ranked and sorted)
  out: list of document identifiers ]
[ in : list of document identifiers
  -> Document retrieval (extract document information)
  out: list of document information ]
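
The chain of stages can be sketched as plain functions, each consuming the
output of the previous one. This is a toy in Python, not htdig code: the
"normalized syntax tree" is reduced to a list of (word, weight) pairs, the
synonym table is made up, and the solver is deliberately naive. All function
names are hypothetical.

```python
from typing import Callable

# Assumption: a flat list of (word, weight) pairs stands in for the
# normalized syntax tree; a real tree would also carry boolean operators.
Query = list[tuple[str, float]]


def parse(user_input: str) -> Query:
    """Query parser: user text -> normalized query."""
    return [(word.lower(), 1.0) for word in user_input.split()]


def toy_thesaurus(query: Query) -> Query:
    """Query transform: add half-weight synonyms from a made-up table."""
    synonyms = {"car": ["automobile"]}
    expanded = list(query)
    for word, weight in query:
        expanded += [(syn, weight * 0.5) for syn in synonyms.get(word, [])]
    return expanded


def solve(query: Query, index: dict[str, list[int]], limit: int) -> list[int]:
    """Query solver: normalized query -> ranked document identifiers."""
    scores: dict[int, float] = {}
    for word, weight in query:
        for doc_id in index.get(word, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight
    # Highest score first; ties broken by document identifier.
    return sorted(scores, key=lambda d: (-scores[d], d))[:limit]


def retrieve(doc_ids: list[int], store: dict[int, str]) -> list[str]:
    """Document retrieval: identifiers -> document information."""
    return [store[d] for d in doc_ids]


def search(user_input: str, transforms: list[Callable[[Query], Query]],
           index: dict[str, list[int]], store: dict[int, str],
           limit: int = 10) -> list[str]:
    query = parse(user_input)
    for transform in transforms:  # e.g. Fuzzy, then Thesaurus
        query = transform(query)
    return retrieve(solve(query, index, limit), store)


index = {"car": [1, 2], "automobile": [2, 3]}
store = {1: "Car rental", 2: "Car and automobile news", 3: "Automobile club"}
results = search("Car", [toy_thesaurus], index, store)
```

Because every transform maps a normalized query to a normalized query, Fuzzy
and Thesaurus steps can be chained in any order or left out entirely.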
I'm not sure where the document filtering should go. Since my main concern
is good performance on huge databases (millions of documents, gigabytes
of indexes), I think that the 'Query solver' step (looking up the words and
performing the requisite logic) is critical:
1) The document list associated with each word must not be loaded
in memory.
2) The resolution of the query must not return a list of all
possible documents matching the query. It must perform the
relevance ranking and sorting while searching the index and
only return the first N documents.
These restrictions are essential for a large-scale index that is
used under heavy load. The prerequisite is quite clear: the index
must be organized to allow this (the document list for each word must be
sorted so that ranking/sorting does not require reading the whole list). If it
is not, the whole list must be read.
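
To illustrate the two restrictions for the simplest case (a one-word query),
here is a small Python sketch. It assumes the document list for the word is
already stored best-first; the PostingFile class and its read counter are
hypothetical stand-ins for an on-disk list, not mifluz or htdig code.

```python
from itertools import islice
from typing import Iterator


class PostingFile:
    """Stand-in for an on-disk document list, stored best score first."""

    def __init__(self, postings: list[tuple[int, float]]) -> None:
        self._postings = postings
        self.reads = 0  # how many entries were actually read

    def scan(self) -> Iterator[tuple[int, float]]:
        # Entries are produced one at a time, never loaded as a whole.
        for posting in self._postings:
            self.reads += 1
            yield posting


def top_n(posting_file: PostingFile, n: int) -> list[int]:
    """Return the first n document identifiers and stop reading there."""
    return [doc_id for doc_id, _score in islice(posting_file.scan(), n)]


linux = PostingFile([(42, 0.9), (7, 0.8), (13, 0.5), (99, 0.1)])
best = top_n(linux, 2)
```

Only two of the four entries are read here. Multi-word and boolean queries
need more work (merging several sorted lists), which this sketch does not
attempt.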
This seems a bit idealistic but is not. All full text indexes have a
defined usage pattern (95% of the queries on internet search engines only
include 1 or 2 words, without boolean operators, for instance). The administrator
of the search engine must have the ability to specify the way document identifiers
are sorted for each word according to this very common usage.
My explanations are a bit vague (you'll find more in the mifluz documentation).
This needs experimentation, and that's why I think it's important to separate things. If
the indexer is separated from the crawler, and the query resolution clearly isolates
the steps that depend on the structure of the full text index, we can plug in different
implementations of the index merely by changing two libraries.
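
A rough Python sketch of that idea, under the assumption that both the
indexing side and the query side talk to the index through one small
interface. WordIndex, ByDocId and ByWeight are hypothetical names, and the
per-document weights are invented for the example.

```python
from abc import ABC, abstractmethod


class WordIndex(ABC):
    """Hypothetical interface shared by the indexer and the query solver."""

    @abstractmethod
    def add(self, word: str, doc_id: int) -> None: ...

    @abstractmethod
    def lookup(self, word: str, limit: int) -> list[int]: ...


class ByDocId(WordIndex):
    """Document lists ordered by document identifier."""

    def __init__(self) -> None:
        self._lists: dict[str, set[int]] = {}

    def add(self, word: str, doc_id: int) -> None:
        self._lists.setdefault(word, set()).add(doc_id)

    def lookup(self, word: str, limit: int) -> list[int]:
        return sorted(self._lists.get(word, ()))[:limit]


class ByWeight(ByDocId):
    """Document lists ordered by an administrator-chosen document weight."""

    def __init__(self, weights: dict[int, float]) -> None:
        super().__init__()
        self._weights = weights

    def lookup(self, word: str, limit: int) -> list[int]:
        docs = self._lists.get(word, ())
        return sorted(docs, key=lambda d: (-self._weights.get(d, 0.0), d))[:limit]


def index_documents(index: WordIndex, docs: dict[int, str]) -> None:
    """Indexing side: knows nothing about how the index is organized."""
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.add(word, doc_id)


docs = {1: "free software", 2: "software index", 3: "software"}
plain = ByDocId()
weighted = ByWeight({1: 0.2, 2: 0.9, 3: 0.5})
for idx in (plain, weighted):
    index_documents(idx, docs)
```

The same indexing code fills both implementations; only the lookup order
differs, which is the part the administrator would want to tune.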
> > . the user interface -> HTML, ...
>
> This should almost definitely be separated. Wouldn't it be nice if
> htnotify could send messages that were templates? Or if you could query
> htsearch/htnotify to find out if a document has been modified?
Yes :-)
--
Loic Dachary
ECILA
100 av. du Gal Leclerc
93500 Pantin - France
Tel: 33 1 56 96 09 80, Fax: 33 1 56 96 09 61
e-mail: [EMAIL PROTECTED] URL: http://www.senga.org/
------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
[EMAIL PROTECTED] containing the single word "unsubscribe" in
the SUBJECT of the message.