Hi,
> Wow! I wish you had been able to respond earlier! I think we can
Yes, the mail was delayed for no reason... too bad.
> definitely do some work together. I'm currently approaching the problem
> from the other direction. Rather than writing new code that replicates what
> htdig already does, I'm recoding parts of htsearch to be more
> object-oriented, and was planning on hooking into the classes from XS. It
> sounds like you're getting much more low-level than I am right now.
Yes again. It will be nice to have both ends meet and work together.
I strongly think we have to make plans together. I'm going to work on the
structure of the word database and the classes to manipulate it until we
have something that fulfills the search needs.
We must first define where we are going, to prevent incompatible or
redundant code. Could you please describe what you've done, what you'll
do and how long it will take you? Apart from planning we must also agree
on the data format. Separating the parse and search phases requires a well
defined data structure to communicate between the two. Talking on the list
will be useful but code will talk too :-) I suggest (comments, Geoff?)
creating a temporary CVS branch for that purpose only. We will be able to
commit ugly and non-working code without bothering the main branch until
we're done. Committing daily will keep us from stepping on each other's toes.
Here is a small description of the current situation (I hope you have
the latest update; I've changed the WordReference class a lot in the last
week, and there are many comments in the code that explain why and how) and
my next changes.
-----------------------------
Data:
. The word database
Key: containing Word, DocID, Flags, Location
Record: containing Anchor
I DO: Add an entry for each distinct word that contains statistics
about the word. Most important: overall word frequency. This
is critical for search performance.
. The document database
DO NOTHING: (? you agree). If you plan to work on that too, be aware
that the next evolution is to use an SQL database to store
this. Using Berkeley DB is a pain for this purpose. It fits
the needs of the word inverted index *very* well but is
definitely not what is needed for the document database.
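To make the word database layout above concrete, here is a minimal C++
sketch of a key made of Word, DocID, Flags and Location, a record holding
only the Anchor, and a per-word frequency counter. Every name in it
(SketchKey, SketchRecord, SketchStats, SketchWordDb) is hypothetical and
is not the actual WordKey/WordReference code; the field widths, and the
choice to keep the statistics in a separate in-memory map instead of an
entry in the word database itself, are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <tuple>

// Hypothetical stand-in for the word database key: Word, DocID, Flags, Location.
struct SketchKey {
    std::string word;
    std::uint32_t doc_id;
    std::uint32_t flags;
    std::uint32_t location;
};

// Compare field by field, so that all entries for one word are adjacent
// and, within a word, all entries for one document are adjacent.
inline bool operator<(const SketchKey& a, const SketchKey& b) {
    return std::tie(a.word, a.doc_id, a.flags, a.location)
         < std::tie(b.word, b.doc_id, b.flags, b.location);
}

// Hypothetical record: only the anchor is stored as data.
struct SketchRecord {
    std::uint32_t anchor;
};

// Hypothetical per-word statistics; only the overall frequency is kept here.
struct SketchStats {
    std::uint64_t frequency = 0;
};

// In-memory stand-in for the word database (assumption: no Berkeley DB here).
class SketchWordDb {
public:
    // Insert one occurrence; the frequency only grows when the key is new.
    void insert(const SketchKey& key, const SketchRecord& record) {
        if (entries_.emplace(key, record).second) {
            ++stats_[key.word].frequency;
        }
    }

    // Overall number of occurrences recorded for a word (0 if unknown).
    std::uint64_t frequency(const std::string& word) const {
        auto it = stats_.find(word);
        return it == stats_.end() ? 0 : it->second.frequency;
    }

    std::size_t size() const { return entries_.size(); }

private:
    std::map<SketchKey, SketchRecord> entries_;
    std::map<std::string, SketchStats> stats_;
};
```

Keeping the counter up to date at insertion time is what would let a
search read the frequency of a word without scanning its occurrences.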
Functions:
a The word insertion/update/delete (indexing)
I DO: write tests, fix the delete, make sure htmerge is not needed anymore
(all of that purely dynamic), update statistics.
b The document parsing
DO NOTHING (? you agree)
c The search query parsing (building a query syntax tree)
YOU DO: ? Which syntax? Which structure for the syntax tree? I strongly
advocate AltaVista syntax + a syntax tree able to contain
everything needed for AltaVista syntax (simple + advanced) to
work. htdig syntax can/should be translated to the syntax tree.
If the syntax tree is not powerful enough to handle AltaVista
syntax we lose a big opportunity.
d The query resolution (using the syntax tree to match words)
WE DO: There are a number of
constraints we definitely want to meet here: the memory space,
CPU time and I/O used to resolve a query must grow linearly as
a function of the number of terms and the complexity of the query.
The linear factor must be as small as possible. To achieve that
I basically have *two* ideas: all search terms must be searched
in parallel, and the least frequent terms must always be considered
first. The WordList::Walk method allows traversal of the list
and must be used instead of Find, which returns the whole list
of matching words. Using Find is a killer for big indexes (think of
retrieving all the occurrences of 'the' in a 1 million document
database :-). The new structure of the WordKey class also helps.
It's fast and easy to say: search this word in this document,
because the document id is now part of the key.
e The information retrieval (given the top N matches for a query,
retrieve the relevant document information)
DO NOTHING (? you agree)
f The information display
DO NOTHING (? you agree)
--------------------------------------
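The ideas in the query resolution item above (walk instead of fetching
whole lists, least frequent term first, direct word-in-document lookups)
can be illustrated with a small C++ sketch. It is only one possible
reading, under stated assumptions: the index is an in-memory ordered set
of (word, document id) pairs, and SketchIndex, walk_word, word_in_doc and
match_all are hypothetical names, not the WordList or WordKey API. The
sketch handles a plain AND of terms and does not try to show the parallel
traversal of all terms.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using DocId = std::uint32_t;

// Hypothetical index: ordered (word, doc id) pairs standing in for the word database.
using SketchIndex = std::set<std::pair<std::string, DocId>>;

// Visit the documents of one word in key order, one at a time, without
// building a list. The callback returns false to stop the traversal early.
inline void walk_word(const SketchIndex& index, const std::string& word,
                      const std::function<bool(DocId)>& visit) {
    for (auto it = index.lower_bound({word, 0});
         it != index.end() && it->first == word; ++it) {
        if (!visit(it->second)) break;
    }
}

// One ordered lookup answers "is this word in this document",
// which works because the document id is part of the key.
inline bool word_in_doc(const SketchIndex& index, const std::string& word, DocId doc) {
    return index.count({word, doc}) != 0;
}

// Documents containing every term. The least frequent term drives the walk,
// so a very common word is only probed, never enumerated.
inline std::vector<DocId> match_all(const SketchIndex& index,
                                    const std::map<std::string, std::size_t>& frequency,
                                    std::vector<std::string> terms) {
    std::vector<DocId> result;
    if (terms.empty()) return result;

    // Unknown words count as frequency 0 and therefore come first.
    auto freq = [&](const std::string& w) {
        auto it = frequency.find(w);
        return it == frequency.end() ? std::size_t{0} : it->second;
    };
    std::sort(terms.begin(), terms.end(),
              [&](const std::string& a, const std::string& b) { return freq(a) < freq(b); });

    walk_word(index, terms.front(), [&](DocId doc) {
        bool all = true;
        for (std::size_t i = 1; i < terms.size() && all; ++i) {
            all = word_in_doc(index, terms[i], doc);
        }
        if (all) result.push_back(doc);
        return true;  // keep walking; a top-N search could return false here
    });
    return result;
}
```

With this shape, a query such as 'the' AND 'htdig' costs one walk over the
occurrences of 'htdig' plus one lookup of 'the' per candidate document,
whatever the number of documents that contain 'the'.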
I've committed the hardest part for 'a' and hope to finish it by the end
of the week (understand end of next week :-). I'm still concerned about the
fact that WordReference is still tied to a specific database structure;
I will eventually switch to an abstract implementation, but not for this
version.
To summarize we have to :
. Make a rough plan of action (the list above may be a start)
. Define the structure of the syntax tree (I'm ready to write a
proposal in the next mail. IMHO we have to think about a structure
that will map well to Perl).
. Create a branch on CVS to share partial work
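As a starting point for the syntax tree discussion, and not as the
proposal itself, here is a minimal C++ sketch of a tree with word and
phrase leaves and AND, OR, NOT and NEAR operator nodes, which are the
operators usually associated with AltaVista-style advanced queries. All
names (NodeKind, QueryNode, word, phrase, op, render) are hypothetical.
Each node is just a kind, a text and a list of children, on the
assumption that such a shape would translate easily to nested Perl
arrays or hashes.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node kinds: two leaf kinds and four operators.
enum class NodeKind { Word, Phrase, And, Or, Not, Near };

// One node of the query tree. Leaves use `text`, operators use `children`.
// A vector of the enclosing type as a member is supported since C++17.
struct QueryNode {
    NodeKind kind;
    std::string text;
    std::vector<QueryNode> children;
};

// Small builders to keep tree construction readable.
inline QueryNode word(std::string w) { return {NodeKind::Word, std::move(w), {}}; }
inline QueryNode phrase(std::string p) { return {NodeKind::Phrase, std::move(p), {}}; }
inline QueryNode op(NodeKind kind, std::vector<QueryNode> operands) {
    return {kind, "", std::move(operands)};
}

// Name of an operator node; leaves have no operator name.
inline const char* op_name(NodeKind kind) {
    switch (kind) {
        case NodeKind::And:  return "AND";
        case NodeKind::Or:   return "OR";
        case NodeKind::Not:  return "NOT";
        case NodeKind::Near: return "NEAR";
        default:             return "";
    }
}

// Render the tree in a parenthesized prefix form, handy for debugging
// and for checking that a parser built the tree it was expected to build.
inline std::string render(const QueryNode& node) {
    if (node.kind == NodeKind::Word) return node.text;
    if (node.kind == NodeKind::Phrase) return "\"" + node.text + "\"";
    std::string out = std::string("(") + op_name(node.kind);
    for (const QueryNode& child : node.children) {
        out += " " + render(child);
    }
    return out + ")";
}
```

A simple-syntax query and an advanced-syntax query could both be parsed
into this one tree, so that the query resolution code only ever sees a
single structure.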
> As guess as far as source, I'll show you mine if you'll show me yours!
> Mine's not really ready for even casual viewing yet, because as you stated,
> the intermingling is pretty ugly and it's a bitch getting the functionality
> separated out into classes.
Well, you have all the source I wrote in the CVS tree, let's see yours :-)
Cheers,
--
Loic Dachary
ECILA
100 av. du Gal Leclerc
93500 Pantin - France
Tel: 33 1 56 96 09 80, Fax: 33 1 56 96 09 61
e-mail: [EMAIL PROTECTED] URL: http://www.senga.org/