Leslie Mikesell wrote:
>
> According to jmoore:
> > What I propose is to store the words before and after each word in the
> > index.
> > -------+-----------------------------------------
> > <Word> | <Previous Word><Next Word>
>
> > The main problem with this approach as outlined, is that the index will be
> > at least 3 times the size of the collected documents since the previous
> > and next word is stored for each word.
>
> But worse, you now have to store copies of every unique 3 word
> sequence in the document instead of just unique words, so frequently
> mentioned words will expand the index even more. And you still
> won't know if one 2 word sequence links up with another to complete
> a 4 word phrase. Could you instead store a list of positions within
> the document where each word appears? Then after looking up the
> potential matches containing all the words, discard the ones where
> you can't assemble a consecutive list of position numbers matching
> the words in the phrase.
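The position-list idea quoted above can be sketched in a few lines. This is only an illustrative in-memory version (the names and layout here are mine, not ht://Dig's): map each word to its list of positions, then check whether the phrase's words line up at consecutive positions.

```python
# Sketch of phrase matching with per-word position lists.
# Illustrative only; not ht://Dig code.

def build_index(text):
    """Map each word to the list of positions where it occurs."""
    index = {}
    for pos, word in enumerate(text.lower().split()):
        index.setdefault(word, []).append(pos)
    return index

def phrase_match(index, phrase):
    """True if the words of `phrase` occur at consecutive positions."""
    words = phrase.lower().split()
    if any(w not in index for w in words):
        return False
    # A phrase starting at position p needs words[i] at position p + i.
    return any(all(p + i in index[words[i]] for i in range(1, len(words)))
               for p in index[words[0]])

idx = build_index("the quick brown fox jumps over the lazy dog")
# phrase_match(idx, "quick brown fox") -> True
# phrase_match(idx, "brown quick")     -> False
```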
Interesting ideas. My plan to support phrase searching in ht://Dig 4 would
add a word position to each word record (the first word in the document is
word #0, the second is #1, and so on).
So, the word table looks something like this:
create table word
(
    word     varchar(<wordlength>),  -- the word in question
    docid    int,                    -- reference to the word's document
    index    int,                    -- position of the word in the doc, starting at 0
    anchorid int,                    -- reference to the nearest named anchor
    context  byte                    -- context of the word (<h1> vs. <title>, etc.)
);
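With that layout, a two-word phrase search reduces to a self-join on adjacent positions. Here is a minimal sqlite3 sketch of the idea (my example, not ht://Dig code; I've renamed the `index` column to `pos`, since `index` is a reserved word in most SQL dialects, and dropped the anchor/context columns for brevity):

```python
import sqlite3

# Minimal demo of the proposed word table. Names and data are illustrative.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE word (word TEXT, docid INT, pos INT)")

doc = "free text search engine for free software"
con.executemany("INSERT INTO word VALUES (?, 1, ?)",
                [(w, i) for i, w in enumerate(doc.split())])

# Phrase "free software": the second word must sit exactly one
# position after the first, in the same document.
rows = con.execute("""
    SELECT w1.docid, w1.pos
    FROM word w1 JOIN word w2
      ON w2.docid = w1.docid AND w2.pos = w1.pos + 1
    WHERE w1.word = 'free' AND w2.word = 'software'
""").fetchall()
# rows -> [(1, 5)]  (only the second "free" starts the phrase)
```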
The only thing I am not sure about is whether to put the anchorid field in the
word table or to create a separate table that maps documents to anchors and
their positions in the document. A separate table would save 4 bytes per word
record while still providing the anchor information.
With this scheme, several types of searches become possible:
* phrase searches
* "near" searches
* "before" or "after" searches
* etc.
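A "near" search, for example, falls out of the same table: instead of requiring adjacent positions, bound the distance between the two words. Again a sketch under the same assumptions as above (`index` renamed to `pos`, illustrative data):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE word (word TEXT, docid INT, pos INT)")

doc = "search engines index many documents and search them quickly"
con.executemany("INSERT INTO word VALUES (?, 1, ?)",
                [(w, i) for i, w in enumerate(doc.split())])

# "documents NEAR/3 search": the words must occur within 3 positions
# of each other, in either order, in the same document.
rows = con.execute("""
    SELECT w1.pos, w2.pos
    FROM word w1 JOIN word w2
      ON w2.docid = w1.docid AND ABS(w2.pos - w1.pos) <= 3
    WHERE w1.word = 'documents' AND w2.word = 'search'
""").fetchall()
# rows -> [(4, 6)]  (the first "search", at position 0, is too far away)
```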
The only disadvantage is the extra space compared to the variable-length
GDBM records ht://Dig currently uses.
Even so, this is the main reason I want to move away from a hash-based
database like GDBM: the scheme needs one record per word occurrence
(non-unique keys), which a one-value-per-key hash database like GDBM
doesn't support directly.
Thoughts/comments?
--
Andrew Scherpbier <[EMAIL PROTECTED]>
Contigo Software <http://www.contigo.com/>