Leslie Mikesell wrote:
> 
> According to jmoore:
> > What I propose is to store the words before and after each word in the
> > index.
> >   -------+-----------------------------------------
> >   <Word> |  <Previous Word><Next Word>
> 
> > The main problem with this approach as outlined, is that the index will be
> > at least 3 times the size of the collected documents since the previous
> > and next word is stored for each word.
> 
> But worse, you now have to store copies of every unique 3 word
> sequence in the document instead of just unique words, so frequently
> mentioned words will expand the index even more.  And you still
> won't know if one 2 word sequence links up with another to complete
> a 4 word phrase.  Could you instead store a list of positions within
> the document where each word appears?  Then after looking up the
> potential matches containing all the words, discard the ones where
> you can't assemble a consecutive list of position numbers matching
> the words in the phrase.
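
The position-based check described there could be sketched roughly like this
(plain Python, all names hypothetical; this is not ht://Dig code):

```python
def phrase_match(position_lists):
    """position_lists[i] holds the sorted positions of the i-th phrase
    word within one candidate document.  The document matches if some
    occurrence of the first word is followed by the remaining words at
    consecutive positions."""
    first, rest = position_lists[0], position_lists[1:]
    for start in first:
        # the (i+1)-th phrase word must sit at position start + i + 1
        if all(start + i + 1 in plist for i, plist in enumerate(rest)):
            return True
    return False

# "quick brown fox" against a document with these position lists:
doc = {"quick": [2, 17], "brown": [3], "fox": [4, 9]}
print(phrase_match([doc["quick"], doc["brown"], doc["fox"]]))  # True
print(phrase_match([doc["fox"], doc["brown"], doc["quick"]]))  # False
```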

Interesting ideas.  My plan to support phrase searching in ht://Dig 4 would
add a word position to each word record (the first word in the document is
word #0, the second #1, etc.).
So, the word table looks something like this:

create table word
(
    word varchar(<wordlength>), -- the word in question
    docid int,                  -- reference to the document of the word
    index int,                  -- location of word in doc, starting at 0
    anchorid int,               -- reference to the nearest named anchor
    context byte                -- context of word (<h1> vs. <title>, etc.)
);
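
As a quick sanity check, the schema can be exercised with SQLite from Python.
Note that "index" is a reserved word in most SQL dialects, so the position
column is called pos in this sketch, and the anchorid and context fields just
get placeholder values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE word (
    word     TEXT,     -- the word in question
    docid    INTEGER,  -- reference to the document of the word
    pos      INTEGER,  -- location of word in doc, starting at 0
    anchorid INTEGER,  -- nearest named anchor (placeholder here)
    context  INTEGER   -- context of word (placeholder here)
)""")

def index_document(docid, text):
    # one row per word occurrence, numbered from 0
    rows = [(w.lower(), docid, i, 0, 0) for i, w in enumerate(text.split())]
    conn.executemany("INSERT INTO word VALUES (?, ?, ?, ?, ?)", rows)

index_document(1, "the quick brown fox jumps over the lazy dog")
print(conn.execute(
    "SELECT word, pos FROM word WHERE docid = 1 ORDER BY pos").fetchall())
```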

The only thing I am not sure about is whether to put the anchorid field in the
word table or to create a separate table that maps documents to anchors and
their positions in the document.  The separate table would save 4 bytes per
word record and still provide the anchor information.

With this scheme, several types of searches become possible:
*  phrase searches
*  "near" searches
*  "before" or "after" searches
*  etc.
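
For instance, with positions stored per occurrence (again with the position
column renamed to pos, since "index" is reserved in SQL), a phrase search is
one self-join per extra word requiring consecutive positions, and a "near"
search is the same join with a window instead.  A toy SQLite sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE word (word TEXT, docid INTEGER, pos INTEGER)")
conn.executemany("INSERT INTO word VALUES (?, ?, ?)",
    [(w, 1, i) for i, w in enumerate(
        "the quick brown fox jumps over the lazy dog".split())])

# phrase search "quick brown": the second word must follow immediately
phrase = conn.execute("""
    SELECT a.docid FROM word a JOIN word b
        ON b.docid = a.docid AND b.pos = a.pos + 1
    WHERE a.word = 'quick' AND b.word = 'brown'
""").fetchall()

# "fox" near "dog": same join, but within a 5-word window
near = conn.execute("""
    SELECT a.docid FROM word a JOIN word b
        ON b.docid = a.docid AND ABS(b.pos - a.pos) <= 5
    WHERE a.word = 'fox' AND b.word = 'dog'
""").fetchall()

print(phrase, near)  # both find document 1
```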

The only disadvantage is the space cost compared to the variable-length GDBM
records that ht://Dig currently uses.
The per-occurrence records are also the main reason I want to move away from a
hash-based database like GDBM: this approach needs non-unique keys (one record
per word occurrence), which GDBM does not support.

Thoughts/comments?

-- 
Andrew Scherpbier <[EMAIL PROTECTED]>
Contigo Software <http://www.contigo.com/>
----------------------------------------------------------------------
To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED] containing the single word "unsubscribe" in
the body of the message.
