According to jmoore:
> What I propose is to store the words before and after each word in the
> index. 
>   -------+-----------------------------------------
>   <Word> |  <Previous Word><Next Word>

> The main problem with this approach as outlined, is that the index will be
> at least 3 times the size of the collected documents since the previous
> and next word is stored for each word.

But worse, you now have to store a copy of every unique 3-word
sequence in the document instead of just each unique word, so
frequently mentioned words will expand the index even more.  And you
still won't know whether one stored sequence links up with another to
complete a 4-word phrase.  Could you instead store a list of the
positions within the document where each word appears?  Then, after
looking up the potential matches containing all the words, discard
the ones where you can't assemble a consecutive run of position
numbers matching the words in the phrase.
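
To make the idea concrete, here is a rough sketch in Python.  It is not
htdig code; the names build_index and phrase_search are made up for
illustration, and it assumes words are simply split on whitespace and
lowercased.  Each word maps to the documents it appears in, along with
its positions in each one.  A phrase matches a document only if some
start position p has the k-th following word at position p + k.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to {doc_id: [positions where the word occurs]}.

    docs is a hypothetical {doc_id: text} mapping used for illustration.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def phrase_search(index, phrase):
    """Return sorted ids of documents containing the words consecutively."""
    words = phrase.lower().split()
    if not words:
        return []
    # Step 1: potential matches are documents containing all the words.
    candidates = set(index.get(words[0], {}))
    for word in words[1:]:
        candidates &= set(index.get(word, {}))
    hits = []
    for doc_id in sorted(candidates):
        # Step 2: keep the document only if some occurrence of the first
        # word at position p is followed by word k at position p + k.
        later = [set(index[word][doc_id]) for word in words[1:]]
        starts = index[words[0]][doc_id]
        if any(all(p + k in s for k, s in enumerate(later, 1)) for p in starts):
            hits.append(doc_id)
    return hits
```

The index holds one position entry per word occurrence, so it grows
with the size of the documents rather than with the number of distinct
word sequences, and the same lookup works for a phrase of any length.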

  Les Mikesell
    [EMAIL PROTECTED]