Re: htdig: Preliminary proposal for data structures to support phrases

Jason Moore Mon, 8 Jun 1998 13:57:00 -0400
On Mon, 8 Jun 1998, Andrew Scherpbier wrote:

> Leslie Mikesell wrote:
> > 
> > According to jmoore:
> > > What I propose is to store the words before and after each word in the
> > > index.
> > >   -------+-----------------------------------------
> > >   <Word> |  <Previous Word><Next Word>
> > 

> > Could you instead store a list of positions within
> > the document where each word appears?  Then after looking up the
> > potential matches containing all the words, discard the ones where
> > you can't assemble a consecutive list of position numbers matching
> > the words in the phrase.
> 
> Interesting ideas.  My plan to support phrase searching in ht://Dig 4 would
> add a word position to each word (first word in the document is word #0,
> second one #1, etc.)
> So, the word table looks something like this:
> 
> create table word
> (
>     word varchar(<wordlength>), // The word in question
>     docid int,                  // reference to the document of the word
>     index int,                  // location of word in doc, starting at 0
>     anchorid int,               // references to nearest named anchor
>     context byte                // context of word (<h1> vs. <title>, etc)
> );
> 
> The only thing I am not sure about is whether to put the anchorid field in the
> word table or create a separate table that maps documents to anchors and their
> index in the document.  It would save the 4 bytes per word record and still
> provide the anchor information.
> 

Ah, you want to start using a relational database as the backend.  Each
word is a row in the word table.  ok.

(btw - what is the purpose of anchorid?  It currently exists in htdig and
I couldn't figure out what it does.  Does it need to be an long int?)

> With this scheme, several types of searches become possible:
> *  phrase searches
> *  "near" searches
> *  "before" or "after" searches
> *  etc.
> 
> The only disadvantage is the space complexity compared to the GDBM variable
> length record approach currently used in ht://Dig.
> This is the main reason I want to move away from a hash-based database like
> GDBM; this approach requires non-unique keys.
>

I'm just wondering how you would search for a phrase with the rdb
approach.  It seems to me that you would have to run a lot of sql queries. 
One to find the docs and locations of the first word.  Then for each
<docid> <index/location> returned from this query, another query would
have to be run looking for matches in each document with the second word
and a location of +1..  I'm not by any means a sql god but, i'd really be
interested in seeing some (pseduo) sql that would do this.

:jason

[EMAIL PROTECTED]

----------------------------------------------------------------------
To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED] containing the single word "unsubscribe" in
the body of the message.
Re: htdig: Preliminary proposal for data structures to support phrases

Reply via email to