On Mon, 8 Jun 1998, Leslie Mikesell wrote:

> According to Jason Moore:
> > 
> > > And you still
> > > won't know if one 2 word sequence links up with another to complete
> > > a 4 word phrase.  
> > 
> > Sure I do.  e.g. "One two three four" would start the search at the second
> > term and end at the second last - so there would be two searches:
> > 
> > 1. "two"   where prev is "one" and next is "three"
> > 2. "three" where prev is "two" and next is "four"
> 
> That means you have to have every three-word sequence indexed, not
> just every unique sequence, and you have to wade through all the
> possibilities the hard way.
> 

hmm, that's true.  A quick grep on a 14 meg mailing list archive reveals
37624 lines containing "is" and 32608 lines containing "the".  That would
be a lot of "hits" to search through linearly.  (I guess stop words would
be essential with this approach...)
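
(For reference, the kind of context lookup being talked about above would
look roughly like this -- only a sketch, the shape of the index is made up --
and it shows the linear scan over every hit of a common middle word:)

# Sketch of the prev/next-context scheme discussed above.  Assumes a
# (made up) index of the form:  word => [ [prev, next, url_id], ... ]
# Every interior word of the phrase is checked against all of its entries,
# which is the linear scan that hurts for common words.
sub phrase_docs {
    my ($index, @phrase) = @_;
    my %docs;
    for my $i (1 .. $#phrase - 1) {
        my %here;
        for my $entry (@{ $index->{ $phrase[$i] } || [] }) {
            my ($prev, $next, $url_id) = @$entry;
            $here{$url_id} = 1
                if $prev eq $phrase[$i - 1] && $next eq $phrase[$i + 1];
        }
        # Keep only documents that pass every interior word's check.
        if ($i == 1) { %docs = %here }
        else { %docs = map { $_ => 1 } grep { $here{$_} } keys %docs }
        return () unless %docs;
    }
    return keys %docs;   # phrases shorter than three words need their own case
}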

> > > Could you instead store a list of positions within
> > > the document where each word appears?  Then after looking up the
> > > potential matches containing all the words, discard the ones where
> > > you can't assemble a consecutive list of position numbers matching
> > > the words in the phrase.
> > 
> > You could, but I think you might have some lookup efficiency problems.  If
> > I had a gdbm file of words for keys and kept the absolute "location" (some
> > sort of number) then the only way to find the next word would be to
> > linearly search through all the results looking for the next position..
> > If you had an additional index of positions to words, then you might have
> > something... Have you tried this "location" approach before? 
>
> You would look up the words first.  There would be only one instance
> of the word indexed per document so it can be a gdbm key.  Part of the value
> data would be the list of locations.

ok, so the data structure would be something like the following:

[Word DB]
<Word>                  : <Word ID>

[Url DB]
<Url ID>                : <Url>

[Doc DB]
<Word ID>               : <Url ID>[<Url ID>...]

[Location DB]
<Word ID><Url ID>       : <Location>[<Location>...]
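
Something like this rough perl sketch, maybe (the file names, the
comma-separated value packing, and the "<word id>,<url id>" key format for
the Location DB are all just guesses):

use strict;
use GDBM_File;

# Tie the four files sketched above (names are placeholders; the index
# itself would be built elsewhere with GDBM_WRCREAT).
tie my %word_db, 'GDBM_File', 'word.db', &GDBM_READER, 0644 or die "word.db: $!";
tie my %url_db,  'GDBM_File', 'url.db',  &GDBM_READER, 0644 or die "url.db: $!";
tie my %doc_db,  'GDBM_File', 'doc.db',  &GDBM_READER, 0644 or die "doc.db: $!";
tie my %loc_db,  'GDBM_File', 'loc.db',  &GDBM_READER, 0644 or die "loc.db: $!";

# Return ( url => [ positions... ] ) for every document containing $word.
sub lookup_locations {
    my ($word) = @_;
    my $word_id = $word_db{$word};
    return () unless defined $word_id;
    my %locs;
    for my $url_id (split /,/, $doc_db{$word_id} || '') {
        $locs{ $url_db{$url_id} } =
            [ sort { $a <=> $b } split /,/, $loc_db{"$word_id,$url_id"} || '' ];
    }
    return %locs;
}

That way a query only touches the words in the phrase plus the documents
they actually appear in, rather than scanning hit lists.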

> To find a phrase you would
> find the first word, then the second, then check to see if you can
> match a sequence where the location of the second word follows the
> first one, then repeat for the rest of the words in the phrase discarding
> the possible paths as soon as you see they can't work.   

yeah, that's cool.  My only concern is that every unique word in the
documents will need its own key (albeit a small key and a smallish value).
I'll try this with perl tonight to see how fast the lookup is and how much
disk space the index uses.
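
To make the path checking concrete, here is a minimal sketch of the
adjacency test within a single document, assuming each word's position
list for that document has already been pulled out of the Location DB
(e.g. with something like the lookup_locations sketch above):

# True if the document contains the phrase as consecutive words.
# Arguments: one array ref per word in the phrase, each holding that
# word's positions within the document.
sub has_phrase {
    my @word_locs = @_;
    # Every occurrence of the first word is a candidate starting point.
    my %starts = map { $_ => 1 } @{ $word_locs[0] };
    for my $i (1 .. $#word_locs) {
        my %here = map { $_ => 1 } @{ $word_locs[$i] };
        # Keep a start only if word $i appears exactly $i positions later.
        %starts = map { $_ => 1 } grep { $here{ $_ + $i } } keys %starts;
        return 0 unless %starts;   # every candidate path has died off
    }
    return %starts ? 1 : 0;
}

# e.g. "one two three four" at positions 10..13:
# has_phrase([10, 42], [11], [12], [13]) returns 1.

Throwing away %starts as soon as it empties is the "discarding the possible
paths as soon as you see they can't work" part.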

> I haven't
> tried this but I would expect the computation of possible paths
> to happen much faster than additional disk lookups.  Plus, it allows
> 'near' and 'not near' searches with a variation in the math.
>

The location indexing sounds like the way to go... I'll let you know how
the tests turn out.
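
For what it's worth, the 'near' variation would just relax the exact
adjacency to a window (the window size and the helper below are made up,
and 'not near' is the negation):

# True if any occurrence of word A falls within $window positions of any
# occurrence of word B in the same document.
sub is_near {
    my ($locs_a, $locs_b, $window) = @_;
    for my $pos_a (@$locs_a) {
        for my $pos_b (@$locs_b) {
            return 1 if abs($pos_a - $pos_b) <= $window;
        }
    }
    return 0;
}

# 'not near' is just  !is_near($locs_a, $locs_b, $window)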

:jason

[EMAIL PROTECTED]

