Re: positional token info

Doug Cutting Tue, 21 Oct 2003 10:02:51 -0700

Erik Hatcher wrote:

Just for fun, I've written a simple stop filter that bumps the position increments to account for the stop words removed:

But its practically impossible to formulate a Query that can take advantage of this. A PhraseQuery, because Terms don't have positional info (only the transient tokens), only works using a slop factor which doesn't guarantee an exact match like I'm after. A PhrasePrefixQuery won't work any better as there is no way to add in a "blank" term to indicate a missing position.

The PhraseQuery code predates the setPositionIncrement feature.

You can use your filter to find phrases that don't contain stop words, e.g., when your filter is used, a query for the phrase "phone boy" won't match "phone the boy", as it would with the normal stop filter, but a query for "phone the boy" would also only match "phone boy".

One workaround is to simply not use a stop list. Then "phone boy" will only match "phone boy", and "phone the boy" will only match "phone the boy", and not "phone a boy" too. One can write a query parser which removes stop words unless they're in phrases. This is what Nutch and Google do.

If however you want "phone the boy" to match "phone X boy" where X is any word, then PhraseQuery would have to be extended. It's actually a pretty simple extension. Each term in a PhraseQuery corresponds to a PhrasePositions object. The 'offset' field within this is the position of the term in the phrase. If you construct the phrase positions for a two-term phrase so that the first has offset=0 and the second offset=2, then you'll get this sort of matching. So all that's needed is a new method PhraseQuery.add(Term term, int offset), and for these offsets to be stored so that they can be used when building PhrasePositions. Would this be a useful feature?

Doug


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: positional token info

Reply via email to