> On Wed, Apr 22, 2009 at 5:12 AM, Earwin Burrfoot <ear...@gmail.com> wrote:
>
>> Your synonyms will break if you try searching for phrases.
>> Building on your example, "food place in new york" will find nothing,
>> because 'place' and 'in' share the same position.
>
> It'd be great to get multi-word synonyms fully working...
>
> How would you change how Lucene indexes token positions to do this
> "correctly"?

You need the ability to put two tokens at the same position, with
different posIncrements.

One variant, off the top of my head, is to introduce a notion of span,
so a token becomes (text, span, incr):
(restaurant, 1, 0), (food, 0, 1), (place, 0, 1), (in, 0, 1),
(new, 0, 1), (york, 0, 1)

The span affects the distance calculated between this term and any
term that follows it. E.g. dist(food, in) = 2, because both food and
place have incr=1. But although restaurant and food have the same start
position, dist(restaurant, in) = 1, because restaurant spans an
additional position.
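To make the arithmetic concrete, here is a rough sketch of that distance
calculation. It is only my reading of the example, not anything Lucene
does: the Token class and the start_positions and dist helpers are
made-up names, and I assume incr is applied after the token (which is
what the numbers above imply, and is not how Lucene's positionIncrement
works, since that one is relative to the previous token).

```python
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    span: int  # extra positions this token covers beyond its own
    incr: int  # positions to advance before the next token starts (assumed)


# Token stream from the example above.
TOKENS = [
    Token("restaurant", 1, 0),
    Token("food", 0, 1),
    Token("place", 0, 1),
    Token("in", 0, 1),
    Token("new", 0, 1),
    Token("york", 0, 1),
]


def start_positions(tokens):
    """Map each token text to its start position."""
    positions, pos = {}, 0
    for tok in tokens:
        positions[tok.text] = pos
        pos += tok.incr  # incr moves the cursor for the *next* token
    return positions


def dist(tokens, first, second):
    """Distance from `first` to a later `second`, shortened by first's span."""
    positions = start_positions(tokens)
    spans = {tok.text: tok.span for tok in tokens}
    return positions[second] - positions[first] - spans[first]


print(dist(TOKENS, "food", "in"))        # 2
print(dist(TOKENS, "restaurant", "in"))  # 1
```

The sketch assumes every term occurs once in the stream, which is enough
for the example but obviously not for a real index.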

With something like that I think it is possible to formulate an
algorithm for indexing and query rewriting that does "correct"
multiword synonyms.
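
As a hedged illustration of what the matching side might look like, the
sketch below checks an exact phrase by requiring that each term starts
exactly one position after the previous term ends, where "ends" takes
the span into account. The helper names (index_positions,
phrase_matches) are hypothetical, incr is assumed to apply after the
token, and each term is assumed to occur only once.

```python
# (text, span, incr) tuples, as in the example stream.
TOKENS = [
    ("restaurant", 1, 0),
    ("food", 0, 1),
    ("place", 0, 1),
    ("in", 0, 1),
    ("new", 0, 1),
    ("york", 0, 1),
]


def index_positions(tokens):
    """Return {text: (start, end)} where end = start + span."""
    table, pos = {}, 0
    for text, span, incr in tokens:
        table[text] = (pos, pos + span)
        pos += incr  # assumed: incr advances the cursor for the next token
    return table


def phrase_matches(tokens, phrase):
    """True if consecutive phrase terms are adjacent once spans are counted."""
    table = index_positions(tokens)
    words = phrase.split()
    if any(word not in table for word in words):
        return False
    return all(
        table[nxt][0] - table[prev][1] == 1
        for prev, nxt in zip(words, words[1:])
    )


print(phrase_matches(TOKENS, "food place in new york"))  # True
print(phrase_matches(TOKENS, "restaurant in new york"))  # True
print(phrase_matches(TOKENS, "food in new york"))        # False
```

Both the original wording and the synonym match as exact phrases, while
a phrase that skips "place" does not.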

Right now I cheat when rewriting a query. If my syngroup is part of
a phrase, and I know that this syngroup has longer phrases than the
one currently detected, I do a span or sloppy phrase query. That
works, but could in theory match the wrong document.
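
A toy model of why the cheat can misfire, assuming ordinary indexing
where the synonym sits at the position of the first word it replaces.
The sloppy_match function is a deliberately simplified stand-in for a
sloppy phrase query (in-order terms, total extra gap bounded by slop,
one occurrence per term); it is not Lucene's actual algorithm, and the
two documents are invented for illustration.

```python
def sloppy_match(positions, words, slop):
    """Simplified sloppy phrase check over a {term: position} mapping."""
    if any(word not in positions for word in words):
        return False
    # Extra positions between each pair of consecutive query terms.
    gaps = [positions[b] - positions[a] - 1 for a, b in zip(words, words[1:])]
    return all(gap >= 0 for gap in gaps) and sum(gaps) <= slop


# "food place in new york" with "restaurant" injected at position 0.
synonym_doc = {"restaurant": 0, "food": 0, "place": 1, "in": 2, "new": 3, "york": 4}
# "restaurant open in new york": unrelated extra word at position 1.
other_doc = {"restaurant": 0, "open": 1, "in": 2, "new": 3, "york": 4}

query = ["restaurant", "in", "new", "york"]

print(sloppy_match(synonym_doc, query, slop=0))  # False: exact phrase fails
print(sloppy_match(synonym_doc, query, slop=1))  # True: the cheat works
print(sloppy_match(other_doc, query, slop=1))    # True: but so does this one
```

The slop that lets the query reach past "place" also lets it reach past
any other word in that slot, which is the wrong-document case.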

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785
