PositionLengthAttribute is sufficient to "express" the true graph, but
SynonymFilter has not been fully fixed to properly set it.
Specifically, it cannot "create new positions", which is what's
necessary if you "expand" when applying synonyms (e.g., dns -> domain
name service).
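
To make the "create new positions" point concrete, here is a rough sketch in plain Python (not the Lucene API). The Token class, the to_arcs helper, and the "dns lookup" input are all made up for illustration; pos_inc and pos_len only mimic what PositionIncrementAttribute and PositionLengthAttribute carry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    term: str
    pos_inc: int      # position increment relative to the previous token
    pos_len: int = 1  # how many positions this token spans


def to_arcs(tokens):
    """Turn a token stream into (term, from_node, to_node) graph arcs."""
    arcs = []
    pos = -1
    for tok in tokens:
        pos += tok.pos_inc
        arcs.append((tok.term, pos, pos + tok.pos_len))
    return arcs


# Hypothetical input "dns lookup": two tokens, two positions.
original = [Token("dns", 1), Token("lookup", 1)]

# What a correct expansion of dns -> domain name service would look like:
# "dns" spans three positions, and "lookup" has to move from node 1 to
# node 3. Positions 1 and 2 did not exist in the original stream.
expanded = [
    Token("dns", 1, pos_len=3),
    Token("domain", 0),
    Token("name", 1),
    Token("service", 1),
    Token("lookup", 1),
]

for arc in to_arcs(expanded):
    print(arc)
```

In this model, a filter that can only stack tokens on positions that already exist has nowhere to put "name" and "service", which is the limitation described above.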

It is better to do the reverse: map the multi-word phrase down to a
single token at indexing time (domain name service -> dns). You get
accurate scoring (an exact docFreq for how many docs have either dns or
"domain name service") and faster search performance, PosLenAtt is
properly set, and you work around the fact that the index cannot store
the position length att (since you never create alternate "paths" in
the token graph).  The downside is you must re-index if you change
your synonyms.
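
A minimal sketch of that contraction, again in plain Python rather than Lucene code. The contract function, the RULES table, and the three sample documents are hypothetical; the greedy longest-match strategy is my assumption, not a description of how SynonymFilter matches.

```python
def contract(words, rules):
    """Replace each multi-word phrase in `rules` with its single-token form.

    Greedy: at each position, try the longest phrase first.
    """
    max_len = max((len(phrase) for phrase in rules), default=0)
    out = []
    i = 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            key = tuple(words[i:i + n])
            if key in rules:
                out.append(rules[key])
                i += n
                break
        else:
            # No rule matched at this position: keep the word as-is.
            out.append(words[i])
            i += 1
    return out


# Hypothetical rule and documents.
RULES = {("domain", "name", "service"): "dns"}
docs = [
    "dns lookup failed",
    "the domain name service is down",
    "name service only",
]

# Apply the same mapping at index time and at query time.
indexed = [contract(d.split(), RULES) for d in docs]
query = contract("domain name service".split(), RULES)

# One term, one postings list: docFreq covers both spellings exactly.
doc_freq = sum(query[0] in doc for doc in indexed)
print(query, doc_freq)
```

Since every document and every query ends up with the single token, no alternate paths are ever created, and the third document (which only contains "name service") does not match.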

However: once we fix QueryParser to stop splitting on whitespace (it's
really ridiculous that it does so: it causes so many problems), and
fix SynFilter to "create positions", it is in theory possible to take
the resulting graph (if you "expand" when applying synonyms) and
enumerate the "correct" query (something like MultiPhraseQuery, or OR
of them, or something; maybe we'll need WordGraphQuery), and get the
correct results.
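
Here is a sketch of what "enumerate the correct query" could mean, using the heart attack example from the quoted thread below. This is plain Python, not Lucene: enumerate_paths and both arc lists are hypothetical, and the node numbering in graph_arcs is just one possible layout that keeps the two phrases on separate paths.

```python
def enumerate_paths(arcs, start, end):
    """Return every term sequence that leads from `start` to `end`."""
    outgoing = {}
    for term, src, dst in arcs:
        outgoing.setdefault(src, []).append((term, dst))

    paths = []

    def walk(node, prefix):
        if node == end:
            paths.append(" ".join(prefix))
            return
        for term, dst in outgoing.get(node, ()):
            walk(dst, prefix + [term])

    walk(start, [])
    return paths


# Synonyms stacked on existing positions: both phrases share node 3,
# so the path information is lost.
flat_arcs = [
    ("causes", 0, 1), ("of", 1, 2),
    ("heart", 2, 3), ("myocardial", 2, 3),
    ("attack", 3, 4), ("infarction", 3, 4),
]

# With an extra position, each phrase gets its own intermediate node.
graph_arcs = [
    ("causes", 0, 1), ("of", 1, 2),
    ("myocardial", 2, 3), ("heart", 2, 4),
    ("infarction", 3, 5), ("attack", 4, 5),
]

flat_paths = enumerate_paths(flat_arcs, 0, 4)
graph_paths = enumerate_paths(graph_arcs, 0, 5)

# One phrase per path, OR'ed together.
print(" OR ".join('"%s"' % p for p in graph_paths))
```

The flat layout yields four paths, including "causes of heart infarction" and "causes of myocardial attack"; the graph layout yields only the two intended phrases.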

Mike McCandless

http://blog.mikemccandless.com

On Fri, Jan 25, 2013 at 11:27 AM, Jack Krupansky
<j...@basetechnology.com> wrote:
> One clarification from my previous comment: One requirement is to prevent
> false matches for instances of "heart infarction" and "myocardial attack" -
> the current synonym filter does not preserve the "path" or term ordering
> within the multi-term phrases, even if the query parser does present the
> full term sequence as a single input string.
>
> Yes, the position information is preserved, but there is no "path" attribute
> to be able to tell that "heart" was before "attack" as opposed to before
> "infarction".
>
>
> -- Jack Krupansky
>
> -----Original Message----- From: Robert Muir
> Sent: Friday, January 25, 2013 9:47 AM
>
> To: dev@lucene.apache.org
> Subject: Re: Fixing query-time multi-word synonym issue
>
> On Fri, Jan 25, 2013 at 9:19 AM, Jack Krupansky <j...@basetechnology.com>
> wrote:
>>
>> Here's an example query with q.op=AND:
>>
>>    causes of heart attack
>>
>> And I have this synonym definition:
>>
>>    heart attack, myocardial infarction
>>
>> So, what is the alleged query parser fix so that the query is treated as:
>>
>>    causes of ("heart attack" OR "myocardial infarction")
>>
>
> That's actually inefficient and stupid to do. If you make a parser that
> doesn't split on whitespace, you can just tell it to fold at index and
> query time just like stemming. No OR necessary.
>
> But I think you are trying to get off topic. Again, the real problem
> affecting 99%+ of users is that the Lucene queryparser splits on
> whitespace.
>
> If this is fixed, then lots of things (not just synonyms, but other
> basic shit that is broken today) start working too:
> https://issues.apache.org/jira/browse/LUCENE-2605
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org

