Thanks for that insight, Mike!

I don't think anybody disagrees with the need for the query parser to present the full, whitespace-delimited pseudo-term sequence to analysis in one step. My proposal from last September recognized that.

Is there a decent writeup on PositionLengthAttribute? I mean, the Javadoc says "The positionLength determines how many positions this token spans", which doesn't sound very relevant to multi-term synonyms that span multiple positions.

-- Jack Krupansky

-----Original Message-----
From: Michael McCandless
Sent: Friday, January 25, 2013 4:41 PM
To: dev@lucene.apache.org
Subject: Re: Fixing query-time multi-word synonym issue

PositionLengthAttribute is sufficient to "express" the true graph, but
SynonymFilter has not been fully fixed to properly set it.
Specifically, it cannot "create new positions", which is what's
necessary if you "expand" when applying synonyms (e.g., dns -> domain
name service).
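To make the graph idea concrete, here is a minimal sketch in plain Python (not Lucene code; the Token tuple and to_arcs helper are hypothetical names) of how a position increment plus a position length turn a token stream into arcs between positions. It assumes the dns -> domain name service expansion were emitted the way a fully graph-aware filter would have to emit it, with new positions created for the expanded words.

```python
from collections import namedtuple

# Hypothetical stand-in for a token's attributes; this is not a Lucene class.
Token = namedtuple("Token", ["term", "pos_inc", "pos_len"])

def to_arcs(tokens):
    """Convert (term, position increment, position length) into (from, to, term) arcs."""
    arcs = []
    pos = -1  # position before the first token
    for tok in tokens:
        pos += tok.pos_inc  # an increment of 0 stacks the token on the previous position
        arcs.append((pos, pos + tok.pos_len, tok.term))
    return arcs

# "dns" starts where "domain" starts but spans all three expanded positions.
stream = [
    Token("domain", 1, 1),
    Token("dns", 0, 3),
    Token("name", 1, 1),
    Token("service", 1, 1),
]
arcs = to_arcs(stream)
for arc in arcs:
    print(arc)
```

With a position length of 3, "dns" ends at the same position as "service", so the two alternatives stay separate paths instead of being flattened together.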

It is better to do the reverse: map the multi-word phrase down to a
single token, at indexing time (domain name service -> dns): you get
accurate scoring (exact docFreq for how many docs have either dns or
"domain name service") and faster search performance, and PosLenAtt is
properly set, and you work around the fact that the index cannot index
the position length att (since you never create alternate "paths" in
the token graph).  The downside is you must re-index if you change
your synonyms.
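As a rough illustration of the "map down" approach, the sketch below (plain Python, not a Lucene filter; contract and RULES are made-up names) greedily rewrites a multi-word phrase to its single-token form. The assumption is that the same rewrite runs at both index time and query time, so both sides agree on the single term.

```python
def contract(tokens, rules):
    """Greedily replace each multi-word phrase in `rules` with its single-token form."""
    max_len = max((len(phrase) for phrase in rules), default=0)
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + n])
            if phrase in rules:
                out.append(rules[phrase])
                i += n
                break
        else:
            out.append(tokens[i])  # no rule matched at this offset
            i += 1
    return out

RULES = {("domain", "name", "service"): "dns"}  # illustrative rule set

indexed = contract("the domain name service is down".split(), RULES)
queried = contract("dns".split(), RULES)
print(indexed, queried)
```

Because every token occupies exactly one position after contraction, no alternate paths are created and nothing depends on the index storing a position length.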

However: once we fix QueryParser to stop splitting on whitespace (it's
really ridiculous that it does so: it causes so many problems), and
fix SynFilter to "create positions", it is in theory possible to take
the resulting graph (if you "expand" when applying synonyms) and
enumerate the "correct" query (something like MultiPhraseQuery, or OR
of them, or something; maybe we'll need WordGraphQuery), and get the
correct results.
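A sketch of what "enumerate the correct query" could look like, in plain Python rather than Lucene code: given arcs of the form (from, to, term), walk every path from the first position to the last and OR the resulting phrases. The arc list, the enumerate_paths helper, and the final query string are illustrative assumptions, not an existing API.

```python
def enumerate_paths(arcs, start, end):
    """Return every term sequence that leads from `start` to `end` through the arcs."""
    outgoing = {}
    for src, dst, term in arcs:
        outgoing.setdefault(src, []).append((dst, term))

    paths = []

    def walk(node, prefix):
        if node == end:
            paths.append(prefix)
            return
        for dst, term in outgoing.get(node, []):
            walk(dst, prefix + [term])

    walk(start, [])
    return paths

# Hypothetical graph for the query "dns lookup" after expanding dns.
arcs = [
    (0, 3, "dns"),
    (0, 1, "domain"),
    (1, 2, "name"),
    (2, 3, "service"),
    (3, 4, "lookup"),
]
paths = enumerate_paths(arcs, 0, 4)
query = " OR ".join('"%s"' % " ".join(p) for p in paths)
print(query)
```

Each path becomes one phrase, so no phrase mixes words from different alternatives.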

Mike McCandless

http://blog.mikemccandless.com

On Fri, Jan 25, 2013 at 11:27 AM, Jack Krupansky
<j...@basetechnology.com> wrote:
One clarification from my previous comment: One requirement is to prevent
false matches for instances of "heart infarction" and "myocardial attack".
The current synonym filter does not preserve the "path" or term ordering
within the multi-term phrases, even if the query parser does present the
full term sequence as a single input string.

Yes, the position information is preserved, but there is no "path" attribute
to be able to tell that "heart" was before "attack" as opposed to before
"infarction".
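A small sketch of the problem, in plain Python and assuming the two phrases are simply overlaid position by position (the positions list is an illustrative model, not actual filter output):

```python
from itertools import product

# Terms stacked at each position once "heart attack" and
# "myocardial infarction" are overlaid without any path information.
positions = [
    ["heart", "myocardial"],
    ["attack", "infarction"],
]

# With positions only, any term at position 0 may be followed by any term at position 1.
matches = [" ".join(combo) for combo in product(*positions)]
intended = {"heart attack", "myocardial infarction"}
false_matches = [m for m in matches if m not in intended]
print(false_matches)
```

The two leftover combinations are exactly the crossed phrases that a path-aware representation would rule out.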


-- Jack Krupansky

-----Original Message-----
From: Robert Muir
Sent: Friday, January 25, 2013 9:47 AM
To: dev@lucene.apache.org
Subject: Re: Fixing query-time multi-word synonym issue

On Fri, Jan 25, 2013 at 9:19 AM, Jack Krupansky <j...@basetechnology.com>
wrote:

Here's an example query with q.op=AND:

   causes of heart attack

And I have this synonym definition:

   heart attack, myocardial infarction

So, what is the alleged query parser fix so that the query is treated as:

   causes of ("heart attack" OR "myocardial infarction")


That's actually inefficient and stupid to do. If you make a parser that
doesn't split on whitespace, you can just tell it to fold at index and
query time, just like stemming. No OR necessary.

But I think you are trying to get off topic. Again, the real problem
affecting 99%+ of users is that the Lucene QueryParser splits on
whitespace.
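To illustrate why the whitespace split matters, here is a toy model in plain Python (multiword_hits and SYNONYMS are hypothetical names; no real analyzer is involved). It compares analysis invoked once per word with analysis invoked once over the whole sequence.

```python
SYNONYMS = {("heart", "attack"): ("myocardial", "infarction")}  # illustrative rule

def multiword_hits(tokens):
    """Return the synonym phrases found in one analysis call over `tokens`."""
    hits = []
    for phrase in SYNONYMS:
        n = len(phrase)
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == phrase:
                hits.append(phrase)
    return hits

query = "causes of heart attack"

# Parser splits on whitespace first, so analysis sees one word per call.
split_first = [hit for word in query.split() for hit in multiword_hits([word])]

# Parser hands the whole sequence to analysis in one step.
whole = multiword_hits(query.split())
print(split_first, whole)
```

When each call sees a single word, a two-word rule can never fire, whatever the synonym filter itself is capable of.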

If this is fixed, then lots of things (not just synonyms, but other
basic shit that is broken today) start working too:
https://issues.apache.org/jira/browse/LUCENE-2605

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
