PositionLengthAttribute is sufficient to "express" the true graph, but SynonymFilter has not been fully fixed to properly set it. Specifically, it cannot "create new positions", which is what's necessary if you "expand" when applying synonyms (e.g., dns -> domain name service).
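To make the "create new positions" point concrete, here is a minimal sketch in plain Python, not Lucene code. It assumes a simplified token model of (term, position increment, position length), loosely mirroring PositionIncrementAttribute and PositionLengthAttribute. The helper name token_graph_paths, the sample input "dns lookup", and both token streams are hypothetical, chosen only for illustration; the flattened stream is not a claim about SynonymFilter's exact output.

```python
from collections import defaultdict

def token_graph_paths(tokens):
    """Enumerate every term sequence through a token graph.

    Each token is (term, position_increment, position_length).
    A token starts at the running position and ends position_length
    nodes later, so it is an edge from one node to another.
    """
    edges = defaultdict(list)  # start node -> [(term, end node)]
    pos = -1
    last = 0
    for term, pos_inc, pos_len in tokens:
        pos += pos_inc
        edges[pos].append((term, pos + pos_len))
        last = max(last, pos + pos_len)

    def walk(node):
        # Reaching the final node completes one path.
        if node == last:
            yield []
            return
        for term, end in edges[node]:
            for rest in walk(end):
                yield [term] + rest

    return sorted(" ".join(path) for path in walk(0))

# Hypothetical flattened stream for "dns lookup" with dns expanded to
# "domain name service" but no new positions created: the extra synonym
# tokens are stacked on the positions of the tokens that follow.
flat = [
    ("dns", 1, 1), ("domain", 0, 1),
    ("lookup", 1, 1), ("name", 0, 1),
    ("service", 1, 1),
]

# The same expansion with new positions: "dns" spans three nodes
# (position length 3) and "lookup" moves to a later position.
graph = [
    ("dns", 1, 3), ("domain", 0, 1),
    ("name", 1, 1),
    ("service", 1, 1),
    ("lookup", 1, 1),
]

print(token_graph_paths(flat))
print(token_graph_paths(graph))
```

Under these assumptions the flattened stream admits mixed sequences such as "dns name service", while the graph with new positions admits only "dns lookup" and "domain name service lookup".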
It is better to do the reverse: map the multi-word phrase down to a single token at indexing time (domain name service -> dns). You get accurate scoring (exact docFreq for how many docs have either dns or "domain name service") and faster search performance, PosLenAtt is properly set, and you work around the fact that the index cannot store the position length att (since you never create alternate "paths" in the token graph). The downside is you must re-index if you change your synonyms.

However: once we fix QueryParser to stop splitting on whitespace (it's really ridiculous that it does so: it causes so many problems), and fix SynFilter to "create positions", it is in theory possible to take the resulting graph (if you "expand" when applying synonyms) and enumerate the "correct" query (something like MultiPhraseQuery, or OR of them, or something; maybe we'll need WordGraphQuery), and get the correct results.

Mike McCandless

http://blog.mikemccandless.com

On Fri, Jan 25, 2013 at 11:27 AM, Jack Krupansky <j...@basetechnology.com> wrote:
> One clarification from my previous comment: One requirement is to prevent
> false matches for instances of "heart infarction" and "myocardial attack" -
> the current synonym filter does not preserve the "path" or term ordering
> within the multi-term phrases. Even if the query parser does present the
> full term sequence as a single input string.
>
> Yes, the position information is preserved, but there is no "path" attribute
> to be able to tell that "heart" was before "attack" as opposed to before
> "infarction".
>
> -- Jack Krupansky
>
> -----Original Message----- From: Robert Muir
> Sent: Friday, January 25, 2013 9:47 AM
> To: dev@lucene.apache.org
> Subject: Re: Fixing query-time multi-word synonym issue
>
> On Fri, Jan 25, 2013 at 9:19 AM, Jack Krupansky <j...@basetechnology.com>
> wrote:
>>
>> Here's an example query with q.op=AND:
>>
>> causes of heart attack
>>
>> And I have this synonym definition:
>>
>> heart attack, myocardial infarction
>>
>> So, what is the alleged query parser fix so that the query is treated as:
>>
>> causes of ("heart attack" OR "myocardial infarction")
>>
>
> That's actually inefficient and stupid to do. If you make a parser that
> doesn't split on whitespace, you can just tell it to fold at index and
> query time just like stemming. No OR necessary.
>
> But I think you are trying to get off topic, again the real problem
> affecting 99%+ users is that the lucene queryparser splits on
> whitespace.
>
> If this is fixed, then lots of things (not just synonyms, but other
> basic shit that is broken today) start working too:
> https://issues.apache.org/jira/browse/LUCENE-2605
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org