Re: Fixing query-time multi-word synonym issue
Ha! Right, it's not about synonyms, it's about the QP doing the wrong thing.

At this point I'm not even 100% sure which QP is the problem. I think it's the Lucene one, but also the SolrQueryParser. Is this correct?

But isn't there a flex parser in Lucene contrib somewhere? What is the state of that one? I could be mistaken, but it seems like that QP was written, stuffed into queryparser/, and is never used anywhere in Solr or elsewhere. Is this correct or am I missing something? I'm referring to the Flexible query parser mentioned on http://lucene.apache.org/core/4_2_0/queryparser/overview-summary.html (via https://issues.apache.org/jira/browse/LUCENE-1567 ).

I did not try it to see if it has the same problem with multi-word synonyms, but I remember the authors being quite ambitious about it. Does anyone happen to know?

Thanks,
Otis

On Thu, Jan 24, 2013 at 8:17 PM, Robert Muir rcm...@gmail.com wrote:
> On Thu, Jan 24, 2013 at 8:10 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:
>> Funny, I expected everyone would jump on this thread considering so many people hit this multi-word synonym issue...
>
> Probably because it's not a synonyms problem: it's just a bug in a specific queryparser. If you don't use that queryparser, you don't care: if you want to fix that one, well the only correct solution is obvious, fix the queryparser not to split on whitespace. Hacking other things around it is no option at all :)

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Fixing query-time multi-word synonym issue
On Fri, Jan 25, 2013 at 6:28 PM, Jack Krupansky j...@basetechnology.com wrote:
> Is there a decent writeup on PositionLengthAttribute? I mean, the Javadoc says "The positionLength determines how many positions this token spans", which doesn't sound very relevant to multi-term synonyms that span multiple positions.

I don't think we have good enough javadocs around PosLenAtt ... we really should add some simple graph examples. Though, it is a very expert topic ...

The way to think of PosIncAtt is that it tells you the start node of this arc (= token), while PosLenAtt tells you the end node, except both atts are delta coded, making it hard to think about.

Furthermore, all arcs (tokens) must be emitted in order, i.e. smaller numbered nodes (positions) must come first, and all arcs leaving each node must be enumerated before you go to the next node.

Mike McCandless
http://blog.mikemccandless.com
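The delta coding described above can be sketched in a few lines. This is only a toy model in Python, not Lucene code: the `Token` tuple and the `to_arcs` helper are hypothetical names, and the starting position of -1 is an assumption chosen so that a first token with a position increment of 1 lands on node 0.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    term: str
    pos_inc: int  # delta from the previous token's start node
    pos_len: int  # delta from this token's start node to its end node


def to_arcs(tokens: List[Token]) -> List[Tuple[int, int, str]]:
    """Decode delta-coded tokens into (start_node, end_node, term) arcs."""
    arcs = []
    pos = -1  # assumption: the first token has pos_inc=1 and lands on node 0
    for tok in tokens:
        pos += tok.pos_inc
        arcs.append((pos, pos + tok.pos_len, tok.term))
    return arcs


def is_ordered(tokens: List[Token]) -> bool:
    """Start nodes never go backwards, so every increment must be >= 0."""
    return all(tok.pos_inc >= 0 for tok in tokens)


# "domain name service" with the single-token synonym "dns" stacked on "domain".
tokens = [
    Token("domain", 1, 1),
    Token("dns", 0, 3),  # same start node as "domain", spans three positions
    Token("name", 1, 1),
    Token("service", 1, 1),
]
arcs = to_arcs(tokens)
for start, end, term in arcs:
    print(f"{start} -> {end}: {term}")
```

Read this way, the "dns" arc runs from node 0 to node 3 in parallel with the three arcs for "domain", "name" and "service", which is the graph a consumer would have to reconstruct from the linear token stream.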
Re: Fixing query-time multi-word synonym issue
Yeah, I suspected that it was going to be expert/cryptic.

I think the real point, from my September proposal, was that we need a common piece of code that has all that expert smarts to reconstruct the graph and then can generate the Lucene Query structure that will match that reconstructed graph. It would also be great to have clear, detailed doc of the linear graph encoding, but it is the code to reconstitute the graph that is the real goal.

And, of course, we need Synonym filter to do the full graph encoding. Last time I looked (September), the PosLenAtt didn't seem to have a value that I could make sense of in terms of the graph (I think it always had the same value), but maybe I just wasn't interpreting it according to the undocumented rules.

I think a fair number of people want to be able to do query-time synonyms, so I wouldn't classify that as an expert use case, but I agree that dealing with graph encoding/decoding is more of an expert chore. Although, in the end, it is really just a small number of people who work on query parsers who actually have a need to know about synonym graphs.

-- Jack Krupansky
Re: Fixing query-time multi-word synonym issue
On Sat, Jan 26, 2013 at 11:16 AM, Jack Krupansky j...@basetechnology.com wrote:
> Yeah, I suspected that it was going to be expert/cryptic. I think the real point, from my September proposal, was that we need a common piece of code that has all that expert smarts to reconstruct the graph and then can generate the Lucene Query structure that will match that reconstructed graph.

That would be nice to have. Maybe a possible starting point is TokenStreamToAutomaton? The only catch is that this creates an Automaton with bytes as the labels, but we (somehow) need tokens/terms as the labels. Then we just need AutomatonQuery ;)

> It would also be great to have clear, detailed doc of the linear graph encoding, but it is the code to reconstitute the graph that is the real goal.

+1

> And, of course, we need Synonym filter to do the full graph encoding. Last time I looked (September), the PosLenAtt didn't seem to have a value that I could make sense of in terms of the graph (I think it always had the same value), but maybe I just wasn't interpreting it according to the undocumented rules.

At some point it was changed to set PosLenAtt, but only if the output is a single token (e.g., domain name service -> dns; in this case the dns token will have PosLenAtt=3).

Creating new positions (dns -> domain name service) is a harder change; normally it's only Tokenizers that create positions. But e.g. WordDelimiterFilter also needs to create positions. It would be nice to have some supporting infrastructure to make it easier for token filters to create new positions; the trickiness is it requires the token filter to be non-deterministic in general because the pos len of a given token can be a function of whether future tokens are split (creating new positions).

> I think a fair number of people want to be able to do query-time synonyms, so I wouldn't classify that as an expert use case, but I agree that dealing with graph encoding/decoding is more of an expert chore.

Right, it would be nice to support both, and since query-time could in theory work correctly (unlike index time since we can't index the graph), it's perhaps the more compelling / easier route for apps that really need exact/expanding multi-word synonyms. But we have to fix the QueryParsers first: indexing already cannot properly encode the graph, and query-time cannot get past the QueryParser ...

> Although, in the end, it is really just a small number of people who work on query parsers who actually have a need to know about synonym graphs.

True.

Mike McCandless
http://blog.mikemccandless.com
Re: Fixing query-time multi-word synonym issue
Here's an example query with q.op=AND:

causes of heart attack

And I have this synonym definition:

heart attack, myocardial infarction

So, what is the alleged query parser fix so that the query is treated as:

causes of (heart attack OR myocardial infarction)

The core problem with the synonym filter is that it mashes all the terms of a multi-term synonym to be at the same position so that the order (attack after heart and infarction after myocardial) is lost. What is needed is a synonym filter with a notion of path so the term sequences for each of the synonym alternatives are available for the query parser to generate the OR alternative queries.

Granted, the query parser ALSO needs to present the full sequence of terms to the analyzer as one string ("causes of heart attack"), but that alone doesn't address the synonym filter misbehavior.

-- Jack Krupansky

-Original Message-
From: Robert Muir
Sent: Friday, January 25, 2013 3:46 AM
To: dev@lucene.apache.org
Subject: Re: Fixing query-time multi-word synonym issue

On Fri, Jan 25, 2013 at 12:48 AM, Jack Krupansky j...@basetechnology.com wrote:
> Otis, this is precisely why nothing will get done any time soon on the multi-term synonym issue - there isn't even common agreement that there is a problem, let alone common agreement on the specifics of the problem, let alone common agreement on a solution.

I think you are the only one arguing the bug is a synonymsfilter problem.

> Even though technically the Solr Query Parser is now separate from the Lucene Query Parser, the synonym filter is still strictly Lucene. Addressing the multi-term synonym feature requires enhancement to the synonym filter,

Dude, you have a bug in X, you fix the bug in X: you don't go hack around it in Y.
Re: Fixing query-time multi-word synonym issue
On Fri, Jan 25, 2013 at 9:19 AM, Jack Krupansky j...@basetechnology.com wrote:
> Here's an example query with q.op=AND:
>
> causes of heart attack
>
> And I have this synonym definition:
>
> heart attack, myocardial infarction
>
> So, what is the alleged query parser fix so that the query is treated as:
>
> causes of (heart attack OR myocardial infarction)

That's actually inefficient and stupid to do. If you make a parser that doesn't split on whitespace, you can just tell it to fold at index and query time just like stemming. No OR necessary.

But I think you are trying to get off topic. Again, the real problem affecting 99%+ of users is that the Lucene queryparser splits on whitespace. If this is fixed, then lots of things (not just synonyms, but other basic shit that is broken today) start working too: https://issues.apache.org/jira/browse/LUCENE-2605
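To make the whitespace-splitting point concrete, here is a rough Python sketch, not Lucene or Solr code. The `FOLD` table, the folded token name `heart_attack`, and the `analyze`, `parse_splitting` and `parse_whole` helpers are all hypothetical. It assumes a toy analyzer that folds known multi-word phrases down to one token, and compares a parser that hands the analyzer one whitespace-separated word at a time with one that hands it the whole string.

```python
from typing import Dict, List, Tuple

# Hypothetical synonym table: each multi-word phrase folds to one token.
FOLD: Dict[Tuple[str, ...], str] = {
    ("heart", "attack"): "heart_attack",
    ("myocardial", "infarction"): "heart_attack",
}


def analyze(text: str) -> List[str]:
    """Toy analyzer: lowercase, tokenize, fold known phrases (longest first)."""
    words = text.lower().split()
    max_len = max(len(key) for key in FOLD)
    out: List[str] = []
    i = 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            key = tuple(words[i:i + n])
            if key in FOLD:
                out.append(FOLD[key])
                i += n
                break
        else:
            # No phrase starts here, so keep the word unchanged.
            out.append(words[i])
            i += 1
    return out


def parse_splitting(query: str) -> List[str]:
    """Splits on whitespace first, so the analyzer only sees single words."""
    return [tok for word in query.split() for tok in analyze(word)]


def parse_whole(query: str) -> List[str]:
    """Hands the whole query string to the analyzer in one step."""
    return analyze(query)


print(parse_splitting("causes of heart attack"))
print(parse_whole("causes of heart attack"))
print(analyze("treating myocardial infarction"))
```

In this model the splitting parser never lets the analyzer see "heart" and "attack" together, so the fold cannot fire. With the whole string passed through, the query and a document mentioning "myocardial infarction" both end up with the same folded token and no OR is needed.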
Re: Fixing query-time multi-word synonym issue
One clarification from my previous comment: one requirement is to prevent false matches for instances of "heart infarction" and "myocardial attack". The current synonym filter does not preserve the path or term ordering within the multi-term phrases, even if the query parser does present the full term sequence as a single input string. Yes, the position information is preserved, but there is no path attribute to be able to tell that "heart" was before "attack" as opposed to before "infarction".

-- Jack Krupansky
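The false-match concern can be illustrated with a small Python model. It is only a sketch: the `stacked` and `paths` structures are hypothetical stand-ins for what a position-only encoding and a path-aware encoding would carry, not the actual output of any Lucene filter.

```python
from itertools import product
from typing import Iterable, List, Set, Tuple

# Position-only view: terms stacked at each position, path information lost.
stacked: List[Set[str]] = [{"heart", "myocardial"}, {"attack", "infarction"}]

# Path-aware view: each synonym alternative kept as its own term sequence.
paths: List[Tuple[str, ...]] = [("heart", "attack"), ("myocardial", "infarction")]


def phrases_from_positions(positions: Iterable[Set[str]]) -> Set[str]:
    """Every combination of one term per position is indistinguishable."""
    return {" ".join(combo) for combo in product(*positions)}


def phrases_from_paths(alternatives: Iterable[Tuple[str, ...]]) -> Set[str]:
    """Only the sequences that were actually defined as synonyms."""
    return {" ".join(path) for path in alternatives}


false_matches = phrases_from_positions(stacked) - phrases_from_paths(paths)
print(sorted(false_matches))
```

With terms merely stacked by position, four phrases are admitted, and the two extra ones are exactly the "heart infarction" and "myocardial attack" cases mentioned above.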
Re: Fixing query-time multi-word synonym issue
PositionLengthAttribute is sufficient to express the true graph, but SynonymFilter has not been fully fixed to properly set it. Specifically, it cannot create new positions, which is what's necessary if you expand when applying synonyms (e.g., dns -> domain name service).

It is better to do the reverse: map the multi-word phrase down to a single token, at indexing time (domain name service -> dns): you get accurate scoring (exact docFreq for how many docs have either dns or domain name service) and faster search performance, and PosLenAtt is properly set, and you work around the fact that the index cannot index the position length att (since you never create alternate paths in the token graph). The downside is you must re-index if you change your synonyms.

However: once we fix QueryParser to stop splitting on whitespace (it's really ridiculous that it does so: it causes so many problems), and fix SynFilter to create positions, it is in theory possible to take the resulting graph (if you expand when applying synonyms) and enumerate the correct query (something like MultiPhraseQuery, or OR of them, or something; maybe we'll need WordGraphQuery), and get the correct results.

Mike McCandless
http://blog.mikemccandless.com
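As a rough illustration of "take the resulting graph and enumerate the correct query", the Python sketch below walks a token graph and lists every path as a phrase. Everything here is an assumption made for illustration: the arc list is a hand-built graph for "causes of heart attack" with the alternative "myocardial infarction" (the kind of graph a position-creating synonym filter might emit), and `enumerate_paths` is a hypothetical helper, not a Lucene API.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Arc = Tuple[int, int, str]  # (start_node, end_node, term)

# Hand-built graph; the two alternatives share nodes 2 and 5 only.
arcs: List[Arc] = [
    (0, 1, "causes"),
    (1, 2, "of"),
    (2, 3, "heart"),
    (2, 4, "myocardial"),
    (3, 5, "attack"),
    (4, 5, "infarction"),
]


def enumerate_paths(graph: List[Arc]) -> List[List[str]]:
    """Depth-first walk from the first node to the last, one list per path."""
    outgoing: Dict[int, List[Tuple[int, str]]] = defaultdict(list)
    for start, end, term in graph:
        outgoing[start].append((end, term))
    first = min(start for start, _, _ in graph)
    last = max(end for _, end, _ in graph)
    found: List[List[str]] = []

    def walk(node: int, prefix: List[str]) -> None:
        if node == last:
            found.append(prefix)
            return
        for end, term in outgoing[node]:
            walk(end, prefix + [term])

    walk(first, [])
    return found


all_paths = enumerate_paths(arcs)
# Render as an OR of phrases; a real parser would build Query objects instead.
query = " OR ".join('"' + " ".join(path) + '"' for path in all_paths)
print(query)
```

Because "heart" only leads to "attack" and "myocardial" only leads to "infarction", the walk produces exactly two phrases. Path enumeration can blow up on graphs with many stacked alternatives, which may be one reason a dedicated graph query was floated instead.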
Re: Fixing query-time multi-word synonym issue
Thanks for that insight, Mike!

I don't think anybody is in disagreement with the need for the query parser to present the full, white-space delimited pseudo-term sequence to analysis in one step. My proposal from last September recognized that.

Is there a decent writeup on PositionLengthAttribute? I mean, the Javadoc says "The positionLength determines how many positions this token spans", which doesn't sound very relevant to multi-term synonyms that span multiple positions.

-- Jack Krupansky
Re: Fixing query-time multi-word synonym issue
Funny, I expected everyone would jump on this thread considering so many people hit this multi-word synonym issue...

Mikhail, I looked at the slides and while I understand that use case well, I don't see where query-time multi-word synonyms come into play. Please enlighten me :)

Thanks!
Otis
Solr ElasticSearch Support
http://sematext.com/
Re: Fixing query-time multi-word synonym issue
On Thu, Jan 24, 2013 at 8:10 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:
> Funny, I expected everyone would jump on this thread considering so many people hit this multi-word synonym issue...

Probably because it's not a synonyms problem: it's just a bug in a specific queryparser. If you don't use that queryparser, you don't care: if you want to fix that one, well the only correct solution is obvious, fix the queryparser not to split on whitespace. Hacking other things around it is no option at all :)
Re: Fixing query-time multi-word synonym issue
Otis, this is precisely why nothing will get done any time soon on the multi-term synonym issue - there isn't even common agreement that there is a problem, let alone common agreement on the specifics of the problem, let alone common agreement on a solution.

Even though technically the Solr Query Parser is now separate from the Lucene Query Parser, the synonym filter is still strictly Lucene. Addressing the multi-term synonym feature requires enhancement to the synonym filter, so without common agreement on the nature of the problem (which I already outlined), the only path forward is for Solr to provide its own synonym filter and leave the existing Lucene synonym filter intact.

That said, I don't expect general community agreement on a path forward for multi-term synonyms any time soon.

-- Jack Krupansky
Re: Fixing query-time multi-word synonym issue
There is one more issue that runs into a similar problem: the queryparser splits on whitespace, so when the field being searched against is multi-word, search doesn't behave correctly.

Ideally these problems should not arise. I want to be a little more optimistic about this and see if there is a solution to these problems :)

-- Regards,
Varun Thacker
http://www.vthacker.in/
Re: Fixing query-time multi-word synonym issue
FWIW, multi-word synonyms are a side benefit of the query parsing approach implemented by my team. Here is what it looks like: https://docs.google.com/a/griddynamics.com/presentation/pub?id=1oifLFI0MiA3ZyXZWisHJVRK13P8cki5yCABvABPObKwstart=falseloop=falsedelayms=3000#slide=id.g1006de00_2_34fee people frequent typo has been substituted to the correct brand. The five previous slides depict the approach; the main idea is to get rid of good old Boolean retrieval and introduce our own notion of matching. I can share more details if you wish.

On Tue, Jan 22, 2013 at 7:17 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:
> Hello,
>
> I'm looking for some guidance around solving the infamous index-time vs. query-time multi-word synonym problem. Looking for help with understanding the pieces and effort involved, and also being on the lookout for any potential "man, it will take you forever, you'll have to do major Lucene surgery" type of warnings.
>
> I never looked deeply into this problem and my understanding is that multi-word synonyms don't work at query-time because QueryParser(?) simply breaks queries on spaces and thus makes it impossible for SynonymTokenFilter (?) to see the non-broken-up token sequence and do synonym expansion. I think this is also documented on the Wiki.
>
> Are there other pieces involved that I didn't mention, but should have?
>
> The following are 3 different efforts I found:
> https://issues.apache.org/jira/browse/LUCENE-4499
> http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/
> http://www.ub.uni-bielefeld.de/~befehl/base/solr/eurovoc.html
>
> Plus Jack's proposal: http://search-lucene.com/m/Zkj0k15dDGP1
>
> Do any of the above approaches sound like the right one, or at least in the right direction, and stand a chance of being accepted?
>
> Thanks,
> Otis

-- Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com