One of the ongoing potholes of Solr and Lucene is the lack of full support for multi-word synonyms at query time. The root of the problem is twofold: individual terms are presented for analysis one at a time, which precludes recognition of multi-term synonyms, and the output of the analysis process is a single, linear token stream that carries no information about the graph/lattice structure needed to represent multiple synonyms.
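To illustrate the first half of the problem, here is a minimal sketch in plain Python. It deliberately does not use any Lucene or Solr API; the synonym table and both function names are made up for illustration. It only shows why analyzing one term at a time cannot see a multi-word synonym, while analyzing the whole sequence of adjacent terms can.

```python
# Hypothetical synonym table: a tuple of terms maps to alternative tuples.
# This is an illustration only, not how Lucene stores synonyms.
SYNONYMS = {
    ("heart", "attack"): [("myocardial", "infarction")],
    ("mi",): [("myocardial", "infarction")],
}


def expand_single_terms(terms):
    """Analyze each term in isolation (the term-at-a-time case)."""
    expanded = []
    for term in terms:
        # Only single-term keys can ever match here.
        expanded.append([(term,)] + SYNONYMS.get((term,), []))
    return expanded


def expand_sequence(terms):
    """Analyze the whole sequence of adjacent terms, longest match first."""
    expanded = []
    max_len = max(len(key) for key in SYNONYMS)
    i = 0
    while i < len(terms):
        for n in range(min(max_len, len(terms) - i), 0, -1):
            key = tuple(terms[i:i + n])
            if key in SYNONYMS:
                expanded.append([key] + SYNONYMS[key])
                i += n
                break
        else:
            # No synonym starts at this term; pass it through unchanged.
            expanded.append([(terms[i],)])
            i += 1
    return expanded


query_terms = ["heart", "attack"]
print(expand_single_terms(query_terms))
print(expand_sequence(query_terms))
```

With term-at-a-time analysis, "heart" and "attack" come back unchanged because neither term alone has a synonym entry. Only the sequence-based version yields the "myocardial infarction" alternative.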
I intend to file a Jira, but wanted to get some wide attention and feedback on whether people are ready to finally tackle this ongoing thorn in the side of an otherwise fantastic enterprise search tool.

My proposed solution is fourfold:

1. Add an attribute, call it “path” for now, to the analysis process so that tokens coming out of the analysis in a linear stream can be easily reconstituted into the graph/lattice for multiple synonyms (single or multi-term) at the same position in a token sequence. There could be multiple paths at a position and paths can be nested, possibly using a dot notation such as “1.3.2”. There may be better ways to do this; this is just an initial proposal to get the ball rolling.

2. Add a utility class and method for analysis for query parsers to present a sequence of adjacent terms, rather than a single term at a time, so that multi-word synonyms can be recognized. Query parsers would be expected to present a “term sequence” (a sequence of adjacent terms without intervening operators) at one time.

3. Add a Query generation class and method that can take the graph/lattice for a token sequence containing nested synonym alternatives and generate the appropriate Query structure with BooleanQuery SHOULD or SpanOrQuery to implement synonym alternatives at a given position.

4. Modify the most popular query parsers to use the new analysis/generation.

Obviously there are lots of fine details to resolve. What I wanted to do right now is see if there is general support for pushing forward with such a radical change, say for Lucene and Solr 5.0, or I suppose some 4.x > 4.0. If I get enough support, I’ll file the Jira. Otherwise, I’ll just wait a year and then try again.

I’m not personally committing to do the actual work, but simply to get the ball rolling and keep it rolling. I’ll do work to the extent that nobody else is jumping in first.
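To make items 1 and 3 of the proposal a bit more concrete, here is a rough sketch in plain Python, not Lucene code. Everything in it is an assumption made for illustration: the `Token` tuple, the reading of “path” as a label shared by all tokens of one alternative at a position, and the query-string output standing in for a real BooleanQuery/SpanOrQuery structure. Nested paths such as “1.3.2” are not handled here.

```python
from collections import namedtuple

# Hypothetical token shape: term text, position in the term sequence,
# and the proposed "path" label identifying one alternative at that position.
Token = namedtuple("Token", ["term", "position", "path"])


def rebuild_lattice(tokens):
    """Group a linear token stream into {position: {path: [terms, ...]}}."""
    lattice = {}
    for tok in tokens:
        paths = lattice.setdefault(tok.position, {})
        paths.setdefault(tok.path, []).append(tok.term)
    return lattice


def to_query_string(lattice):
    """Render each position as a single clause or an OR of alternatives."""
    clauses = []
    for position in sorted(lattice):
        alternatives = []
        for terms in lattice[position].values():
            if len(terms) > 1:
                # A multi-term alternative becomes a quoted phrase.
                alternatives.append('"%s"' % " ".join(terms))
            else:
                alternatives.append(terms[0])
        if len(alternatives) == 1:
            clauses.append(alternatives[0])
        else:
            clauses.append("(" + " OR ".join(alternatives) + ")")
    return " ".join(clauses)


# Linear stream as it might come out of analysis for "heart attack symptoms".
stream = [
    Token("heart", 0, "1"),
    Token("attack", 0, "1"),
    Token("myocardial", 0, "2"),
    Token("infarction", 0, "2"),
    Token("symptoms", 1, "1"),
]

print(to_query_string(rebuild_lattice(stream)))
```

The point is only that a per-token path label is enough to recover which tokens belong to which alternative, which the flat stream alone does not tell you. How positions and nested paths should really be numbered is one of the fine details still to be resolved.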
And I certainly don’t want to propose some giant patch that never gets approved and has to be constantly updated as the rest of Lucene/Solr changes. I would hope that pieces of this large task could be carved off and committed incrementally to avoid having a monster patch at the end.

So, the questions (primarily for committers) for now are:

1. Do people want to see this go forward now (reasonably near future as opposed to more than a year away)?

2. Does the overall approach seem feasible and low enough risk?

3. Will this approach provide people with search results they expect?

4. Is this a high enough value feature change to justify the effort?

As far as support for multi-word synonyms at index time... uhhhhh... that’s another story. I think the two (query vs. index) can be separated. The basic problem at index time is that if you index “heart attack” and “myocardial infarction” at the same positions, queries of “heart infarction” and “myocardial attack” will produce false matches. And if the synonyms in a list have varying lengths, the position of the next term will be off for phrase queries.

In any case, I am proposing moving forward with a full solution at query time only, for now.

-- Jack Krupansky