One of the ongoing potholes of Solr and Lucene is the lack of full support for
multi-word synonyms at query time. The root of the problem is twofold: terms
are presented for analysis one at a time, which precludes recognition of
multi-term synonyms, and the output of the analysis process is a single,
linear stream that discards any graph/lattice structure for multiple
synonyms.
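
To make the first half of that concrete, here is a toy sketch in plain Python. It is not Lucene or Solr code: the SYNONYMS table and both function names are hypothetical, made up purely for illustration. It contrasts analyzing one term at a time with analyzing the whole run of adjacent terms.

```python
# Toy illustration only: SYNONYMS and both functions are hypothetical,
# not part of Lucene or Solr.
SYNONYMS = {
    ("heart", "attack"): [("myocardial", "infarction")],
    ("tv",): [("television",)],
}


def expand_per_term(terms):
    """Analyze each term on its own, one term per call."""
    out = []
    for term in terms:
        # Only single-term keys can ever match here.
        out.append([(term,)] + SYNONYMS.get((term,), []))
    return out


def expand_sequence(terms):
    """Analyze the whole run of adjacent terms, longest match first."""
    out, i = [], 0
    while i < len(terms):
        for length in range(len(terms) - i, 0, -1):
            key = tuple(terms[i:i + length])
            # A single term is always emitted, with or without synonyms.
            if key in SYNONYMS or length == 1:
                out.append([key] + SYNONYMS.get(key, []))
                i += length
                break
    return out
```

With this table, expand_per_term(["heart", "attack"]) never finds "myocardial infarction", because the two-term key is never looked up; expand_sequence(["heart", "attack"]) does find it.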

I intend to file a Jira, but wanted to get some wide attention and feedback on 
whether people are ready to finally tackle this ongoing thorn in the side of an 
otherwise fantastic enterprise search tool.

My proposed solution is fourfold:

1. Add an attribute, call it “path” for now, to the analysis process so that 
tokens coming out of the analysis in a linear stream can be easily 
reconstituted into the graph/lattice for multiple synonyms (single or 
multi-term) at the same position in a token sequence. There could be multiple 
paths at a position and paths can be nested, possibly using a dot notation such 
as “1.3.2”. There may be better ways to do this – this is just an initial 
proposal to get the ball rolling.
2. Add an analysis utility class and method that lets query parsers present a 
sequence of adjacent terms, rather than a single term at a time, so that 
multi-word synonyms can be recognized. Query parsers would be expected to 
present a “term sequence” – a sequence of adjacent terms without intervening 
operators – all at once.
3. Add a Query generation class and method that can take the graph/lattice for 
a token sequence containing nested synonym alternatives and generate the 
appropriate Query structure with BooleanQuery SHOULD or SpanOrQuery to 
implement synonym alternatives at a given position.
4. Modify the most popular query parsers to use the new analysis/generation.
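
As a rough sketch of how steps 1 and 3 might fit together, the plain Python below regroups a linear token stream into a lattice and then emits a nested query description. Everything here is an assumption made for illustration: the token tuple (term, position, path), the two-level reading of the path as "A.K" (K-th term of alternative A at that position), and the TERM/PHRASE/SHOULD tags, which only echo the names above and do not call any Lucene API. Deeper nesting such as "1.3.2" is not handled.

```python
from collections import defaultdict


def build_lattice(tokens):
    """Group a linear stream of (term, position, path) tokens into slots.

    Assumed path format: "A.K" = K-th term of alternative A at that position.
    Returns one slot per position, in order; each slot is a list of
    alternatives, and each alternative is a tuple of terms.
    """
    slots = defaultdict(lambda: defaultdict(dict))
    for term, position, path in tokens:
        alt, index = (int(part) for part in path.split("."))
        slots[position][alt][index] = term
    lattice = []
    for position in sorted(slots):
        alternatives = []
        for alt in sorted(slots[position]):
            terms = slots[position][alt]
            alternatives.append(tuple(terms[k] for k in sorted(terms)))
        lattice.append(alternatives)
    return lattice


def to_query(lattice):
    """Describe each slot as a plain-data clause (hypothetical tags)."""
    clauses = []
    for alternatives in lattice:
        options = [
            ("TERM", alt[0]) if len(alt) == 1 else ("PHRASE", alt)
            for alt in alternatives
        ]
        # A slot with several alternatives becomes an OR over them.
        clauses.append(options[0] if len(options) == 1 else ("SHOULD", options))
    return clauses
```

For the stream heart(0, "1.1"), attack(0, "1.2"), myocardial(0, "2.1"), infarction(0, "2.2"), treatment(1, "1.1"), the first slot comes back as a SHOULD over the two phrases and the second as a single TERM. How the per-slot clauses are then combined (required vs. optional) is left to the query parser.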

Obviously there are lots of fine details to resolve.

What I want to do right now is see if there is general support for pushing 
forward with such a radical change, say for Lucene and Solr 5.0, or I suppose 
some 4.x release after 4.0.

If I get enough support, I’ll file the Jira. Otherwise, I’ll just wait a year 
and then try again.

I’m not personally committing to do the actual work, but simply to get the ball 
rolling and keep it rolling. I’ll do work to the extent that nobody else is 
jumping in first. And I certainly don’t want to propose some giant patch that 
never gets approved and has to be constantly updated as the rest of Lucene/Solr 
changes. I would hope that pieces of this large task could be carved off and 
committed incrementally to avoid having a monster patch at the end.

So, the questions (primarily for committers) for now are:

1. Do people want to see this go forward now (reasonably near future as opposed 
to more than a year away)?
2. Does the overall approach seem feasible and low enough risk?
3. Will this approach provide people with search results they expect?
4. Is this a high enough value feature change to justify the effort?

As far as support for multi-word synonyms at index time... uhhhhh... that’s 
another story. I think the two (query vs. index) can be separated. The basic 
problem at index time is that if you index “heart attack” and “myocardial 
infarction” at the same positions, queries of “heart infarction” and 
“myocardial attack” will have false matches. And if the synonyms in the list 
have varying lengths, the position of the next term will be off for phrase 
queries. 
In any case, I am proposing moving forward with a full solution at query time 
only, for now.
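
Both index-time failure modes can be reproduced with a toy positional index in plain Python. This is only a sketch: phrase_matches is a hypothetical helper, not Lucene's phrase matching, and the "tv" / "television set" pair is an invented example of synonyms with different lengths.

```python
def phrase_matches(postings, phrase):
    """True if the phrase terms occur at consecutive positions.

    postings: dict mapping each term to the set of positions it occupies.
    """
    starts = postings.get(phrase[0], set())
    return any(
        all(
            start + offset in postings.get(term, set())
            for offset, term in enumerate(phrase)
        )
        for start in starts
    )


# "heart attack" and "myocardial infarction" stacked on the same positions.
stacked = {"heart": {0}, "myocardial": {0}, "attack": {1}, "infarction": {1}}

# "tv repair" indexed with the longer synonym "television set" stacked on
# "tv": "set" and "repair" end up sharing position 1.
uneven = {"tv": {0}, "television": {0}, "set": {1}, "repair": {1}}
```

Against stacked, the phrase ["heart", "infarction"] matches even though no document ever said that. Against uneven, ["television", "set", "repair"] fails to match because "repair" sits at position 1 instead of 2, while ["television", "repair"] matches spuriously.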

-- Jack Krupansky
