I've followed the stop-word discussion with some interest, but I've yet to find a solution that completely satisfies our needs. I was wondering if anyone could suggest some other options to try short of a custom handler or building our own queries (DisMax does such a fine job generally!).
We are using DisMax and indexing media titles (books, music). We want our queries to be sensitive to stop-words, but not so sensitive that we fail to match on missing or incorrect stop-words. For example, here is a set of queries and the desired behavior:

* it -> matches It by Stephen King (high relevance) and other titles with it therein, e.g. Some Like It Hot (lower relevance)
* the the -> matches music by The The; other titles with the therein at lower relevance are fine
* the sound of music -> matches The Sound of Music with high relevance
* a sound of music -> still matches The Sound of Music; lower relevance is fine
* the doors -> matches music by The Doors, even though it is indexed just as "Doors" (our data supplier drops the definite article)
* the life -> matches titles The Life with high relevance, matches titles of just Life with lower relevance

Basically, we want direct matches (including stop-words) to be highly relevant, and we use the phrase query mechanism for that, but we also want matches if the user mis-remembers the correct (stopped) prepositions or inserts a few irrelevant stop-words (like articles). We see this in the wild with non-trivial frequency -- the wrong choice of preposition ("on mice and men") or an article used that our data supplier didn't include in the original version ("doors").

One thing we tried is to include both a stopped version and a non-stopped version of the title in the qf field, in the hope that this would retrieve all titles without stop-words and still allow us to include pure stop-word queries ("it").
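
A toy model of that two-field setup may help make its behavior concrete. The sketch below is Python, not Solr code: the field names `title` and `title_stopped`, the stop list, and all helper functions are assumptions made up for illustration, and the clause-building rule (one required clause per query term, satisfiable only by fields whose analyzer keeps that term) is a simplification of how DisMax appears to behave.

```python
# Toy model only: hypothetical field names and stop list, not Solr's real code.
STOP_WORDS = {"a", "an", "the", "of", "it", "on"}

# Field name -> whether that field's analyzer removes stop-words.
FIELDS = {"title": False, "title_stopped": True}


def analyze(text, remove_stops):
    """Lowercase and split on whitespace; optionally drop stop-words."""
    tokens = text.lower().split()
    if remove_stops:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def build_clauses(query):
    """Return one (term, fields) clause per query term.

    A field takes part in a term's clause only if its analyzer
    keeps the term, so stop-words never reach the stopped field.
    """
    clauses = []
    for term in query.lower().split():
        fields = [name for name, stopped in FIELDS.items()
                  if analyze(term, stopped)]
        clauses.append((term, fields))
    return clauses


def index_doc(title):
    """Index one title into every field using that field's analyzer."""
    return {name: set(analyze(title, stopped))
            for name, stopped in FIELDS.items()}


def matches(query, title):
    """Require every clause to match (roughly mm=100%)."""
    doc = index_doc(title)
    return all(any(term in doc[field] for field in fields)
               for term, fields in build_clauses(query))
```

In this model, `matches("it", "Some Like It Hot")` holds because the non-stopped field keeps "it", while `matches("the doors", "Doors")` does not, since the clause for "the" can only be satisfied by the non-stopped field.
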
However, DisMax constructs queries such that mixing stopped and non-stopped fields doesn't work as one might hope, as described well here:

http://www.nabble.com/DisMax-request-handler-doesn%27t-work-with-stopwords--td11015905.html#a11112461

Since qf controls the initial set of results retrieved for DisMax, and we don't want to use a pure stopped set of fields there (because we won't match on "it" as a query) nor a pure non-stopped set (won't get results for "a sound of music"), we'd seem to be out of luck unless we can figure out a way to augment the qf coverage.

We've tried relaxing query term requirements to allow a missing word or two in the query via mm, but recall is amped up too much since non-stop-words tend to be dropped and you get a lot of results that match primarily just across stop-words.

We've also considered creating a sort of equivalence class for all stop-words (defining synonyms to map stops to some special token) which would allow mis-remembered stop-words to be conflated, but then something like "it" would match anything that contained any stop-word -- again, too high on the recall.

What I think we want is something like an "optional stop-word DisMax" that would mark stops as optional and construct queries such that stop-words aren't passed into fields that apply stop-word removal in query clauses (if that makes sense). Has anyone done anything similar or found a better way to handle stops that exhibits the desired behavior?

Thanks in advance for any thoughts! And, being new to Solr, apologies if I'm confused in my reasoning somewhere.

Ron