Hey folks, I wanted to open up a discussion about a change to the usage of SynonymQuery. The goal here is to have a broader library of queries that can address other cases where related terms occupy the same position but don't have the same meaning (such as hypernyms, hyponyms, meronyms, ambiguous terms, and other query expansion situations).
I bring this up because we've noticed (as I'm sure many of you have) the pattern of clients jamming any related term into a synonyms file and being surprised by odd results. I like the idea of enforcing that "synonyms" means exactly-the-same in Lucene-land. It's an easy rule to explain to a client and to set up simple patterns around. So for synonyms, I think leaving SynonymQuery in place works great. But I feel that if that's the rule, we need to open up discussion of other methods of scoring the conceptual 'related term' relationships that usually come up in the context of query expansion.
This paper (https://arxiv.org/pdf/1708.00247.pdf), particularly section 3.2, surveys the current thinking for scoring various query expansion scenarios like those we deal with in the messy, ambiguous uses of synonyms in prod systems (khakis aren't trousers, they're a kind-of trouser). The cool thing is that many of the ideas in this paper seem doable with existing Lucene index stats. So one might imagine a 'related terms' token filter that injects some scoring based on how related a term really is to the original query term, using Jaccard, Dice, or other methods called out in this paper. Another insightful piece of research is this article on concept scoring (https://usabilityetc.com/articles/information-retrieval-concept-matching/), which prioritizes related terms by connectedness and other factors.
Needless to say, how two terms that someone has asserted are related to a query term 'should be' scored is an open area. It's one of those things that will likely forever depend on a number of domain- and application-specific factors. It's possibly a big opportunity for improvement in Lucene, but it's likely about putting the right framework in place to allow for a good default set of query-expansion scoring scenarios with options for customization.
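
To make the Jaccard/Dice idea concrete, here is a rough sketch of how a related term could be weighted by how often it co-occurs with the original query term. This is only an illustration under stated assumptions: the toy postings, the function names, and the `related_term_weights` helper are all made up for this example, and it uses plain Python sets rather than any Lucene API. In a real index the per-term document frequencies are available as stats, but the co-occurrence count would have to be obtained some other way (for instance by intersecting postings).

```python
from typing import Callable, Dict, Iterable, Set

# Hypothetical toy inverted index: term -> set of doc IDs containing it.
POSTINGS: Dict[str, Set[int]] = {
    "trousers": {1, 2, 3, 4},
    "pants": {1, 2, 3, 4},   # behaves like a true synonym in this toy data
    "khakis": {3, 4, 5},     # a kind-of trouser, only partly overlapping
}


def jaccard(a: str, b: str, postings: Dict[str, Set[int]] = POSTINGS) -> float:
    """|docs(a) & docs(b)| / |docs(a) | docs(b)|, or 0.0 if neither term occurs."""
    docs_a, docs_b = postings.get(a, set()), postings.get(b, set())
    union = len(docs_a | docs_b)
    return len(docs_a & docs_b) / union if union else 0.0


def dice(a: str, b: str, postings: Dict[str, Set[int]] = POSTINGS) -> float:
    """2 * |docs(a) & docs(b)| / (|docs(a)| + |docs(b)|), or 0.0 if neither occurs."""
    docs_a, docs_b = postings.get(a, set()), postings.get(b, set())
    total = len(docs_a) + len(docs_b)
    return 2 * len(docs_a & docs_b) / total if total else 0.0


def related_term_weights(
    original: str,
    candidates: Iterable[str],
    measure: Callable[[str, str], float] = jaccard,
) -> Dict[str, float]:
    """Assumed helper: map each candidate expansion term to a relatedness weight."""
    return {term: measure(original, term) for term in candidates}


if __name__ == "__main__":
    print(related_term_weights("trousers", ["pants", "khakis"]))
    print(related_term_weights("trousers", ["pants", "khakis"], measure=dice))
```

With this toy data, "pants" gets a weight of 1.0 under either measure while "khakis" gets something lower, which is roughly the behavior one would want: exact synonyms score like the original term, and loosely related terms are discounted.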
What I'm proposing is:
- Submit a small patch that restricts SynonymQuery to tokens of type "SYNONYM" in the same position, which allows some short-term work to be done with the current Lucene QueryBuilder. Any additional non-synonym terms would be appended as a boolean query for now.
- Begin work on alternate 'related-term' scoring systems that also key off the token type in QueryBuilder to create custom scoring using built-in term stats. The possibilities here are endless, up to weighted related terms (i.e. Alessandro's patch), incorporating Rocchio relevance feedback, etc.
I'm curious what folks would think of a patch for bullet one, followed by other patches down the road for additional functionality?
(Related to the discussion in this Elasticsearch PR: https://github.com/elastic/elasticsearch/pull/35422#issuecomment-439095249)
--
CTO, OpenSource Connections
Author, Relevant Search
http://o19s.com/doug
