What about the use of word embeddings (see https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa) to compute word similarity?
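To make the suggestion a bit more concrete, here is a minimal sketch of weighting related terms by the cosine similarity of their embedding vectors. The vectors below are toy values made up for illustration, and `cosine_similarity`, `related_term_weights`, and `TOY_VECTORS` are hypothetical names, not part of Lucene or of any word2vec library. Real vectors would come from a trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # A zero vector has no direction; treat it as unrelated.
        return 0.0
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors, invented for this example only.
TOY_VECTORS = {
    "khakis": [0.9, 0.8, 0.1],
    "trousers": [0.8, 0.9, 0.2],
    "laptop": [0.1, 0.2, 0.9],
}


def related_term_weights(term, candidates, vectors):
    """Map each candidate expansion term to its similarity with `term`."""
    return {c: cosine_similarity(vectors[term], vectors[c]) for c in candidates}


weights = related_term_weights("khakis", ["trousers", "laptop"], TOY_VECTORS)
for candidate, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{candidate}: {weight:.3f}")
```

A weight like this could serve as a boost for the expanded term instead of treating it as an exact synonym, though how well that works will likely depend on the corpus the embeddings were trained on.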
On Sat, Nov 17, 2018 at 5:52 AM Doug Turnbull <[email protected]> wrote:

> Hey folks,
>
> I wanted to open up a discussion about a change to the usage of
> SynonymQuery. The goal here is to have a broader library of queries that
> can address other cases where related terms occupy the same position but
> don't have the same meaning (such as hypernyms, hyponyms, meronyms,
> ambiguous terms, and other query expansion situations).
>
> I bring this up because we've noticed (as I'm sure many of you have) the
> pattern of clients jamming any related term into a synonyms file and being
> surprised with odd results. I like the idea of enforcing "synonyms" means
> exactly-the-same in Lucene-land. It's an easy thing to tell a client and
> setup simple patterns. So for synonyms, I think leaving SynonymQuery in
> place works great.
>
> But I feel if that's the rule, we need to open up discussion of other
> methods of scoring conceptual 'related term' relationships that usually
> comes up in the context of query expansion. This paper (
> https://arxiv.org/pdf/1708.00247.pdf), particularly section 3.2, surveys
> the current thinking for scoring various query expansion scenarios like
> those we deal with in the messy, ambiguous uses of synonyms in prod systems
> (khakis aren't trousers, they're a kind-of trouser).
>
> The cool thing is many of the ideas in this paper seem doable with
> existing Lucene index stats. So one might imagine a 'related terms' token
> filter that injected some scoring based on how related it really is to
> the original query term using Jaccard, Dice, or other methods called out in
> this paper.
>
> Another insightful set of research is this article on concept scoring (
> https://usabilityetc.com/articles/information-retrieval-concept-matching/),
> which prioritizes related terms by connectedness and other factors.
>
> Needless to say, it's an open area how two terms someone has asserted are
> related to a query term 'should be' scored. It's one of those things that
> likely will forever depend on a number of domain and application specific
> factors. It's possibly a big opportunity of improvement for Lucene - but
> likely is about putting the right framework in place to allow for good
> default set of query-expansion scoring scenarios with options for
> customization.
>
> What I'm proposing is:
>
> - Submit a small patch that restricts SynonymQuery to tokens of type
>   "SYNONYM" in the same posn, which allows some short term work to be done
>   with the current Lucene QueryBuilder. Any additional non-synonym terms
>   would be appended as a boolean query for now
> - Begin work on alternate 'related-term' scoring systems that also key
>   off the token type in QueryBuilder to create custom scoring using built-in
>   term stats. The possibilities here are endless, up to weighted related
>   terms (ie Alessandro's patch), feeding back Rocchio relevance feedback, etc
>
> I'm curious what folks would think of a patch for bullet one followed by
> other patches down the road for additional functionality?
>
> (related to discussion in this Elasticsearch PR
> https://github.com/elastic/elasticsearch/pull/35422#issuecomment-439095249
> )
>
> --
> CTO, OpenSource Connections
> Author, Relevant Search
> http://o19s.com/doug
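
For comparison with the embedding idea, the Jaccard and Dice coefficients mentioned in the quoted message can be sketched from document counts alone. This is only an illustration under assumptions: it presumes we can obtain the number of documents containing each term and the number containing both, and the function names and sample counts are hypothetical, not Lucene APIs or real index stats.

```python
def jaccard(df_a, df_b, df_both):
    """Jaccard coefficient: documents with both terms / documents with either."""
    union = df_a + df_b - df_both
    return df_both / union if union > 0 else 0.0


def dice(df_a, df_b, df_both):
    """Dice coefficient: twice the overlap / sum of the two document counts."""
    total = df_a + df_b
    return 2 * df_both / total if total > 0 else 0.0


# Made-up counts: 100 docs contain "trousers", 50 contain "khakis", 25 contain both.
print(jaccard(100, 50, 25))  # 25 / 125
print(dice(100, 50, 25))     # 50 / 150
```

Either value could then scale the score of the related term relative to the original query term, which is one way to read the 'related terms' token filter idea above.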
