Re: SynonymQuery / Query Expansion Strategies Discussion

J. Delgado Fri, 16 Nov 2018 22:16:04 -0800

What about the use of word embeddings (see
https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa)
to compute word similarity?
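
A minimal sketch of the idea, assuming each term already has an embedding
vector: related terms could be ranked by cosine similarity to the query term.
The three-dimensional vectors and the rank_related helper below are
hypothetical, made up purely for illustration. Real word2vec vectors are
learned from a corpus and typically have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, hand-written for this example only.
embeddings = {
    "khakis": [0.9, 0.8, 0.1],
    "trousers": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.0, 0.9],
}


def rank_related(term, vectors):
    """Return (other_term, similarity) pairs, most similar first."""
    target = vectors[term]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != term
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for other, score in rank_related("khakis", embeddings):
        print(f"{other}: {score:.3f}")
```

The similarity could then serve as a per-term weight at query time, though
how to map it onto a score is exactly the open question raised below.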


On Sat, Nov 17, 2018 at 5:52 AM Doug Turnbull <
[email protected]> wrote:

> Hey folks,
>
> I wanted to open up a discussion about a change to the usage of
> SynonymQuery. The goal here is to have a broader library of queries that
> can address other cases where related terms occupy the same position but
> don't have the same meaning (such as hypernyms, hyponyms, meronyms,
> ambiguous terms, and other query expansion situations).
>
>
> I bring this up because we've noticed (as I'm sure many of you have) the
> pattern of clients jamming any related term into a synonyms file and being
> surprised by odd results. I like the idea of enforcing that "synonyms" means
> exactly-the-same in Lucene-land. It's an easy thing to tell a client and
> to set up simple patterns around. So for synonyms, I think leaving
> SynonymQuery in place works great.
>
> But I feel if that's the rule, we need to open up discussion of other
> methods of scoring conceptual 'related term' relationships that usually
> come up in the context of query expansion. This paper (
> https://arxiv.org/pdf/1708.00247.pdf), particularly section 3.2, surveys
> the current thinking for scoring various query expansion scenarios like
> those we deal with in the messy, ambiguous uses of synonyms in prod systems
> (khakis aren't trousers, they're a kind-of trouser).
>
>
> The cool thing is many of the ideas in this paper seem doable with
> existing Lucene index stats. So one might imagine a 'related terms' token
> filter that injected some scoring based on how related it really is to
> the original query term using Jaccard, Dice, or other methods called out in
> this paper.
>
>
> Another insightful set of research is this article on concept scoring (
> https://usabilityetc.com/articles/information-retrieval-concept-matching/),
> which prioritizes related terms by connectedness and other factors.
>
> Needless to say, it's an open question how two terms someone has asserted
> are related to a query term 'should be' scored. It's one of those things
> that will likely forever depend on a number of domain- and
> application-specific factors. It's possibly a big opportunity for
> improvement in Lucene, but it is likely about putting the right framework
> in place to allow for a good default set of query-expansion scoring
> scenarios with options for customization.
>
> What I'm proposing is:
>
>
>    - Submit a small patch that restricts SynonymQuery to tokens of type
>      "SYNONYM" in the same position, which allows some short-term work to
>      be done with the current Lucene QueryBuilder. Any additional
>      non-synonym terms would be appended as a boolean query for now.
>
>    - Begin work on alternate 'related-term' scoring systems that also key
>      off the token type in QueryBuilder to create custom scoring using
>      built-in term stats. The possibilities here are endless, up to
>      weighted related terms (i.e. Alessandro's patch), feeding back Rocchio
>      relevance feedback, etc.
>
>
> I'm curious what folks would think of a patch for bullet one followed by
> other patches down the road for additional functionality?
>
> (related to discussion in this Elasticsearch PR
>
> https://github.com/elastic/elasticsearch/pull/35422#issuecomment-439095249
> )
>
> --
> CTO, OpenSource Connections
> Author, Relevant Search
> http://o19s.com/doug
>
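
For reference, the Jaccard and Dice measures mentioned in the quoted message
can be computed from document-level co-occurrence alone. This is only a
sketch: the posting sets below are hypothetical toy data, and it is an
assumption that one would derive them from per-term postings in the index
rather than from some other source.

```python
def jaccard(docs_a, docs_b):
    """|A intersect B| / |A union B| over the sets of doc IDs containing each term."""
    union = docs_a | docs_b
    if not union:
        return 0.0
    return len(docs_a & docs_b) / len(union)


def dice(docs_a, docs_b):
    """2 * |A intersect B| / (|A| + |B|) over the same sets of doc IDs."""
    total = len(docs_a) + len(docs_b)
    if total == 0:
        return 0.0
    return 2 * len(docs_a & docs_b) / total


# Hypothetical postings: IDs of the documents that contain each term.
postings = {
    "khakis": {1, 2, 3, 4},
    "trousers": {2, 3, 4, 5, 6},
    "banana": {7, 8},
}

if __name__ == "__main__":
    query_term = "khakis"
    for related in ("trousers", "banana"):
        j = jaccard(postings[query_term], postings[related])
        d = dice(postings[query_term], postings[related])
        print(f"{query_term} vs {related}: jaccard={j:.3f} dice={d:.3f}")
```

Both measures are 0 when two terms never share a document and 1 when they
always do, so either could be used to scale how much an expanded term
contributes relative to the original query term.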
