Hi all,
last Sunday we spent some time on this topic; our considerations follow.

N.B. We didn't check the state of the art (thanks, Doug, for the nice survey
you shared; I will definitely take a look later on).
We just wanted to work out an initial improvement that can later be
refined using state-of-the-art formulas.
It is somewhat related to Jim's idea.
This was the output of our brainstorming:

*Introduction*
Currently in Apache Solr (and Elasticsearch) there is no supported way to
manage true synonyms, hypernyms and hyponyms at query time.
A first attempt to add support for this was made by Doug Turnbull with
the approach in the pull requests at [1].
We think that approach was a good starting point, but we believe it
can be improved.

*Weaknesses of Current Approach*
In our opinion the current approach has the following weaknesses:
- it tries to guess the hypernym/hyponym/synonym relation from the DF of
the terms
- it doesn't necessarily favour the original query term
- it favours rarer hypernyms/hyponyms/synonyms and doesn't differentiate
between them.

*Proposed Improvements*

   - Onym Class Priority Order
   - Onyms within a Class Ranked by Popularity


*1 - Onym Class Priority Order*
We believe it should be possible to give different priorities to different
classes of onyms (hypernym/hyponym/synonym).
Specifically, we believe it should be possible to model this priority
in scoring:

*Original Query Term > True Synonym > Hyponym > Hypernym*

An additional benefit could be gained if this inequality could be customised
based on user requirements,
*e.g.*
by adding different shades of onyms and a slightly different ordering:
Original Query Term > True Synonym > Hyponym > 2nd-level Hyponym > Hypernym

*2 - Onyms within a Class Ranked by Popularity*

Within the same class we believe we need to favour the most popular
(highest document frequency) onyms,
e.g. within true synonyms we'll favour the most popular one.
The same applies within hyponyms or hypernyms.
In general, within an onym class we want terms with a higher document
frequency to rank higher.

*Proposed Solution*
The proposed solution is to score the different onyms in this way:

*Original Query Term* -> IDFQueryTerm
*True Synonym (boost: 1.0)* -> IDFQueryTerm * 1/(1+IDFSynonym)
*Hyponym (boost<1.0)* -> IDFQueryTerm * 1/(1+IDFHyponym)
*Hypernym (boost<1.0)* -> IDFQueryTerm * 1/(1+IDFHypernym)

You may have noticed the introduction of the boost factor.
This is the key point of the onym classification.
All the onyms with the same boost belong to the same class.
This gives users the flexibility to rank the different onym classes
according to their preference.
The boost solves problem 1 (*Onym Class Priority Order*).
Multiplying the original term's IDF by the second part of the formula
addresses problem 2 (*Onyms within a Class Ranked by Popularity*): the
factor 1/(1+IDF) shrinks as the onym gets rarer, so more popular onyms score
higher. It also guarantees that the original term wins, because the factor
is below 1 whenever the onym's IDF is positive.
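
A minimal sketch of the scoring above, in plain Python rather than Lucene
code. It rests on a few assumptions that are not stated in this mail: the
boost is applied as a plain multiplier on the onym's weight, IDF uses the
BM25-style formula ln(1 + (N - df + 0.5) / (df + 0.5)), and all terms,
boosts and document frequencies are made up for illustration.

```python
import math


def bm25_idf(doc_freq: int, doc_count: int) -> float:
    """BM25-style IDF (assumed here); positive for any 0 < df <= N."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def onym_weight(query_idf: float, onym_idf: float, boost: float) -> float:
    """boost * IDFQueryTerm * 1/(1+IDFOnym).

    Applying the boost as a multiplier is an assumption of this sketch.
    """
    return boost * query_idf / (1.0 + onym_idf)


DOC_COUNT = 1_000_000  # hypothetical index size

# (term, class boost, document frequency): all values are hypothetical
expansions = [
    ("pants", 1.0, 40_000),      # true synonym, popular
    ("slacks", 1.0, 2_000),      # true synonym, rare
    ("khakis", 0.8, 5_000),      # hyponym, popular
    ("chinos", 0.8, 500),        # hyponym, rare
    ("clothing", 0.5, 200_000),  # hypernym
]

query_term, query_df = "trousers", 10_000
query_idf = bm25_idf(query_df, DOC_COUNT)

# The original query term keeps its full IDF as weight.
weights = {query_term: query_idf}
for term, boost, df in expansions:
    weights[term] = onym_weight(query_idf, bm25_idf(df, DOC_COUNT), boost)

# Terms sorted from highest to lowest weight.
ranking = sorted(weights, key=weights.get, reverse=True)
```

With these numbers the original term comes first, and within each class the
term with the higher document frequency gets the higher weight. The order
across classes depends on how far apart the boosts are compared to the
spread of the IDF values, so the boosts would need tuning on a real index.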

*Implementation*
The suggested implementation will cover different areas:
- Implement the scoring logic through blended DF, proxy term stats or a
proxy similarity (the best way to implement the designed scoring still
needs to be investigated).
- Give the user a configuration file to model the onyms. A first version
is already available through [2]. A first improvement could be to add
support for taxonomies such as:
/big cats/lion-panthera leo/simba-kimba
A final solution would allow integration with custom knowledge bases,
WordNet, etc.
- What about performance? We could add a configuration parameter that
cuts the query expansion based on a boost threshold. We can think of the
boost as a measure of closeness to the original concept (a lower boost
means a more distant term), so the user should be able to cut down the
expanded terms in favour of performance.
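
To make the taxonomy and threshold ideas more concrete, here is a rough
Python sketch. It is only one possible reading: it assumes that "/"
separates taxonomy levels (broader to narrower), that "-" separates
equivalent terms on the same level, and that the boost decays with each
level of distance. The function names and the decay values (0.8 per level
down, 0.5 per level up) are hypothetical.

```python
def parse_taxonomy_path(path: str) -> list[list[str]]:
    """Split a path like '/a/b-c/d' into levels of equivalent terms."""
    return [level.split("-") for level in path.strip("/").split("/")]


def expand(term: str, levels: list[list[str]],
           hyponym_boost: float = 0.8,
           hypernym_boost: float = 0.5) -> dict[str, float]:
    """Map every other term in the taxonomy path to a boost.

    Same level -> true synonym (boost 1.0); deeper levels -> hyponyms;
    shallower levels -> hypernyms. The boost decays per level of distance.
    Raises StopIteration if the term is not in the path.
    """
    idx = next(i for i, level in enumerate(levels) if term in level)
    boosts = {}
    for i, level in enumerate(levels):
        distance = abs(i - idx)
        if i == idx:
            boost = 1.0
        elif i > idx:
            boost = hyponym_boost ** distance
        else:
            boost = hypernym_boost ** distance
        for other in level:
            if other != term:
                boosts[other] = boost
    return boosts


def cut_by_boost(boosts: dict[str, float], min_boost: float) -> dict[str, float]:
    """Drop expanded terms whose boost is below the configured threshold."""
    return {t: b for t, b in boosts.items() if b >= min_boost}


levels = parse_taxonomy_path("/big cats/lion-panthera leo/simba-kimba")
all_expansions = expand("lion", levels)
kept = cut_by_boost(all_expansions, min_boost=0.6)
```

With a threshold of 0.6, the hypernym "big cats" (boost 0.5) is cut from the
expansion of "lion", while "panthera leo", "simba" and "kimba" are kept.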

[1] https://issues.apache.org/jira/browse/SOLR-11662,
https://github.com/elastic/elasticsearch/pull/35422

[2] https://issues.apache.org/jira/browse/SOLR-12238
--------------------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
www.sease.io


On Wed, Nov 21, 2018 at 2:34 AM Michael Sokolov <[email protected]> wrote:

> This is a great idea. It would also be compelling to modify the term
> frequency using this deboosting so that stacked indexed terms can be
> weighted according to their closeness to the original term.
>
> On Tue, Nov 20, 2018, 2:19 PM jim ferenczi <[email protected]> wrote:
>
>> Sorry for the late reply,
>>
>> > So perhaps one way forward to contribute this sort of thing into Lucene
>> is we could implement additional QueryBuilder implementations that provide
>> such functionality?
>>
>> I am not sure. I mentioned Solr and ES because I thought it was about
>> adding taxonomies and complex expansion mechanisms to query builders, but I
>> wonder if we can have a simple mechanism to just (de)boost stacked tokens
>> in the QueryBuilder. It could be a new attribute that token filters would
>> use when they produce stacked tokens and that the QueryBuilder checks when
>> it builds the SynonymQuery. We already have a TermFrequencyAttribute to
>> alter the frequency of a term when indexing, so we could have the same
>> mechanism for query term boosting?
>>
>> On Sun, Nov 18, 2018 at 02:24, Doug Turnbull <
>> [email protected]> wrote:
>>
>>> Thanks Jim
>>>
>>> Yeah, now that I think about it - I agree that perhaps the simplest
>>> option would be to create alternate query builders. I think there are a
>>> couple of enhancements to the base class that would be nice, such as
>>> - Some additional token attributes passed to newSynonymQuery, such as
>>> the type (was this a synonym or hyponym or something else...)
>>> - The ability to differentiate between the original query term and the
>>> generated synonym terms
>>> - Consistent support for phrases
>>>
>>> I think part of my goal too is to help people without the use of
>>> plugins, as we are often in scenarios at OpenSource Connections where
>>> people won't be able to use a plugin. In this case alternate expansions
>>> around hypernyms/hyponyms/?... are a pretty frequent gap that search teams
>>> have using Solr/Lucene/ES.
>>>
>>> So perhaps one way forward to contribute this sort of thing into Lucene
>>> is we could implement additional QueryBuilder implementations that provide
>>> such functionality?
>>>
>>> Thanks
>>> -Doug
>>>
>>> On Sat, Nov 17, 2018 at 3:41 PM jim ferenczi <[email protected]>
>>> wrote:
>>>
>>>> You can easily customize the query that is used for synonyms in a
>>>> custom QueryBuilder. The javadocs of the *newSynonymQuery* says "This
>>>> is intended for subclasses that wish to customize the generated queries."
>>>> so I don't think we need to do anything there. I agree that it is sometimes
>>>> better to use something different than the SynonymQuery but in the general
>>>> case it works as expected and can be combined with other terms naturally.
>>>> The kind of customization you want to achieve could be done in a plugin (or
>>>> in Solr or ES) that extends the QueryBuilder, you can also use custom token
>>>> filters and alter the query the way you want. My point here is that the
>>>> QueryBuilder should remain simple, you can add the complexity you want in a
>>>> subclass.
>>>> However, I think there is another area we need to fix: the scoring of
>>>> multi-term synonyms is broken (compared to the SynonymQuery) and could be
>>>> improved, so we need something similar to the SynonymQuery that handles
>>>> multiple phrases.
>>>>
>>>>
>>>> On Sat, Nov 17, 2018 at 07:19, Doug Turnbull <
>>>> [email protected]> wrote:
>>>>
>>>>> Yes that is another good area (there are many). Although of course
>>>>> embeddings have their own challenges and complexities. (they often capture
>>>>> shared context, but not shared meaning).
>>>>>
>>>>> It's a data point though of something we'd want to include in such a
>>>>> framework, though not sure where it would go on the roadmap...
>>>>>
>>>>> On Sat, Nov 17, 2018 at 1:15 AM J. Delgado <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> What about the use of word embeddings (see
>>>>>>
>>>>>> https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa)
>>>>>> to compute word similarity?
>>>>>>
>>>>>> On Sat, Nov 17, 2018 at 5:52 AM Doug Turnbull <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hey folks,
>>>>>>>
>>>>>>> I wanted to open up a discussion about a change to the usage of
>>>>>>> SynonymQuery. The goal here is to have a broader library of queries that
>>>>>>> can address other cases where related terms occupy the same position but
>>>>>>> don't have the same meaning (such as hypernyms, hyponyms, meronyms,
>>>>>>> ambiguous terms, and other query expansion situations).
>>>>>>>
>>>>>>>
>>>>>>> I bring this up because we've noticed (as I'm sure many of you have)
>>>>>>> the pattern of clients jamming any related term into a synonyms file and
>>>>>>> being surprised with odd results. I like the idea of enforcing
>>>>>>> "synonyms" means exactly-the-same in Lucene-land. It's an easy thing to
>>>>>>> tell a client and setup simple patterns. So for synonyms, I think
>>>>>>> leaving SynonymQuery in place works great.
>>>>>>>
>>>>>>> But I feel if that's the rule, we need to open up discussion of
>>>>>>> other methods of scoring conceptual 'related term' relationships that
>>>>>>> usually come up in the context of query expansion. This paper (
>>>>>>> https://arxiv.org/pdf/1708.00247.pdf), particularly section 3.2,
>>>>>>> surveys the current thinking for scoring various query expansion
>>>>>>> scenarios like those we deal with in the messy, ambiguous uses of
>>>>>>> synonyms in prod systems (khakis aren't trousers, they're a kind-of
>>>>>>> trouser).
>>>>>>>
>>>>>>>
>>>>>>> The cool thing is many of the ideas in this paper seem doable with
>>>>>>> existing Lucene index stats. So one might imagine a 'related terms'
>>>>>>> token filter that injected some scoring based on how related it really
>>>>>>> is to the original query term using Jaccard, Dice, or other methods
>>>>>>> called out in this paper.
>>>>>>>
>>>>>>>
>>>>>>> Another insightful set of research is this article on concept
>>>>>>> scoring (
>>>>>>> https://usabilityetc.com/articles/information-retrieval-concept-matching/
>>>>>>> ), which prioritizes related terms by connectedness and other
>>>>>>> factors.
>>>>>>>
>>>>>>> Needless to say, it's an open area how two terms someone has
>>>>>>> asserted are related to a query term 'should be' scored. It's one of
>>>>>>> those things that likely will forever depend on a number of domain and
>>>>>>> application specific factors. It's possibly a big opportunity of
>>>>>>> improvement for Lucene - but likely is about putting the right
>>>>>>> framework in place to allow for a good default set of query-expansion
>>>>>>> scoring scenarios with options for customization.
>>>>>>>
>>>>>>> What I'm proposing is:
>>>>>>>
>>>>>>>
>>>>>>>    - Submit a small patch that restricts SynonymQuery to tokens of
>>>>>>>    type "SYNONYM" in the same posn, which allows some short term work
>>>>>>>    to be done with the current Lucene QueryBuilder. Any additional
>>>>>>>    non-synonym terms would be appended as a boolean query for now
>>>>>>>
>>>>>>>    - Begin work on alternate 'related-term' scoring systems that also
>>>>>>>    key off the token type in QueryBuilder to create custom scoring using
>>>>>>>    built-in term stats. The possibilities here are endless, up to
>>>>>>>    weighted related terms (ie Alessandro's patch), feeding back Rocchio
>>>>>>>    relevance feedback, etc
>>>>>>>
>>>>>>>
>>>>>>> I'm curious what folks would think of a patch for bullet one
>>>>>>> followed by other patches down the road for additional functionality?
>>>>>>>
>>>>>>> (related to discussion in this Elasticsearch PR
>>>>>>>
>>>>>>>
>>>>>>> https://github.com/elastic/elasticsearch/pull/35422#issuecomment-439095249
>>>>>>> )
>>>>>>>
>>>>>>> --
>>>>>>> CTO, OpenSource Connections
>>>>>>> Author, Relevant Search
>>>>>>> http://o19s.com/doug
>>>>>>>
