Re: SynonymQuery / Query Expansion Strategies Discussion

Robert Muir Thu, 22 Nov 2018 04:19:14 -0800

There is already analyzeBoolean/analyzeMultiBoolean there that you can
use for this. You can look at any attribute on the tokenstream you
want. I don't see any need to add any more API.


On 11/21/18, Doug Turnbull <[email protected]> wrote:
> I agree there is a tension between analysis and query parser
> responsibilities (or external to how queries are constructed). I wonder
> what you'd think of making QueryBuilder easier to subclass by passing
> more term metadata to newSynonymQuery (such as token types). This would let
> you select an alternative strategy (such as some of the scoring systems used
> in the query expansion paper https://arxiv.org/pdf/1708.00247.pdf), or do
> something with a term labeled a hyponym/hypernym in a QueryBuilder
> subclass.
>
> -Doug
>
> On Wed, Nov 21, 2018 at 8:09 AM Robert Muir <[email protected]> wrote:
>
>> I don't think we should put scoring stuff into the analysis chain like
>> this. It already has a laundry list of responsibilities.
>>
>> The analysis chain can tell you the term is stacked, or it's a certain
>> type, or occurs a certain number of times, but it shouldn't be supplying
>> things such as floating point boosts. That kind of scoring
>> manipulation really needs to happen in query parsing or somewhere else.
>>
>> On 11/20/18, jim ferenczi <[email protected]> wrote:
>> > Sorry for the late reply,
>> >
>> >> So perhaps one way forward to contribute this sort of thing into
>> >> Lucene is we could implement additional QueryBuilder implementations
>> >> that provide such functionality?
>> >
>> > I am not sure. I mentioned Solr and ES because I thought this was about
>> > adding taxonomies and complex expansion mechanisms to query builders,
>> > but I wonder if we can have a simple mechanism to just (de)boost
>> > stacked tokens in the QueryBuilder. It could be a new attribute that
>> > token filters would use when they produce stacked tokens and that the
>> > QueryBuilder checks when it builds the SynonymQuery. We already have a
>> > TermFrequencyAttribute to alter the frequency of a term when indexing,
>> > so we could have the same mechanism for query term boosting?
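The attribute-based idea above can be pictured with a small, self-contained model. This is only a sketch under stated assumptions: `Token`, its `boost` field, and `StackedTokenBoosts` are hypothetical stand-ins rather than Lucene classes, and the `term^boost` rendering is illustrative, not an actual query. It treats a position increment of 0 as "stacked" and applies the boost only to those tokens.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: models a per-token boost consulted when stacked tokens are combined. */
final class StackedTokenBoosts {
    private StackedTokenBoosts() {}

    /** Hypothetical stand-in for a token's term, position increment, and boost attributes. */
    static final class Token {
        final String term;
        final int positionIncrement; // 0 means the token is stacked on the previous one
        final float boost;           // assumed attribute, set by a hypothetical token filter

        Token(String term, int positionIncrement, float boost) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.boost = boost;
        }
    }

    /** Renders each term in order, appending "^boost" only for stacked tokens. */
    static List<String> boostedTerms(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        for (Token t : tokens) {
            boolean stacked = t.positionIncrement == 0;
            // Original terms are left untouched; only stacked terms carry a boost.
            out.add(stacked ? t.term + "^" + t.boost : t.term);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> tokens = List.of(
            new Token("trousers", 1, 1.0f),
            new Token("pants", 0, 0.8f),
            new Token("khakis", 0, 0.5f));
        System.out.println(String.join(" ", boostedTerms(tokens)));
    }
}
```

Whether such a boost belongs on the token at all is exactly what the replies in this thread dispute; the sketch only shows the mechanics of the proposal.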
>> >
>> > On Sun, Nov 18, 2018 at 2:24 AM Doug Turnbull <
>> > [email protected]> wrote:
>> >
>> >> Thanks Jim
>> >>
>> >> Yeah, now that I think about it, I agree that perhaps the simplest
>> >> option would be to create alternate query builders. I think there are
>> >> a few enhancements to the base class that would be nice, such as:
>> >> - Some additional token attributes passed to newSynonymQuery, such as
>> >> the type (was this a synonym or hyponym or something else...)
>> >> - The ability to differentiate between the original query term and the
>> >> generated synonym terms
>> >> - Consistent support for phrases
>> >>
>> >> I think part of my goal too is to help people without the use of
>> >> plugins, as we are often in scenarios at OpenSource Connections where
>> >> people won't be able to use a plugin. In this case alternate expansions
>> >> around hypernyms/hyponyms/?... are a pretty frequent gap that search
>> >> teams have when using Solr/Lucene/ES.
>> >>
>> >> So perhaps one way forward to contribute this sort of thing into
>> >> Lucene is we could implement additional QueryBuilder implementations
>> >> that provide such functionality?
>> >>
>> >> Thanks
>> >> -Doug
>> >>
>> >> On Sat, Nov 17, 2018 at 3:41 PM jim ferenczi <[email protected]>
>> >> wrote:
>> >>
>> >>> You can easily customize the query that is used for synonyms in a
>> >>> custom QueryBuilder. The javadoc of *newSynonymQuery* says "This is
>> >>> intended for subclasses that wish to customize the generated
>> >>> queries.", so I don't think we need to do anything there. I agree that
>> >>> it is sometimes better to use something other than the SynonymQuery,
>> >>> but in the general case it works as expected and can be combined with
>> >>> other terms naturally.
>> >>> The kind of customization you want to achieve could be done in a
>> >>> plugin (or in Solr or ES) that extends the QueryBuilder; you can also
>> >>> use custom token filters and alter the query the way you want. My
>> >>> point here is that the QueryBuilder should remain simple; you can add
>> >>> the complexity you want in a subclass.
>> >>> However, I think there is another area we need to fix: the scoring of
>> >>> multi-term synonyms is broken (compared to the SynonymQuery) and could
>> >>> be improved, so we need something similar to the SynonymQuery that
>> >>> handles multiple phrases.
>> >>>
>> >>>
>> >>> On Sat, Nov 17, 2018 at 7:19 AM Doug Turnbull <
>> >>> [email protected]> wrote:
>> >>>
>> >>>> Yes, that is another good area (there are many), although of course
>> >>>> embeddings have their own challenges and complexities (they often
>> >>>> capture shared context, but not shared meaning).
>> >>>>
>> >>>> It's a data point for something we'd want to include in such a
>> >>>> framework, though I'm not sure where it would go on the roadmap...
>> >>>>
>> >>>> On Sat, Nov 17, 2018 at 1:15 AM J. Delgado <[email protected]>
>> >>>> wrote:
>> >>>>
>> >>>>> What about the use of word embeddings (see
>> >>>>> https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa)
>> >>>>> to compute word similarity?
>> >>>>>
>> >>>>> On Sat, Nov 17, 2018 at 5:52 AM Doug Turnbull <
>> >>>>> [email protected]> wrote:
>> >>>>>
>> >>>>>> Hey folks,
>> >>>>>>
>> >>>>>> I wanted to open up a discussion about a change to the usage of
>> >>>>>> SynonymQuery. The goal here is to have a broader library of
>> >>>>>> queries that can address other cases where related terms occupy
>> >>>>>> the same position but don't have the same meaning (such as
>> >>>>>> hypernyms, hyponyms, meronyms, ambiguous terms, and other query
>> >>>>>> expansion situations).
>> >>>>>>
>> >>>>>>
>> >>>>>> I bring this up because we've noticed (as I'm sure many of you
>> >>>>>> have) the pattern of clients jamming any related term into a
>> >>>>>> synonyms file and being surprised by odd results. I like the idea
>> >>>>>> of enforcing that "synonyms" means exactly-the-same in
>> >>>>>> Lucene-land. It's an easy thing to tell a client and set up simple
>> >>>>>> patterns. So for synonyms, I think leaving SynonymQuery in place
>> >>>>>> works great.
>> >>>>>>
>> >>>>>> But I feel that if that's the rule, we need to open up discussion
>> >>>>>> of other methods of scoring the conceptual 'related term'
>> >>>>>> relationships that usually come up in the context of query
>> >>>>>> expansion. This paper (https://arxiv.org/pdf/1708.00247.pdf),
>> >>>>>> particularly section 3.2, surveys the current thinking for scoring
>> >>>>>> various query expansion scenarios like those we deal with in the
>> >>>>>> messy, ambiguous uses of synonyms in prod systems (khakis aren't
>> >>>>>> trousers, they're a kind of trouser).
>> >>>>>>
>> >>>>>>
>> >>>>>> The cool thing is many of the ideas in this paper seem doable with
>> >>>>>> existing Lucene index stats. So one might imagine a 'related
>> >>>>>> terms' token filter that injected some scoring based on how
>> >>>>>> related it really is to the original query term using Jaccard,
>> >>>>>> Dice, or other methods called out in this paper.
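To make the Jaccard and Dice measures mentioned above concrete, here is a minimal sketch that computes both from document counts. It is self-contained and does not touch Lucene: the class name `RelatedTermSimilarity` and its parameter names are made up for illustration, and it assumes the per-term document frequencies and the count of documents containing both terms have already been obtained from index statistics by some other means.

```java
/** Sketch only: set-overlap similarity between two terms, computed from document counts. */
final class RelatedTermSimilarity {
    private RelatedTermSimilarity() {}

    /**
     * Jaccard coefficient: |A and B| / |A or B|.
     * dfA and dfB are the number of documents containing each term;
     * dfBoth is the number of documents containing both terms.
     */
    static double jaccard(long dfA, long dfB, long dfBoth) {
        long union = dfA + dfB - dfBoth;
        return union == 0 ? 0.0 : (double) dfBoth / union;
    }

    /** Dice coefficient: 2 * |A and B| / (|A| + |B|). */
    static double dice(long dfA, long dfB, long dfBoth) {
        long total = dfA + dfB;
        return total == 0 ? 0.0 : 2.0 * dfBoth / total;
    }

    public static void main(String[] args) {
        // Hypothetical counts: "khakis" in 40 docs, "trousers" in 100, both in 20.
        System.out.println("jaccard = " + jaccard(40, 100, 20));
        System.out.println("dice    = " + dice(40, 100, 20));
    }
}
```

Both measures are 1.0 when the two terms occur in exactly the same documents and fall toward 0.0 as the overlap shrinks, which is what would let a related term be weighted by how closely it tracks the original.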
>> >>>>>>
>> >>>>>>
>> >>>>>> Another insightful piece of research is this article on concept
>> >>>>>> scoring
>> >>>>>> (https://usabilityetc.com/articles/information-retrieval-concept-matching/),
>> >>>>>> which prioritizes related terms by connectedness and other
>> >>>>>> factors.
>> >>>>>>
>> >>>>>> Needless to say, it's an open question how two terms someone has
>> >>>>>> asserted are related to a query term 'should be' scored. It's one
>> >>>>>> of those things that will likely forever depend on a number of
>> >>>>>> domain- and application-specific factors. It's possibly a big
>> >>>>>> opportunity for improvement in Lucene, but it is likely about
>> >>>>>> putting the right framework in place to allow for a good default
>> >>>>>> set of query-expansion scoring scenarios with options for
>> >>>>>> customization.
>> >>>>>>
>> >>>>>> What I'm proposing is:
>> >>>>>>
>> >>>>>>
>> >>>>>>    - Submit a small patch that restricts SynonymQuery to tokens of
>> >>>>>>      type "SYNONYM" in the same position, which allows some
>> >>>>>>      short-term work to be done with the current Lucene
>> >>>>>>      QueryBuilder. Any additional non-synonym terms would be
>> >>>>>>      appended as a boolean query for now.
>> >>>>>>    - Begin work on alternate 'related-term' scoring systems that
>> >>>>>>      also key off the token type in QueryBuilder to create custom
>> >>>>>>      scoring using built-in term stats. The possibilities here are
>> >>>>>>      endless, up to weighted related terms (i.e. Alessandro's
>> >>>>>>      patch), feeding back Rocchio relevance feedback, etc.
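One possible reading of the first bullet above, sketched as a self-contained model rather than as a Lucene patch. Everything here is an assumption made for illustration: `Token`, `PositionGroup`, and `StackedTokenGrouping` are hypothetical classes, the "HYPONYM" type is made up, and `describe()` produces a human-readable string, not query syntax. The sketch assumes the first token at a position is the original term, that stacked tokens typed "SYNONYM" join it in the synonym group, and that any other stacked token becomes a separate optional clause.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: splits tokens stacked at one position by their type. */
final class StackedTokenGrouping {
    private StackedTokenGrouping() {}

    /** Hypothetical stand-in for a token's term, position increment, and type attributes. */
    static final class Token {
        final String term;
        final int positionIncrement; // 0 means the token is stacked on the previous one
        final String type;

        Token(String term, int positionIncrement, String type) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.type = type;
        }
    }

    /** Terms sharing one position, split into exact synonyms and other related terms. */
    static final class PositionGroup {
        final List<String> synonyms = new ArrayList<>();
        final List<String> related = new ArrayList<>();

        /** Illustrative rendering only; this is not Lucene query syntax. */
        String describe() {
            String syn = "Synonym(" + String.join(" ", synonyms) + ")";
            return related.isEmpty()
                ? syn
                : "(" + syn + " OR " + String.join(" OR ", related) + ")";
        }
    }

    static List<PositionGroup> group(List<Token> tokens) {
        List<PositionGroup> groups = new ArrayList<>();
        PositionGroup current = null;
        for (Token t : tokens) {
            if (current == null || t.positionIncrement > 0) {
                // A new position starts; the original term anchors the synonym group.
                current = new PositionGroup();
                groups.add(current);
                current.synonyms.add(t.term);
            } else if ("SYNONYM".equals(t.type)) {
                current.synonyms.add(t.term);
            } else {
                // Stacked but not a true synonym: keep it as a separate clause.
                current.related.add(t.term);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Token> tokens = List.of(
            new Token("trousers", 1, "word"),
            new Token("pants", 0, "SYNONYM"),
            new Token("khakis", 0, "HYPONYM")); // made-up type for illustration
        for (PositionGroup g : group(tokens)) {
            System.out.println(g.describe());
        }
    }
}
```

Keeping the non-synonym terms in their own list leaves room for the second bullet: a later step could score each related term differently depending on its type.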
>> >>>>>>
>> >>>>>>
>> >>>>>> I'm curious what folks would think of a patch for bullet one,
>> >>>>>> followed by other patches down the road for additional
>> >>>>>> functionality?
>> >>>>>>
>> >>>>>> (related to discussion in this Elasticsearch PR:
>> >>>>>> https://github.com/elastic/elasticsearch/pull/35422#issuecomment-439095249
>> >>>>>> )
>> >>>>>>
>> >>>>>> --
>> >>>>>> CTO, OpenSource Connections
>> >>>>>> Author, Relevant Search
>> >>>>>> http://o19s.com/doug
>> >>>>>>
>> >>
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>

