QueryBuilder already has analyzeBoolean/analyzeMultiBoolean, which you can use for this. You can look at any attribute on the TokenStream you want, so I don't see any need to add more API.
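To make the idea of looking at token attributes concrete, here is a minimal sketch in plain Java. It deliberately does not use the Lucene API, so it runs on its own: `Token`, `build`, `RELATED_BOOST`, the "HYPERNYM" type and the "chinos" synonym are all hypothetical stand-ins invented for illustration. In a real QueryBuilder subclass the same kind of decision could presumably be made while walking the TokenStream, reading the term, type and position increment attributes instead of the fields below.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java stand-in, NOT the Lucene API. All names here are assumptions.
class StackedTokenSketch {

    /** Hypothetical stand-in for one token read from an analysis chain. */
    static final class Token {
        final String term;
        final String type;   // e.g. "word", "SYNONYM", or a made-up "HYPERNYM"
        final int posInc;    // position increment; 0 means stacked on the previous token

        Token(String term, String type, int posInc) {
            this.term = term;
            this.type = type;
            this.posInc = posInc;
        }
    }

    // Assumed de-boost for stacked tokens that are not exact synonyms.
    static final double RELATED_BOOST = 0.5;

    /** Builds a readable query string, one clause per token position. */
    static String build(String field, List<Token> tokens) {
        List<String> clauses = new ArrayList<>();
        List<String> same = new ArrayList<>();     // original term plus SYNONYM-typed tokens
        List<String> related = new ArrayList<>();  // other stacked tokens
        for (Token t : tokens) {
            if (t.posInc > 0) {
                // A new position starts: emit the clause for the previous one.
                flush(field, same, related, clauses);
            }
            if (t.posInc > 0 || "SYNONYM".equals(t.type)) {
                same.add(t.term);
            } else {
                related.add(t.term);
            }
        }
        flush(field, same, related, clauses);
        return String.join(" ", clauses);
    }

    private static void flush(String field, List<String> same,
                              List<String> related, List<String> clauses) {
        if (same.isEmpty() && related.isEmpty()) {
            return;
        }
        List<String> parts = new ArrayList<>();
        if (same.size() == 1) {
            parts.add(field + ":" + same.get(0));
        } else if (same.size() > 1) {
            // Exact synonyms are grouped together, like a synonym query would.
            List<String> terms = new ArrayList<>();
            for (String s : same) {
                terms.add(field + ":" + s);
            }
            parts.add("Synonym(" + String.join(" ", terms) + ")");
        }
        for (String r : related) {
            // Related-but-not-identical terms become separate, de-boosted clauses.
            parts.add("(" + field + ":" + r + ")^" + RELATED_BOOST);
        }
        clauses.add(parts.size() == 1 ? parts.get(0) : "(" + String.join(" ", parts) + ")");
        same.clear();
        related.clear();
    }

    public static void main(String[] args) {
        List<Token> tokens = Arrays.asList(
            new Token("khakis", "word", 1),
            new Token("chinos", "SYNONYM", 0),
            new Token("trousers", "HYPERNYM", 0),
            new Token("sale", "word", 1));
        System.out.println(build("body", tokens));
    }
}
```

Running `main` prints `(Synonym(body:khakis body:chinos) (body:trousers)^0.5) body:sale`. The boost is applied where the query is built, not in the analysis chain; the tokens only carry their type and position.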
On 11/21/18, Doug Turnbull <[email protected]> wrote:
> I agree there is a tension between analysis and query parser
> responsibilities (or external to how queries are constructed). I wonder
> what you'd think of making QueryBuilder more easily subclassable by passing
> more term metadata to newSynonymQuery (such as types etc). This would let
> you select an alternative strategy (such as some of the scoring systems
> used in the query expansion paper https://arxiv.org/pdf/1708.00247.pdf).
> Or do something with a term labeled a hyponym/hypernym in a QueryBuilder
> subclass.
>
> -Doug
>
> On Wed, Nov 21, 2018 at 8:09 AM Robert Muir <[email protected]> wrote:
>
>> I don't think we should put scoring stuff into the analysis chain like
>> this. It already has a laundry list of responsibilities.
>>
>> The analysis chain can tell you the term is stacked, or it's a certain
>> type, or occurs a certain number of times, but it shouldn't be supplying
>> things such as floating point boosts. That kind of scoring
>> manipulation really needs to happen in query parsing or somewhere else.
>>
>> On 11/20/18, jim ferenczi <[email protected]> wrote:
>> > Sorry for the late reply,
>> >
>> >> So perhaps one way forward to contribute this sort of thing into Lucene
>> >> is we could implement additional QueryBuilder implementations that
>> >> provide such functionality?
>> >
>> > I am not sure. I mentioned Solr and ES because I thought it was about
>> > adding taxonomies and complex expansion mechanisms to query builders,
>> > but I wonder if we can have a simple mechanism to just (de)boost stacked
>> > tokens in the QueryBuilder. It could be a new attribute that token
>> > filters would use when they produce stacked tokens and that the
>> > QueryBuilder checks when it builds the SynonymQuery. We already have a
>> > TermFrequencyAttribute to alter the frequency of a term when indexing,
>> > so we could have the same mechanism for query term boosting?
>> >
>> > On Sun, Nov 18, 2018 at 02:24, Doug Turnbull <[email protected]> wrote:
>> >
>> >> Thanks Jim
>> >>
>> >> Yeah, now that I think about it, I agree that perhaps the simplest
>> >> option would be to create alternate query builders. I think there are a
>> >> couple of enhancements to the base class that would be nice, such as:
>> >> - Some additional token attributes passed to newSynonymQuery, such as
>> >>   the type (was this a synonym or hyponym or something else...)
>> >> - The ability to differentiate between the original query term and the
>> >>   generated synonym terms
>> >> - Consistent support for phrases
>> >>
>> >> I think part of my goal too is to help people without the use of
>> >> plugins, as we are often in scenarios at OpenSource Connections where
>> >> people won't be able to use a plugin. In this case alternate expansions
>> >> around hypernyms/hyponyms/?... are a pretty frequent gap that search
>> >> teams have using Solr/Lucene/ES.
>> >>
>> >> So perhaps one way forward to contribute this sort of thing into Lucene
>> >> is we could implement additional QueryBuilder implementations that
>> >> provide such functionality?
>> >>
>> >> Thanks
>> >> -Doug
>> >>
>> >> On Sat, Nov 17, 2018 at 3:41 PM jim ferenczi <[email protected]> wrote:
>> >>
>> >>> You can easily customize the query that is used for synonyms in a
>> >>> custom QueryBuilder. The javadocs of *newSynonymQuery* say "This is
>> >>> intended for subclasses that wish to customize the generated queries."
>> >>> so I don't think we need to do anything there. I agree that it is
>> >>> sometimes better to use something different than the SynonymQuery, but
>> >>> in the general case it works as expected and can be combined with
>> >>> other terms naturally.
>> >>> The kind of customization you want to achieve could be done in a
>> >>> plugin (or in Solr or ES) that extends the QueryBuilder; you can also
>> >>> use custom token filters and alter the query the way you want. My
>> >>> point here is that the QueryBuilder should remain simple; you can add
>> >>> the complexity you want in a subclass.
>> >>> However, I think there is another area we need to fix: the scoring of
>> >>> multi-term synonyms is broken (compared to the SynonymQuery) and could
>> >>> be improved, so we need something similar to the SynonymQuery that
>> >>> handles multi phrases.
>> >>>
>> >>> On Sat, Nov 17, 2018 at 07:19, Doug Turnbull <[email protected]> wrote:
>> >>>
>> >>>> Yes that is another good area (there are many). Although of course
>> >>>> embeddings have their own challenges and complexities (they often
>> >>>> capture shared context, but not shared meaning).
>> >>>>
>> >>>> It's a data point though of something we'd want to include in such a
>> >>>> framework, though not sure where it would go on the roadmap...
>> >>>>
>> >>>> On Sat, Nov 17, 2018 at 1:15 AM J. Delgado <[email protected]> wrote:
>> >>>>
>> >>>>> What about the use of word embeddings (see
>> >>>>> https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa)
>> >>>>> to compute word similarity?
>> >>>>>
>> >>>>> On Sat, Nov 17, 2018 at 5:52 AM Doug Turnbull <[email protected]> wrote:
>> >>>>>
>> >>>>>> Hey folks,
>> >>>>>>
>> >>>>>> I wanted to open up a discussion about a change to the usage of
>> >>>>>> SynonymQuery.
>> >>>>>> The goal here is to have a broader library of queries that can
>> >>>>>> address other cases where related terms occupy the same position
>> >>>>>> but don't have the same meaning (such as hypernyms, hyponyms,
>> >>>>>> meronyms, ambiguous terms, and other query expansion situations).
>> >>>>>>
>> >>>>>> I bring this up because we've noticed (as I'm sure many of you
>> >>>>>> have) the pattern of clients jamming any related term into a
>> >>>>>> synonyms file and being surprised with odd results. I like the
>> >>>>>> idea of enforcing that "synonyms" means exactly-the-same in
>> >>>>>> Lucene-land. It's an easy thing to tell a client and set up simple
>> >>>>>> patterns. So for synonyms, I think leaving SynonymQuery in place
>> >>>>>> works great.
>> >>>>>>
>> >>>>>> But I feel if that's the rule, we need to open up discussion of
>> >>>>>> other methods of scoring conceptual 'related term' relationships
>> >>>>>> that usually come up in the context of query expansion. This paper
>> >>>>>> (https://arxiv.org/pdf/1708.00247.pdf), particularly section 3.2,
>> >>>>>> surveys the current thinking for scoring various query expansion
>> >>>>>> scenarios like those we deal with in the messy, ambiguous uses of
>> >>>>>> synonyms in prod systems (khakis aren't trousers, they're a kind
>> >>>>>> of trouser).
>> >>>>>>
>> >>>>>> The cool thing is many of the ideas in this paper seem doable with
>> >>>>>> existing Lucene index stats. So one might imagine a 'related
>> >>>>>> terms' token filter that injected some scoring based on how
>> >>>>>> related it really is to the original query term using Jaccard,
>> >>>>>> Dice, or other methods called out in this paper.
>> >>>>>>
>> >>>>>> Another insightful set of research is this article on concept
>> >>>>>> scoring
>> >>>>>> (https://usabilityetc.com/articles/information-retrieval-concept-matching/),
>> >>>>>> which prioritizes related terms by connectedness and other
>> >>>>>> factors.
>> >>>>>>
>> >>>>>> Needless to say, it's an open area how two terms someone has
>> >>>>>> asserted are related to a query term 'should be' scored. It's one
>> >>>>>> of those things that likely will forever depend on a number of
>> >>>>>> domain- and application-specific factors. It's possibly a big
>> >>>>>> opportunity for improvement for Lucene, but it is likely about
>> >>>>>> putting the right framework in place to allow for a good default
>> >>>>>> set of query-expansion scoring scenarios with options for
>> >>>>>> customization.
>> >>>>>>
>> >>>>>> What I'm proposing is:
>> >>>>>>
>> >>>>>> - Submit a small patch that restricts SynonymQuery to tokens of
>> >>>>>>   type "SYNONYM" in the same position, which allows some
>> >>>>>>   short-term work to be done with the current Lucene QueryBuilder.
>> >>>>>>   Any additional non-synonym terms would be appended as a boolean
>> >>>>>>   query for now.
>> >>>>>> - Begin work on alternate 'related-term' scoring systems that also
>> >>>>>>   key off the token type in QueryBuilder to create custom scoring
>> >>>>>>   using built-in term stats. The possibilities here are endless,
>> >>>>>>   up to weighted related terms (i.e. Alessandro's patch), feeding
>> >>>>>>   back Rocchio relevance feedback, etc.
>> >>>>>>
>> >>>>>> I'm curious what folks would think of a patch for bullet one
>> >>>>>> followed by other patches down the road for additional
>> >>>>>> functionality?
>> >>>>>>
>> >>>>>> (related to discussion in this Elasticsearch PR
>> >>>>>> https://github.com/elastic/elasticsearch/pull/35422#issuecomment-439095249)
>> >>>>>>
>> >>>>>> --
>> >>>>>> CTO, OpenSource Connections
>> >>>>>> Author, Relevant Search
>> >>>>>> http://o19s.com/doug
