Re: Best fuzzy match on multiple terms

Matthias Müller Fri, 14 Jun 2019 11:10:02 -0700

Hi Boris,

"Acer campestre 'Rozi'" now receives a higher score with DFISimilarity
and BM25Similarity (with tuned 'b') instead of the standard BM25.


It really was a scoring/normalization issue: while "Rozi" got a
higher score, "Acer" and "campestre" received lower values, and the
combined result left the desired hit a fraction of a point below the
top result.
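
To make the normalization effect concrete, here is a minimal sketch in
plain Java. It assumes the textbook BM25 formula, not Lucene's actual
BM25Similarity code. The class name, the per-term weights and the
average field length of 3 tokens are made-up values, picked only so
that the totals land near the scores from the original question.

```java
/** Textbook BM25 on a toy example; illustrative only, not Lucene's implementation. */
final class Bm25LengthNormSketch {

    /** Textbook BM25 contribution of one matching term (tf = term frequency in the field). */
    static double termScore(double idf, double tf, double docLen, double avgDocLen,
                            double k1, double b) {
        // b = 0 switches length normalization off, b = 1 applies it fully.
        double lengthNorm = 1.0 - b + b * (docLen / avgDocLen);
        return idf * (tf * (k1 + 1.0)) / (tf + k1 * lengthNorm);
    }

    /** Sum of term scores for a document in which every listed term occurs once. */
    static double docScore(double[] idfs, double docLen, double avgDocLen,
                           double k1, double b) {
        double sum = 0.0;
        for (double idf : idfs) {
            sum += termScore(idf, 1.0, docLen, avgDocLen, k1, b);
        }
        return sum;
    }

    public static void main(String[] args) {
        double k1 = 1.2;
        double avgLen = 3.0;                          // assumed average field length
        double[] genusSpecies = {5.32, 5.32};         // hypothetical weights: "acer", "campestre"
        double[] withCultivar = {5.32, 5.32, 1.315};  // same, plus a weak fuzzy "rozi" match

        for (double b : new double[] {0.75, 0.3}) {
            double shortEntry = docScore(genusSpecies, 2.0, avgLen, k1, b);
            double fullEntry = docScore(withCultivar, 3.0, avgLen, k1, b);
            System.out.printf("b=%.2f  Acer campestre=%.3f  Acer campestre 'Rozi'=%.3f%n",
                    b, shortEntry, fullEntry);
        }
    }
}
```

With these numbers the two-token entry scores higher at b = 0.75,
because it is shorter than average and gets a length bonus. At b = 0.3
the bonus shrinks enough for the extra 'Rozi' match to win.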

-Matthias



On Friday, 14.06.2019, 10:41 -0400, [email protected] wrote:
> These are great suggestions; I was going to suggest looking at the
> explain output of the query, too.
> 
> I really wonder why, in your case, the 'Rozi' entry does not get a
> higher score.
> 
> Is there any effect from the " ' " chars?
> 
> 
> In my case I have a sort of reverse situation:
> 
> My query is maink~2 (mains was a special case that I am still
> investigating).
> 
> I would expect the second result below to come first, as it is the
> shorter and closer hit, and the first result to come second.
> 
> NASHUA in results: MAIN DUNSTABLE NASHUA HILLSBOROUGH NEW HAMPSHIRE 
> UNITED STATES in the 0 th result
> NASHUA in results: MAIN NASHUA HILLSBOROUGH NEW HAMPSHIRE UNITED
> STATES 
> in the 1 th result
> 
> 
> Best regards
> 
> 
> On 6/14/19 6:45 AM, Matthias Müller wrote:
> > Hi Namgyu and Tomoko,
> > 
> > your hint towards Explanation was very helpful; I was not aware of
> > this feature.
> > 
> > I have now experimented with different scoring functions, and it
> > seems that DFISimilarity and BM25Similarity (with a lower 'b')
> > produce results in the direction I prefer, though not perfect for
> > some cases [1].
> > 
> > The fuzzy term queries probably generate hard-to-predict
> > similarities on additional fields. These add scores to the overall
> > result and also affect normalization.
> > 
> > On the positive side, the preferred matches are somewhere in the
> > top ranks, so maybe a rule-based assessment of the top N hits might
> > help me achieve what I want.
> > 
> > 
> > - Matthias
> > 
> > 
> > [1]:
> > "Abelia xgrandiflora" -> "Abelia xgrandiflora 'Wevo1' BELLA DONNA"
> > (score=13.7869625)
> > instead of the direct match
> > "Abelia xgrandiflora" -> "Abelia xgrandiflora" (score=13.74585)
> > 
> > On Friday, 14.06.2019, 16:32 +0900, Tomoko Uchida wrote:
> > > Hi Matthias,
> > > 
> > > What similarity class are you using?
> > > Just a guess... but one possible reason is document (field)
> > > length normalization. Generally speaking, shorter documents get
> > > higher scores than longer documents. (I have seen that the classic
> > > TFIDF similarity tends to give much higher scores to shorter
> > > documents. Newer versions of Lucene use BM25 similarity as the
> > > default, which moderates that tendency and has a tuning parameter
> > > 'b' to control the normalization effect.)
> > > See also:
> > > https://www.elastic.co/guide/en/elasticsearch/guide/current/pluggable-similarites.html
> > > 
> > > As Namgyu Kim said, the explain() API can help you examine the
> > > details.
> > > 
> > > Tomoko
> > > 
> > > On Fri, 14 Jun 2019 at 1:27, Namgyu Kim <[email protected]> wrote:
> > > > Dear Matthias,
> > > > 
> > > > First, you need to know about Lucene's ranking concept.
> > > > Lucene's default ranking is BM25, and it depends on the state
> > > > of your index.
> > > > (https://en.wikipedia.org/wiki/Okapi_BM25)
> > > > There can be many reasons.
> > > > One thing I can guess is that your index has many documents
> > > > containing the term 'rozi', so it carries little weight.
> > > > This is the effect of IDF (Inverse Document Frequency).
> > > > Anyway, if you want to control the ranking in detail, you need
> > > > to understand the BM25 formula.
> > > > 
> > > > And Lucene can tell you how your score came out.
> > > > Explanation can be used to get it.
> > > > I attach the sample code.
> > > > ======================================
> > > > IndexSearcher searcher = new IndexSearcher(reader);
> > > > TopDocs docs = searcher.search(q, hitsPerPage);
> > > > ScoreDoc[] hits = docs.scoreDocs;
> > > > 
> > > > for (int i = 0; i < hits.length; ++i) {
> > > >    int docId = hits[i].doc;
> > > >    Explanation explanation = searcher.explain(q, docId);
> > > >    // You can see how the score is calculated
> > > >    System.out.println("Explanation : " + explanation.toString());
> > > > }
> > > > ======================================
> > > > 
> > > > I hope it helps :D
> > > > 
> > > > Best regards,
> > > > Namgyu Kim
> > > > 
> > > > P.S. For BM25, the default value in Lucene is k1 = 1.2, b =
> > > > 0.75.
> > > > 
> > > > On Fri, 14 Jun 2019 at 12:54 AM, <[email protected]> wrote:
> > > > 
> > > > > I would suggest trying (indexing and searching) without the '
> > > > > characters and seeing whether you can find it first.
> > > > > 
> > > > > Thanks
> > > > > 
> > > > > 
> > > > > On 6/13/19 11:25 AM, Matthias Müller wrote:
> > > > > > I am currently matching botanic names (with possible
> > > > > > misspellings) against an indexed reference list with
> > > > > > Lucene. After quick progress in the beginning, I am
> > > > > > struggling with the proper query design to achieve the
> > > > > > ranking result I want.
> > > > > > 
> > > > > > Here is an example:
> > > > > > 
> > > > > > Search term: Acer campestre 'Rozi'
> > > > > > 
> > > > > > Tokenized (decomposed) representation:
> > > > > > acer
> > > > > > campestre
> > > > > > rozi
> > > > > > 
> > > > > > Top 10 hits:
> > > > > > {value=Acer campestre, score=12.288989}
> > > > > > {value=Acer campestre 'Rozi', score=11.955223} // <- why is it 2nd?
> > > > > > {value=Acer campestre 'Arends', score=10.640412}
> > > > > > {value=Acer campestre subsp. leiocarpon, score=10.640412}
> > > > > > {value=Acer campestre 'Carnival', score=10.640412}
> > > > > > {value=Acer campestre 'Commodore', score=10.640412}
> > > > > > {value=Acer campestre 'Nanum', score=10.640412}
> > > > > > {value=Acer campestre 'Elsrijk', score=10.640412}
> > > > > > {value=Acer campestre 'Fastigiatum', score=10.640412}
> > > > > > {value=Acer campestre 'Geessink', score=10.640412}]
> > > > > > 
> > > > > > 
> > > > > > And here is how I create my queries:
> > > > > > 
> > > > > > final BooleanQuery.Builder builder = new BooleanQuery.Builder();
> > > > > >     // add individual tokens to query
> > > > > >     for (String token : fuzzyTokens) {
> > > > > >       final Term term = new Term(NAME_TOKENS.name(), token);
> > > > > >       final FuzzyQuery fq = new FuzzyQuery(term);
> > > > > >       builder.add(fq, BooleanClause.Occur.SHOULD);
> > > > > >     }
> > > > > >     return builder.build();
> > > > > > }
> > > > > > 
> > > > > > 
> > > > > > Input names are analyzed with a StandardTokenizer and
> > > > > > Lowercase filter when they are added to the IndexWriter.
> > > > > > 
> > > > > > 
> > > > > > My question: How can I get a ranking that scores
> > > > > > "Acer campestre 'Rozi'" higher than "Acer campestre"?
> > > > > > I am sure there is an obvious way to achieve this that I
> > > > > > have yet failed to find.
> > > > > > 
> > > > > > 
> > > > > > -Matthias
> > > > > > 
> > > > > > 
> > > > > > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: [email protected]
> > > > > > For additional commands, e-mail: [email protected]
> > > > > > 
