Hi Boris, "Acer campestre 'Rozi'" now receives a higher score with DFISimilarity and BM25Similarity (with tuned 'b') instead of the standard BM25.
It really was a scoring/normalization issue: while "Rozi" got a higher
score, "Acer" and "campestre" received lower values, and the combined
result left the desired hit a fraction of a point below the top-ranked one.

-Matthias

On Friday, 14.06.2019, at 10:41 -0400, baris.ka...@oracle.com wrote:
> These are great suggestions, I was going to suggest explain plan of
> query, too.
>
> I really wonder in your case why the 'Rozi' entry does not get a higher
> score.
>
> Is there any effect from " ' " chars?
>
>
> In my case I have sort of the reverse situation:
>
> my query is maink~2 (mains was a special case which I am still
> investigating)
>
> I would expect the second result below to be the first result, as it is
> the shorter and closest hit, and the first result to be the second result.
>
> NASHUA in results: MAIN DUNSTABLE NASHUA HILLSBOROUGH NEW HAMPSHIRE UNITED STATES in the 0 th result
> NASHUA in results: MAIN NASHUA HILLSBOROUGH NEW HAMPSHIRE UNITED STATES in the 1 th result
>
>
> Best regards
>
>
> On 6/14/19 6:45 AM, Matthias Müller wrote:
> > Hi Namgyu and Tomoko,
> >
> > your hint towards Explanation was very helpful and I was not aware of
> > this feature.
> >
> > I have now experimented with different scoring functions and it seems
> > that DFISimilarity and BM25Similarity (with lower 'b') produce results
> > in the direction I prefer, though not perfect for some cases [1].
> >
> > The fuzzy term queries probably generate hardly predictable
> > similarities on additional fields. These add scores to the overall
> > result and also affect normalization.
> >
> > On the positive side, the preferred matches are somewhere in the top
> > ranks. So maybe a rule-based assessment of the top N hits might help me
> > achieve what I want.
> >
> >
> > - Matthias
> >
> >
> > [1]:
> > "Abelia xgrandiflora" -> "Abelia xgrandiflora 'Wevo1' BELLA DONNA"
> > (score=13.7869625)
> > instead of the direct match
> > "Abelia xgrandiflora" -> "Abelia xgrandiflora" (score=13.74585)
> >
> > On Friday, 14.06.2019, at 16:32 +0900, Tomoko Uchida wrote:
> > > Hi Matthias,
> > >
> > > What similarity class are you using?
> > > Just a guess... but possibly one reason is document (field) length
> > > normalization. Generally speaking, shorter documents get higher
> > > scores than longer documents. (I saw that classic TFIDF similarity
> > > tends to give much higher scores to shorter documents. Newer versions
> > > of Lucene use BM25 similarity as the default, which moderates this
> > > tendency and has a tuning parameter 'b' to control the normalization
> > > effect.)
> > > See also:
> > > https://urldefense.proofpoint.com/v2/url?u=https-3A__www.elastic.co_guide_en_elasticsearch_guide_current_pluggable-2Dsimilarites.html&d=DwIDaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=nlG5z5NcNdIbQAiX-BKNeyLlULCbaezrgocEvPhQkl4&m=EQ--nOw2fv4xC2jDVd61qmWey2RW5y71Jx5-esA5Epo&s=xgCA5llK_2kxvxRc4arpgbd1rhgRrSkOqD5j57CA-6Q&e=
> > >
> > > As Namgyu Kim said, the explain() API could help you to examine the
> > > details.
> > >
> > > Tomoko
> > >
> > > On Fri, 14 Jun 2019 at 1:27, Namgyu Kim <kng0...@gmail.com> wrote:
> > > > Dear Matthias,
> > > >
> > > > First you need to know about Lucene's ranking concept.
> > > > Lucene's basic ranking is BM25 and it depends on your index status.
> > > > (
> > > > https://urldefense.proofpoint.com/v2/url?u=https-3A__en.wikipedia.org_wiki_Okapi-5FBM25&d=DwIDaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=nlG5z5NcNdIbQAiX-BKNeyLlULCbaezrgocEvPhQkl4&m=EQ--nOw2fv4xC2jDVd61qmWey2RW5y71Jx5-esA5Epo&s=3M7Yh2-tiEHd8DVhJc5fBeVfE65WvnaXsphnx2pCdfg&e=
> > > > )
> > > > There can be many reasons.
> > > > One thing I can guess is that your index contains a lot of 'rozi'
> > > > terms, so the term carries little weight.
> > > > This is called IDF (Inverse Document Frequency).
> > > > Anyway, if you want fine-grained control over the ranking, you need
> > > > to understand the BM25 formula.
> > > >
> > > > And Lucene can tell you how your score came out.
> > > > Explanation can be used to get it.
> > > > I attach the sample code.
> > > > ======================================
> > > > IndexSearcher searcher = new IndexSearcher(reader);
> > > > TopDocs docs = searcher.search(q, hitsPerPage);
> > > > ScoreDoc[] hits = docs.scoreDocs;
> > > >
> > > > for (int i = 0; i < hits.length; ++i) {
> > > >     int docId = hits[i].doc;
> > > >     Explanation explanation = searcher.explain(q, docId);
> > > >     // You can see how the score is calculated
> > > >     System.out.println("Explanation : " + explanation.toString());
> > > > }
> > > > ======================================
> > > >
> > > > I hope it helps :D
> > > >
> > > > Best regards,
> > > > Namgyu Kim
> > > >
> > > > P.S. For BM25, the default values in Lucene are k1 = 1.2, b = 0.75.
> > > >
> > > > On Fri, 14 Jun 2019 at 12:54 AM, <baris.ka...@oracle.com> wrote:
> > > >
> > > > > I would suggest trying (indexing and searching) without the '
> > > > > characters and seeing whether you can find it first.
> > > > >
> > > > > Thanks
> > > > >
> > > > >
> > > > > On 6/13/19 11:25 AM, Matthias Müller wrote:
> > > > > > I am currently matching botanic names (with possible
> > > > > > misspellings) against an indexed reference list with Lucene.
> > > > > > After quick progress in the beginning, I am struggling with
> > > > > > the proper query design to achieve the ranking result I want.
> > > > > >
> > > > > > Here is an example:
> > > > > >
> > > > > > Search term: Acer campestre 'Rozi'
> > > > > >
> > > > > > Tokenized (decomposed) representation:
> > > > > > acer
> > > > > > campestre
> > > > > > rozi
> > > > > >
> > > > > > Top 10 hits:
> > > > > > {value=Acer campestre, score=12.288989}
> > > > > > {value=Acer campestre 'Rozi', score=11.955223} // <- why is it 2nd?
> > > > > > {value=Acer campestre 'Arends', score=10.640412}
> > > > > > {value=Acer campestre subsp. leiocarpon, score=10.640412}
> > > > > > {value=Acer campestre 'Carnival', score=10.640412}
> > > > > > {value=Acer campestre 'Commodore', score=10.640412}
> > > > > > {value=Acer campestre 'Nanum', score=10.640412}
> > > > > > {value=Acer campestre 'Elsrijk', score=10.640412}
> > > > > > {value=Acer campestre 'Fastigiatum', score=10.640412}
> > > > > > {value=Acer campestre 'Geessink', score=10.640412}]
> > > > > >
> > > > > >
> > > > > > And here is how I create my queries:
> > > > > >
> > > > > >     final BooleanQuery.Builder builder = new BooleanQuery.Builder();
> > > > > >     // add individual tokens to query
> > > > > >     for (String token : fuzzyTokens) {
> > > > > >         final Term term = new Term(NAME_TOKENS.name(), token);
> > > > > >         final FuzzyQuery fq = new FuzzyQuery(term);
> > > > > >         builder.add(fq, BooleanClause.Occur.SHOULD);
> > > > > >     }
> > > > > >     return builder.build();
> > > > > > }
> > > > > >
> > > > > >
> > > > > > Input names are analyzed with a StandardTokenizer and Lowercase
> > > > > > filter when they are added to the IndexWriter.
> > > > > >
> > > > > >
> > > > > > My question: How can I get a ranking that scores
> > > > > > "Acer campestre 'Rozi'" higher than "Acer campestre"?
> > > > > > I am sure there is an obvious way to achieve this that I have
> > > > > > yet failed to find.
> > > > > >
> > > > > >
> > > > > > -Matthias
> > > > > >
> > > > > >
> > > > > > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > > > > For additional commands, e-mail: java-user-h...@lucene.apache.org
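
For readers following the thread, here is a rough sketch of the length-normalization effect discussed above. It does not use Lucene. It is plain Java that evaluates the textbook BM25 term formula for two documents, one with two tokens (like "Acer campestre") and one with three (like "Acer campestre 'Rozi'"). The IDF weights, the average field length, and the class and method names are invented for illustration and are not taken from the index in this thread. Lucene's own BM25Similarity may scale scores differently, and FuzzyQuery adds its own per-term weighting, so only the ordering is meant to be instructive.

```java
public class Bm25LengthNormSketch {

    // Textbook BM25 contribution of one matching term.
    // b controls how strongly the field length (docLen / avgDocLen) is penalized.
    public static double termScore(double idf, double tf, double docLen,
                                   double avgDocLen, double k1, double b) {
        double lengthNorm = 1.0 - b + b * (docLen / avgDocLen);
        return idf * (tf * (k1 + 1.0)) / (tf + k1 * lengthNorm);
    }

    // Sum of the contributions of all matching terms, each assumed to occur once.
    public static double docScore(double[] idfs, double docLen,
                                  double avgDocLen, double k1, double b) {
        double sum = 0.0;
        for (double idf : idfs) {
            sum += termScore(idf, 1.0, docLen, avgDocLen, k1, b);
        }
        return sum;
    }

    public static void main(String[] args) {
        double k1 = 1.2;          // Lucene's default k1, as mentioned in the thread
        double avgDocLen = 3.0;   // assumed average field length (hypothetical)

        // Hypothetical IDF weights: acer, campestre, and a weakly weighted rozi.
        double[] shortDocTerms = {4.0, 6.0};       // "Acer campestre" (2 tokens)
        double[] longDocTerms = {4.0, 6.0, 1.2};   // "Acer campestre 'Rozi'" (3 tokens)

        for (double b : new double[] {0.75, 0.3}) {
            double shortScore = docScore(shortDocTerms, 2.0, avgDocLen, k1, b);
            double longScore = docScore(longDocTerms, 3.0, avgDocLen, k1, b);
            System.out.printf("b=%.2f  short=%.4f  long=%.4f%n", b, shortScore, longScore);
        }
    }
}
```

With these made-up weights, the shorter document ranks first at the default b = 0.75, because its two shared terms are boosted by more than the extra 'rozi' match contributes. At b = 0.3 the longer, more complete match ranks first. In Lucene the corresponding knob is the second argument of new BM25Similarity(k1, b), typically set on both the IndexWriterConfig and the IndexSearcher.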