Re: Best fuzzy match on multiple terms

baris . kazar Fri, 14 Jun 2019 07:25:25 -0700

These are great suggestions; I was going to suggest looking at the explain plan of the query, too.

I really wonder why, in your case, the 'Rozi' entry does not get a higher score.

Is there any effect from the " ' " characters?

In my case I have sort of the reverse situation:

My query is maink~2 (mains was a special case that I am still investigating).

I would expect the second result below to be the first, as it is shorter and the closest hit, and the first result to be the second.

NASHUA in results: MAIN DUNSTABLE NASHUA HILLSBOROUGH NEW HAMPSHIRE UNITED STATES in the 0 th result
NASHUA in results: MAIN NASHUA HILLSBOROUGH NEW HAMPSHIRE UNITED STATES in the 1 th result



Best regards


On 6/14/19 6:45 AM, Matthias Müller wrote:

Hi Namgyu and Tomoko,

your hint towards Explanation was very helpful and I was not aware of
this feature.

I have now experimented with different scoring functions and it seems
that DFISimilarity and BM25Similarity (with lower 'b') produce results
in the direction I prefer, though not perfect for some cases [1].

The fuzzy term queries probably generate hardly predictable
similarities on additional fields. These add scores to the overall
result and also affect normalization.

Positively, the preferred matches are somewhere in the top ranks. So
maybe rule-based assessment of the top N hits might help me achieve
what I want.
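
A rule-based pass over the top N hits, as suggested above, could look
roughly like the following sketch. It is plain Java with no Lucene
dependency. The Hit class, the tokenize helper and the rule itself (hits
whose token set equals the query's token set move to the front, otherwise
the score order is kept) are assumptions made for illustration; none of
them come from this thread or from Lucene.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical post-processing step for the top N hits; not part of Lucene. */
final class TopHitReranker {

    /** Minimal stand-in for a search hit: stored value plus its score. */
    static final class Hit {
        final String value;
        final float score;

        Hit(String value, float score) {
            this.value = value;
            this.score = score;
        }
    }

    /** Rough approximation of tokenizing and lowercasing a simple name. */
    static Set<String> tokenize(String text) {
        Set<String> tokens = new LinkedHashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /**
     * Rule: hits whose token set equals the query's token set come first;
     * inside each group, hits stay ordered by descending score.
     */
    static List<Hit> rerank(String query, List<Hit> topHits) {
        Set<String> queryTokens = tokenize(query);
        Comparator<Hit> exactFirst =
                Comparator.comparing(h -> !tokenize(h.value).equals(queryTokens));
        Comparator<Hit> byScoreDesc = (a, b) -> Float.compare(b.score, a.score);

        List<Hit> result = new ArrayList<>(topHits);
        result.sort(exactFirst.thenComparing(byScoreDesc));
        return result;
    }
}
```

Calling TopHitReranker.rerank("Abelia xgrandiflora", hits) on the two hits
from footnote [1] would put the direct match first. Because the rule
compares exact tokens, it does not tolerate misspellings; a real version
would probably need a looser comparison.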


- Matthias


[1]:
"Abelia xgrandiflora" -> "Abelia xgrandiflora 'Wevo1' BELLA DONNA"
(score=13.7869625)
instead of the direct match
"Abelia xgrandiflora" -> "Abelia xgrandiflora" (score=13.74585)

On Friday, 14.06.2019, at 16:32 +0900, Tomoko Uchida wrote:

Hi Matthias,

What similarity class are you using?
Just a guess... but possibly one reason is document (field) length
normalization. Generally speaking, shorter documents get higher
scores than longer documents. (I saw that the classic TFIDF similarity
tends to give much higher scores to shorter documents. Newer versions
of Lucene use BM25 similarity as the default, which moderates that
tendency and has a tuning parameter 'b' to control the normalization
effect.)
See also:
https://www.elastic.co/guide/en/elasticsearch/guide/current/pluggable-similarites.html
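
To make the length-normalization effect concrete, here is a small sketch
in plain Java (no Lucene dependency) of the term-frequency part of the
textbook BM25 formula. The class name, method name and any numbers you
feed it are made up for illustration. Lucene's BM25Similarity follows the
same idea but differs in details such as how field lengths are stored, so
do not expect identical scores.

```java
/** Textbook BM25 term-frequency component; illustrative only, not Lucene's code. */
final class Bm25LengthNorm {

    /**
     * Computes tf * (k1 + 1) / (tf + k1 * (1 - b + b * fieldLength / avgFieldLength)).
     *
     * @param tf             term frequency in the field
     * @param fieldLength    number of tokens in the field
     * @param avgFieldLength average field length over the index
     * @param k1             term-frequency saturation parameter
     * @param b              length-normalization strength: 0 = none, 1 = full
     */
    static double lengthNormalizedTf(double tf, double fieldLength,
                                     double avgFieldLength, double k1, double b) {
        double norm = 1.0 - b + b * (fieldLength / avgFieldLength);
        return tf * (k1 + 1.0) / (tf + k1 * norm);
    }
}
```

With b = 0 a two-token field and a five-token field get the same value
for the same term frequency. As b grows towards 1, the shorter field is
favoured more strongly, which is the tendency described above.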

As Namgyu Kim said, the explain() API could help you examine the
details.

Tomoko

On Fri, 14 Jun 2019 at 1:27, Namgyu Kim <[email protected]> wrote:

Dear Matthias,

First you need to know about Lucene's ranking concept.
Lucene's default ranking function is BM25, and the scores depend on the
state of your index.
(https://en.wikipedia.org/wiki/Okapi_BM25)
There can be many reasons.
One thing I can guess is that your index contains a lot of 'rozi' terms,
so the term carries little weight.
This effect is called IDF (Inverse Document Frequency).
Anyway, if you want to control the ranking in detail, you need to
understand the BM25 expression.
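
For reference, the textbook form of the BM25 expression, as given in the
Wikipedia article linked above, is roughly the following. Lucene's
implementation may differ in constant factors and in how it encodes field
lengths, so treat this as the general shape rather than Lucene's exact
code. Here f(q_i, D) is the frequency of query term q_i in document D,
|D| is the document length in tokens, avgdl is the average document
length, N is the number of documents, and n(q_i) is the number of
documents containing q_i.

```latex
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1 \left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}

\mathrm{IDF}(q_i) = \ln\!\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)
```

A term that occurs in many documents has a large n(q_i) and therefore a
small IDF, so matching it adds little to the score.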

Lucene can also tell you how a score was computed.
The Explanation class can be used for that.
Sample code follows.
======================================
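// Assumes reader (IndexReader), q (Query) and hitsPerPage (int) already exist.
// IndexSearcher, TopDocs, ScoreDoc and Explanation are in org.apache.lucene.search.
IndexSearcher searcher = new IndexSearcher(reader);
TopDocs docs = searcher.search(q, hitsPerPage);
ScoreDoc[] hits = docs.scoreDocs;

for (int i = 0; i < hits.length; ++i) {
    int docId = hits[i].doc;
    // Ask Lucene how the score of this document was computed for query q
    Explanation explanation = searcher.explain(q, docId);
    System.out.println("Explanation : " + explanation);
}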
======================================

I hope it helps :D

Best regards,
Namgyu Kim

P.S. For BM25, the default value in Lucene is k1 = 1.2, b = 0.75.

On Fri, 14 Jun 2019 at 12:54 AM, <[email protected]> wrote:

I would suggest trying (indexing and searching) without the === ' ===
characters and seeing whether you can find it first.

Thanks


On 6/13/19 11:25 AM, Matthias Müller wrote:

I am currently matching botanic names (with possible misspellings)
against an indexed reference list with Lucene. After quick progress in
the beginning, I am struggling with the proper query design to achieve
the ranking I want.

Here is an example:

Search term: Acer campestre 'Rozi'

Tokenized (decomposed) representation:
acer
campestre
rozi

Top 10 hits:
{value=Acer campestre, score=12.288989}
{value=Acer campestre 'Rozi', score=11.955223} // <- why is it 2nd?
{value=Acer campestre 'Arends', score=10.640412}
{value=Acer campestre subsp. leiocarpon, score=10.640412}
{value=Acer campestre 'Carnival', score=10.640412}
{value=Acer campestre 'Commodore', score=10.640412}
{value=Acer campestre 'Nanum', score=10.640412}
{value=Acer campestre 'Elsrijk', score=10.640412}
{value=Acer campestre 'Fastigiatum', score=10.640412}
{value=Acer campestre 'Geessink', score=10.640412}]


And here is how I create my queries:

    final BooleanQuery.Builder builder = new BooleanQuery.Builder();
    // add individual tokens to query
    for (String token : fuzzyTokens) {
      final Term term = new Term(NAME_TOKENS.name(), token);
      final FuzzyQuery fq = new FuzzyQuery(term);
      builder.add(fq, BooleanClause.Occur.SHOULD);
    }
    return builder.build();
}


Input names are analyzed with a StandardTokenizer and Lowercase filter
when they are added to the IndexWriter.


My question: How can I get a ranking that scores
"Acer campestre 'Rozi'" higher than "Acer campestre"?
I am sure there is an obvious way to achieve this that I have so far
failed to find.


-Matthias


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
