Best fuzzy match on multiple terms

Matthias Müller Thu, 13 Jun 2019 08:26:05 -0700

I am currently matching botanical names (with possible misspellings)
against an indexed reference list with Lucene. After quick progress in
the beginning, I am struggling with the proper query design to achieve
the ranking I want.


Here is an example:

Search term: Acer campestre 'Rozi'

Tokenized (decomposed) representation:
acer
campestre
rozi

Top 10 hits:
{value=Acer campestre, score=12.288989}
{value=Acer campestre 'Rozi', score=11.955223} // <- why is it 2nd?
{value=Acer campestre 'Arends', score=10.640412}
{value=Acer campestre subsp. leiocarpon, score=10.640412}
{value=Acer campestre 'Carnival', score=10.640412}
{value=Acer campestre 'Commodore', score=10.640412}
{value=Acer campestre 'Nanum', score=10.640412}
{value=Acer campestre 'Elsrijk', score=10.640412}
{value=Acer campestre 'Fastigiatum', score=10.640412}
{value=Acer campestre 'Geessink', score=10.640412}


And here is how I create my queries:

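// enclosing method (name and parameter type are placeholders)
private Query buildQuery(List<String> fuzzyTokens) {
  final BooleanQuery.Builder builder = new BooleanQuery.Builder();
  // add each token as an optional (SHOULD) fuzzy clause
  for (String token : fuzzyTokens) {
    final Term term = new Term(NAME_TOKENS.name(), token);
    // FuzzyQuery(Term) uses the default maximum edit distance of 2
    final FuzzyQuery fq = new FuzzyQuery(term);
    builder.add(fq, BooleanClause.Occur.SHOULD);
  }
  return builder.build();
}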


Input names are analyzed with a StandardTokenizer and a LowerCaseFilter
when they are added to the IndexWriter.
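
To make the decomposition shown earlier concrete, here is a rough plain-Java
sketch of that analysis step. It does not use Lucene: it only lowercases the
name and splits on anything that is not a letter or digit, which approximates
what the StandardTokenizer plus lowercasing produces for names like the one
above. The class name NameTokens and the method tokenize are made up for this
illustration, and the real tokenizer follows Unicode word-break rules, so
results may differ for other inputs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class NameTokens {

    // Approximation only (an assumption, not Lucene's algorithm):
    // lowercase the input, then split on runs of characters that are
    // neither letters nor digits, dropping empty fragments.
    static List<String> tokenize(String name) {
        List<String> tokens = new ArrayList<>();
        String lower = name.toLowerCase(Locale.ROOT);
        for (String part : lower.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [acer, campestre, rozi]
        System.out.println(tokenize("Acer campestre 'Rozi'"));
    }
}
```

The quotes around the cultivar name are dropped, so 'Rozi' is indexed and
queried as the bare token rozi.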


My question: How can I get a ranking that scores
"Acer campestre 'Rozi'" higher than "Acer campestre"?
I am sure there is an obvious way to achieve this that I have so far
failed to find.


-Matthias


