Looks like Juraj opened https://issues.apache.org/jira/browse/LUCENE-9568 (thanks!).
Mike McCandless

http://blog.mikemccandless.com


On Fri, Oct 2, 2020 at 12:03 PM Michael McCandless <[email protected]> wrote:

> Hi Juraj+,
>
> This indeed smells like a bug. FuzzyTermsEnum should never try to set a
> negative boost!
>
> Could you open an issue and open a PR (or attach a patch) with your test
> case? Thank you for boiling this down. This part really made me chuckle:
>
> > When our text contains an apostrophe followed by a single character AND
> > our search query is composed of exactly two letters followed by
> > proximity search AND we use highlighting, we get an exception:
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Oct 1, 2020 at 12:48 PM Michael Sokolov <[email protected]> wrote:
>
>> I traced this to this block in FuzzyTermsEnum:
>>
>>   if (ed == 0) { // exact match
>>     boostAtt.setBoost(1.0F);
>>   } else {
>>     final int codePointCount = UnicodeUtil.codePointCount(term);
>>     int minTermLength = Math.min(codePointCount, termLength);
>>
>>     float similarity = 1.0f - (float) ed / (float) minTermLength;
>>     boostAtt.setBoost(similarity);
>>   }
>>
>> where in your test ed (edit distance) was 2 and minTermLength 1,
>> leading to negative boost.
>>
>> I don't really understand this code at all, but I wonder if it should
>> divide by maxTermLength instead of minTermLength?
>>
>> On Thu, Oct 1, 2020 at 9:54 AM Juraj Jurčo <[email protected]> wrote:
>> >
>> > Hi guys,
>> > we are trying to implement search and we have experienced a strange
>> > situation. When our text contains an apostrophe followed by a single
>> > character AND our search query is composed of exactly two letters
>> > followed by proximity search AND we use highlighting, we get an exception:
>> >
>> >> java.lang.IllegalArgumentException: boost must be a positive float, got -1.0
>> >
>> > It seems there is a problem at FuzzyTermsEnum.java:271 (float
>> > similarity = 1.0f - (float) ed / (float) minTermLength) when it reaches it
>> > with ed=2 and it sets a negative boost.
>> >
>> > I was able to reproduce the error with the following code:
>> >
>> > import java.io.IOException;
>> > import java.nio.file.Path;
>> >
>> > import org.apache.commons.io.FileUtils;
>> > import org.apache.lucene.analysis.Analyzer;
>> > import org.apache.lucene.analysis.TokenStream;
>> > import org.apache.lucene.analysis.core.SimpleAnalyzer;
>> > import org.apache.lucene.document.Document;
>> > import org.apache.lucene.document.Field;
>> > import org.apache.lucene.document.TextField;
>> > import org.apache.lucene.index.IndexWriter;
>> > import org.apache.lucene.index.IndexWriterConfig;
>> > import org.apache.lucene.queryparser.classic.ParseException;
>> > import org.apache.lucene.queryparser.classic.QueryParser;
>> > import org.apache.lucene.search.Query;
>> > import org.apache.lucene.search.highlight.Highlighter;
>> > import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
>> > import org.apache.lucene.search.highlight.QueryScorer;
>> > import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
>> > import org.apache.lucene.search.highlight.TokenSources;
>> > import org.apache.lucene.store.Directory;
>> > import org.apache.lucene.store.FSDirectory;
>> > import org.junit.jupiter.api.Test;
>> >
>> > class FindSqlHighlightTest {
>> >
>> >   @Test
>> >   void reproduceHighlightProblem() throws IOException, ParseException, InvalidTokenOffsetsException {
>> >     String text = "doesn't";
>> >     String field = "text";
>> >     //NOK: se~, se~2 and any higher number
>> >     //OK: sel~, s~, se~1
>> >     String uQuery = "se~";
>> >     int maxStartOffset = -1;
>> >     Analyzer analyzer = new SimpleAnalyzer();
>> >
>> >     Path indexLocation = Path.of("temp", "reproduceHighlightProblem").toAbsolutePath();
>> >     if (indexLocation.toFile().exists()) {
>> >       FileUtils.deleteDirectory(indexLocation.toFile());
>> >     }
>> >     Directory indexDir = FSDirectory.open(indexLocation);
>> >
>> >     //Create index
>> >     IndexWriterConfig dimsIndexWriterConfig = new IndexWriterConfig(analyzer);
>> >     dimsIndexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
>> >     IndexWriter idxWriter = new IndexWriter(indexDir, dimsIndexWriterConfig);
>> >     //add doc
>> >     Document doc = new Document();
>> >     doc.add(new TextField(field, text, Field.Store.NO));
>> >     idxWriter.addDocument(doc);
>> >     //commit
>> >     idxWriter.commit();
>> >     idxWriter.close();
>> >
>> >     //search & highlight
>> >     Query query = new QueryParser(field, analyzer).parse(uQuery);
>> >     Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter(), new QueryScorer(query));
>> >     TokenStream tokenStream = TokenSources.getTokenStream(field, null, text, analyzer, maxStartOffset);
>> >     String highlighted = highlighter.getBestFragment(tokenStream, text);
>> >     System.out.println(highlighted);
>> >   }
>> > }
>> >
>> >
>> > Could you please confirm whether it's a bug in Lucene or whether we do
>> > something that is not allowed?
>> >
>> > Thanks a lot!
>> > Best,
>> > Juraj+
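
For anyone following along, here is a minimal standalone sketch of the arithmetic discussed above. It has no Lucene dependency. The class name FuzzyBoostSketch and both method names are made up for illustration, and the max-length variant only mirrors the idea floated in the thread; it is not presented as the fix that went into LUCENE-9568. The lengths assume that SimpleAnalyzer splits "doesn't" into "doesn" and "t", so the one-letter token "t" is the candidate compared against the two-letter query term "se".

```java
class FuzzyBoostSketch {

    /** Same arithmetic as the block quoted from FuzzyTermsEnum: divide by the shorter length. */
    static float similarityByMinLength(int ed, int candidateLength, int queryLength) {
        int minTermLength = Math.min(candidateLength, queryLength);
        return 1.0f - (float) ed / (float) minTermLength;
    }

    /** Hypothetical variant: divide by the longer length, as suggested in the thread. */
    static float similarityByMaxLength(int ed, int candidateLength, int queryLength) {
        int maxTermLength = Math.max(candidateLength, queryLength);
        return 1.0f - (float) ed / (float) maxTermLength;
    }

    public static void main(String[] args) {
        int ed = 2;              // edit distance reported in the thread
        int candidateLength = 1; // assumed: the token "t" left over from "doesn't"
        int queryLength = 2;     // the query term "se"

        // 1 - 2/1 = -1.0, the value from the exception message
        System.out.println(similarityByMinLength(ed, candidateLength, queryLength));
        // 1 - 2/2 = 0.0
        System.out.println(similarityByMaxLength(ed, candidateLength, queryLength));
    }
}
```

Since an edit distance can never exceed the length of the longer of the two strings, the max-length variant cannot go below zero, whereas the min-length form goes negative whenever ed is larger than the shorter term. Whether a boost of exactly 0 is acceptable to the boost attribute is a separate question that this sketch does not answer.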
