Re: Highlight with Proximity search throws an exception

Michael McCandless Fri, 09 Oct 2020 08:23:33 -0700

Looks like Juraj opened https://issues.apache.org/jira/browse/LUCENE-9568
(thanks!).


Mike McCandless

http://blog.mikemccandless.com


On Fri, Oct 2, 2020 at 12:03 PM Michael McCandless <
[email protected]> wrote:

> Hi Juraj+,
>
> This indeed smells like a bug.  FuzzyTermsEnum should never try to set a
> negative boost!
>
> Could you open an issue and open a PR (or attach a patch) with your test
> case?  Thank you for boiling this down.  This part really made me chuckle:
>
> > When our text contains an apostrophe followed by a single character AND
> > our search query is composed of exactly two letters followed by
> > proximity search AND we use highlighting, we get an exception:
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Oct 1, 2020 at 12:48 PM Michael Sokolov <[email protected]>
> wrote:
>
>> I traced this to this block in FuzzyTermsEnum:
>>
>>     if (ed == 0) { // exact match
>>       boostAtt.setBoost(1.0F);
>>     } else {
>>       final int codePointCount = UnicodeUtil.codePointCount(term);
>>       int minTermLength = Math.min(codePointCount, termLength);
>>
>>       float similarity = 1.0f - (float) ed / (float) minTermLength;
>>       boostAtt.setBoost(similarity);
>>     }
>>
>> where in your test ed (edit distance) was 2 and minTermLength 1,
>> leading to negative boost.
>>
>> I don't really understand this code at all, but I wonder if it should
>> divide by maxTermLength instead of minTermLength?
>>
>> On Thu, Oct 1, 2020 at 9:54 AM Juraj Jurčo <[email protected]> wrote:
>> >
>> > Hi guys,
>> > we are trying to implement search and we have experienced a strange
>> > situation. When our text contains an apostrophe followed by a single
>> > character AND our search query is composed of exactly two letters
>> > followed by proximity search AND we use highlighting, we get an exception:
>> >
>> >> java.lang.IllegalArgumentException: boost must be a positive float, got -1.0
>> >
>> >
>> > The problem seems to be at FuzzyTermsEnum.java:271 (float similarity =
>> > 1.0f - (float) ed / (float) minTermLength): when execution reaches that
>> > line with ed=2, it sets a negative boost.
>> >
>> > I was able to reproduce the error with the following code:
>> >
>> > import java.io.IOException;
>> > import java.nio.file.Path;
>> >
>> > import org.apache.commons.io.FileUtils;
>> > import org.apache.lucene.analysis.Analyzer;
>> > import org.apache.lucene.analysis.TokenStream;
>> > import org.apache.lucene.analysis.core.SimpleAnalyzer;
>> > import org.apache.lucene.document.Document;
>> > import org.apache.lucene.document.Field;
>> > import org.apache.lucene.document.TextField;
>> > import org.apache.lucene.index.IndexWriter;
>> > import org.apache.lucene.index.IndexWriterConfig;
>> > import org.apache.lucene.queryparser.classic.ParseException;
>> > import org.apache.lucene.queryparser.classic.QueryParser;
>> > import org.apache.lucene.search.Query;
>> > import org.apache.lucene.search.highlight.Highlighter;
>> > import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
>> > import org.apache.lucene.search.highlight.QueryScorer;
>> > import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
>> > import org.apache.lucene.search.highlight.TokenSources;
>> > import org.apache.lucene.store.Directory;
>> > import org.apache.lucene.store.FSDirectory;
>> > import org.junit.jupiter.api.Test;
>> >
>> > class FindSqlHighlightTest {
>> >
>> >    @Test
>> >    void reproduceHighlightProblem()
>> >          throws IOException, ParseException, InvalidTokenOffsetsException {
>> >       String text = "doesn't";
>> >       String field = "text";
>> >       //NOK: se~, se~2 and any higher number
>> >       //OK: sel~, s~, se~1
>> >       String uQuery = "se~";
>> >       int maxStartOffset = -1;
>> >       Analyzer analyzer = new SimpleAnalyzer();
>> >
>> >       Path indexLocation =
>> >             Path.of("temp", "reproduceHighlightProblem").toAbsolutePath();
>> >       if (indexLocation.toFile().exists()) {
>> >          FileUtils.deleteDirectory(indexLocation.toFile());
>> >       }
>> >       Directory indexDir = FSDirectory.open(indexLocation);
>> >
>> >       //Create index
>> >       IndexWriterConfig dimsIndexWriterConfig = new IndexWriterConfig(analyzer);
>> >       dimsIndexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
>> >       IndexWriter idxWriter = new IndexWriter(indexDir, dimsIndexWriterConfig);
>> >       //add doc
>> >       Document doc = new Document();
>> >       doc.add(new TextField(field, text, Field.Store.NO));
>> >       idxWriter.addDocument(doc);
>> >       //commit
>> >       idxWriter.commit();
>> >       idxWriter.close();
>> >
>> >       //search & highlight
>> >       Query query = new QueryParser(field, analyzer).parse(uQuery);
>> >       Highlighter highlighter =
>> >             new Highlighter(new SimpleHTMLFormatter(), new QueryScorer(query));
>> >       TokenStream tokenStream =
>> >             TokenSources.getTokenStream(field, null, text, analyzer, maxStartOffset);
>> >       String highlighted = highlighter.getBestFragment(tokenStream, text);
>> >       System.out.println(highlighted);
>> >    }
>> > }
>> >
>> >
>> > Could you please confirm whether it's a bug in Lucene or whether we are
>> > doing something that is not allowed?
>> >
>> > Thanks a lot!
>> > Best,
>> > Juraj+
>>
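
The arithmetic Michael Sokolov traced can be checked outside Lucene. The
sketch below is standalone Java, not Lucene code: the class and method names
are hypothetical, and the lengths (a one-code-point indexed term and a
two-letter query term) are assumptions chosen to match the reported ed = 2 and
minTermLength = 1. It compares the quoted formula with the maxTermLength
variant he wondered about, and makes no claim about how LUCENE-9568 was
eventually resolved.

```java
public class FuzzyBoostSketch {

    // Same arithmetic as the quoted FuzzyTermsEnum block:
    // similarity = 1 - ed / min(termLength, queryLength).
    public static float similarityByMin(int ed, int termLength, int queryLength) {
        int minTermLength = Math.min(termLength, queryLength);
        return 1.0f - (float) ed / (float) minTermLength;
    }

    // Variant raised as a question in the thread (an assumption, not a
    // confirmed fix): divide by the longer of the two lengths instead.
    public static float similarityByMax(int ed, int termLength, int queryLength) {
        int maxTermLength = Math.max(termLength, queryLength);
        return 1.0f - (float) ed / (float) maxTermLength;
    }

    public static void main(String[] args) {
        int ed = 2;          // edit distance reported in the thread
        int termLength = 1;  // assumed: indexed term of one code point
        int queryLength = 2; // assumed: two-letter query term, as in "se~"

        System.out.println(similarityByMin(ed, termLength, queryLength)); // -1.0
        System.out.println(similarityByMax(ed, termLength, queryLength)); // 0.0
    }
}
```

An edit distance can never exceed the length of the longer string, so dividing
by the larger length keeps the result between 0 and 1 for any pair of terms.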
