Why would a search using a ComplexPhraseQueryParser throw an exception for some content, but not all content?

Shifflett, David [USA] Tue, 17 Aug 2021 08:22:24 -0700

I am using Lucene 8.2, but have also verified this on 8.9.

My query string is either "by~1 word~1" or "ky~1 word~1".


I am looking for a phrase made up of these two words, allowing a
one-character misspelling (fuzziness) in each word.

I realize that 'by' is usually a stop word, which is why I also tested with 'ky'.

My simplified test content is one of "AC-2.b word", "AC-2.k word", or
"AC-2.y word".

The first part of the test content is pulled from actual data my customers are 
trying to search.

For the query with 'by~1', the exception occurs if the content has '.b' or '.y',
but not '.k'.

For the query with 'ky~1', the exception occurs if the content has '.k' or '.y',
but not '.b'.

Here is the test code:
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.analysis.tokenattributes.*;
import org.apache.lucene.analysis.util.*;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.complexPhrase.ComplexPhraseQueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;

public class phraseTest {

    public static Analyzer analyzer = new StandardAnalyzer();
    public static IndexWriterConfig config = new IndexWriterConfig(
            analyzer);
    public static RAMDirectory ramDirectory = new RAMDirectory();
    public static IndexWriter indexWriter;
    public static Query queryToSearch = null;
    public static IndexReader idxReader;
    public static IndexSearcher idxSearcher;
    public static TopDocs hits;
    public static String query_field = "Content";

    // Pick only one content string
    // public static String content = "AC-2.b word";
    public static String content = "AC-2.k word";
    // public static String content = "AC-2.y word";

    // Pick only one query string
    // public static String queryString = "\"by~1 word~1\"";
    public static String queryString = "\"ky~1 word~1\"";

    @SuppressWarnings("deprecation")
    public static void main(String[] args) throws IOException {

        System.out.println("Content           is\n  " + content);
        System.out.println("Query field       is " + query_field);
        System.out.println("Query String      is '" + queryString + "'");

        Document doc = new Document(); // create a new document

        /**
         * Create a field with term vector enabled
         */
        FieldType type = new FieldType();
        type.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
        type.setStored(true);
        type.setStoreTermVectors(true);
        type.setTokenized(true);
        type.setStoreTermVectorOffsets(true);

        //term vector enabled
        Field cField = new Field(query_field, content, type);
        doc.add(cField);

        try {
            indexWriter = new IndexWriter(ramDirectory, config);
            indexWriter.addDocument(doc);
            indexWriter.close();

            idxReader = DirectoryReader.open(ramDirectory);
            idxSearcher = new IndexSearcher(idxReader);
            ComplexPhraseQueryParser qp =
                new ComplexPhraseQueryParser(query_field, analyzer);
            queryToSearch = qp.parse(queryString);

            // Here is where the searching, etc starts
            hits = idxSearcher.search(queryToSearch, idxReader.maxDoc());
            System.out.println("scoreDoc size: " + hits.scoreDocs.length);

            // highlight the hits ...

        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (ParseException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

    }
}

Here is the exception (using Lucene 8.2):


Exception in thread "main" java.lang.IllegalArgumentException: Unknown query type "org.apache.lucene.search.ConstantScoreQuery" found in phrase query string "ky~1 word~1"
    at org.apache.lucene.queryparser.complexPhrase.ComplexPhraseQueryParser$ComplexPhraseQuery.rewrite(ComplexPhraseQueryParser.java:325)
    at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:666)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:439)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:564)
    at org.apache.lucene.search.IndexSearcher.searchAfter(IndexSearcher.java:416)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:427)
    at phraseTest.main(phraseTest.java:79)


Am I using ComplexPhraseQueryParser incorrectly?

Is this a bug in Lucene?

I have also tested this with a query string like "dog~2 word~1".
This causes the same exception if the content has '.d', '.o', or '.g'.

It looks like a fuzzy term that can reduce to a single character runs into
trouble when the content contains a matching single-character term.
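To make that guess concrete, here is a minimal plain-Java sketch of the edit-distance arithmetic behind it. It does not use Lucene at all. The class name EditDistanceSketch and its distance method are made up for illustration, and it implements plain Levenshtein distance rather than whatever Lucene's fuzzy matching does internally. It also assumes that the analyzer emits the trailing letter of "AC-2.b" as its own single-character term, which I have not verified.

```java
// Hypothetical helper for illustration only; not part of Lucene.
class EditDistanceSketch {

    // Classic Levenshtein distance (insert, delete, substitute),
    // computed with two rolling rows of the dynamic-programming table.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // Query terms and fuzziness values taken from the post.
        String[] queryTerms = {"by", "ky", "dog"};
        int[] maxEdits = {1, 1, 2};
        // Assumed single-character terms left over from "AC-2.x".
        String[] contentTerms = {"b", "k", "y", "d", "o", "g"};

        for (int q = 0; q < queryTerms.length; q++) {
            for (String term : contentTerms) {
                int d = distance(queryTerms[q], term);
                String verdict = d <= maxEdits[q] ? "within range" : "out of range";
                System.out.println(queryTerms[q] + "~" + maxEdits[q]
                        + " vs '" + term + "': distance " + d + " (" + verdict + ")");
            }
        }
    }
}
```

Under those assumptions, "by" is one edit away from "b" and "y" but two edits away from "k", "ky" is one edit away from "k" and "y" but two away from "b", and "dog" is two edits away from each of "d", "o", and "g". Those are exactly the content variants that trigger the exception, so the failing cases line up with the ones where the fuzzy term has a single-character match in the index.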

Thanks in advance for any suggestions, or guidance,

David Shifflett
