Re: Antw: Re: Behaviour of punctuation marks in phrase queries

Michael Gibney Fri, 17 May 2019 14:09:01 -0700

After further reflection, I think that upgrading to 8.1 (LUCENE-8730)
would not actually help in this case. It doesn't matter whether "a.b."
or "ab" is indexed or evaluated first: both have an implied
positionLength of 1 (as read from the index at query time), and both
are evaluated before ("a" "b"). That leaves the impression of a gap
between tokens, so the match is missed.
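
To make the effect concrete, here is a toy model in Python. It is not
Lucene code: the postings, the function names, and the simplified
matching loop are all assumptions made for illustration, using the
"Moosburg a.d. Isar" example from the original report. It only mimics
two things: every term is read back with an implied positionLength of
1, and the ordered matcher makes a single attempt per start position
of the first clause.

```python
from itertools import product

# Hypothetical postings for the indexed text "Moosburg a.d. Isar" after
# index-time WDGF (term -> positions). positionLength is not stored, so the
# catenated "ad" looks one position wide even though it covers "a" and "d".
POSTINGS = {"moosburg": [0], "ad": [1], "a": [1], "d": [2], "isar": [3]}


def term_spans(postings, term):
    """Spans [start, end) for one term, with an implied positionLength of 1."""
    return [(p, p + 1) for p in postings.get(term, [])]


def phrase_spans(postings, terms):
    """Spans for an exact, ordered run of terms such as ("a", "d")."""
    return [
        (start, start + len(terms))
        for start in postings.get(terms[0], [])
        if all(start + i in postings.get(t, []) for i, t in enumerate(terms))
    ]


def or_spans(*alternatives):
    """Union of alternatives, sorted by (start, end): shortest span first."""
    return sorted(span for alt in alternatives for span in alt)


def lazy_ordered_match(clauses):
    """slop=0 match that tries only the first candidate span of each later
    clause for every start position of the first clause."""
    for first in clauses[0]:
        end = first[1]
        matched = True
        for spans in clauses[1:]:
            candidate = next((s for s in spans if s[0] >= end), None)
            if candidate is None or candidate[0] != end:
                matched = False  # gap (or nothing left): give up on this start
                break
            end = candidate[1]
        if matched:
            return True
    return False


def exhaustive_ordered_match(clauses):
    """slop=0 match that explores every combination of candidate spans."""
    return any(
        all(a[1] == b[0] for a, b in zip(combo, combo[1:]))
        for combo in product(*clauses)
    )


def query_clauses(postings, variants):
    """Clauses for the query: moosburg (<variants> | "a" "d") isar."""
    middle = or_spans(
        *(term_spans(postings, v) for v in variants),
        phrase_spans(postings, ("a", "d")),
    )
    return [term_spans(postings, "moosburg"), middle, term_spans(postings, "isar")]


clauses = query_clauses(POSTINGS, ["ad"])
print(lazy_ordered_match(clauses))        # False: "ad" is tried first, gap before "isar"
print(exhaustive_ordered_match(clauses))  # True: ("a" "d") lines up with "isar"

# preserveOriginal=true would add "a.d." at the same position, still width 1
with_original = {**POSTINGS, "a.d.": [1]}
print(lazy_ordered_match(query_clauses(with_original, ["a.d.", "ad"])))  # False
```

In this model the preserved original only adds one more width-1
candidate that sorts ahead of ("a" "d"), so the lazy matcher still
gives up before reaching the variant that would line up with "isar".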


On Fri, May 17, 2019 at 12:29 PM Michael Gibney
<mich...@michaelgibney.net> wrote:
>
> The SpanNearQuery in association with "a.b." input and WDGF is
> expected behavior, since WDGF causes the query to search ("ab")|("a"
> "b"), as 1 or 2 tokens, respectively. The "a. b." input
> (whitespace-separated) is tokenized simply as "a" "b" (2 tokens) so
> sticks with the more straightforward PhraseQuery implementation.
>
> That said, the problem you're encountering is related to a couple of issues:
> https://issues.apache.org/jira/browse/LUCENE-7398
> https://issues.apache.org/jira/browse/LUCENE-4312
>
> For this case specifically, the problem is that NearSpansOrdered
> lazily returns one match per position *for the first subclause*. The
> or clause ("ab"|"a" "b"), because positionLength is not indexed, will
> always return "ab" first (implicit positionLength of 1). Again because
> "ab"'s actual positionLength of 2 from index-time WDGF is not stored
> in the index, the implicit positionLength of 1 at query-time gives the
> impression of a gap between "ab" and "isar", violating the "slop=0"
> constraint.
>
> Because NearSpansOrdered.nextStartPosition() always advances by
> calling nextStartPosition() on the first subclause (without exploring
> for variant matches in other subclauses), the top-level
> NearSpansOrdered advances after one attempt at matching, and the valid
> match is missed.
>
> Pending fixes to address the underlying issue (there is a candidate
> patch for LUCENE-7398 that incorporates a workaround for LUCENE-4312),
> you could mitigate the problem to some extent by either forcing slop>0
> (which as of 7.6 will be expanded into MultiPhraseQuery -- see
> https://issues.apache.org/jira/browse/LUCENE-8531), or you could set
> preserveOriginal=true on both index-time and query-time WDGF and
> upgrade to 8.1 (which would prevent the extreme case of an *exact*
> character-for-character matching query turning up no results -- see
> https://issues.apache.org/jira/browse/LUCENE-8730).
>
> On Fri, May 17, 2019 at 11:47 AM Erick Erickson <erickerick...@gmail.com> 
> wrote:
> >
> > I’ll leave that explanation to someone who understands query parsers ;)
> >
> > > On May 17, 2019, at 7:57 AM, Doris Peter <doris.pe...@bsb-muenchen.de> 
> > > wrote:
> > >
> > > Thanks a lot! I tried the debug parameter, which shows interesting 
> > > differences:
> > >
> > > "debug": {
> > >
> > >    "rawquerystring": "all_places_txt:\"Neuburg a. d. Donau\"",
> > >    "querystring": "all_places_txt:\"Neuburg a. d. Donau\"",
> > >    "parsedquery": "PhraseQuery(all_places_txt:\"neuburg a d donau\")",
> > >    "parsedquery_toString": "all_places_txt:\"neuburg a d donau\"",
> > >    "QParser": "LuceneQParser"
> > > }
> > >
> > > "debug": {
> > >        "rawquerystring": "all_places_txt:\"Neuburg a.d. Donau\"",
> > >        "querystring": "all_places_txt:\"Neuburg a.d. Donau\"",
> > >        "parsedquery": "SpanNearQuery(spanNear([all_places_txt:neuburg, 
> > > spanOr([all_places_txt:ad, spanNear([all_places_txt:a, all_places_txt:d], 
> > > 0, true)]), all_places_txt:donau], 0, true))",
> > >        "parsedquery_toString": "spanNear([all_places_txt:neuburg, 
> > > spanOr([all_places_txt:ad, spanNear([all_places_txt:a, all_places_txt:d], 
> > > 0, true)]), all_places_txt:donau], 0, true)",
> > >        "QParser": "LuceneQParser"
> > >    }
> > >
> > >
> > > Something seems to go wrong here, as the parsedquery contains the 
> > > SpanNearQuery instead of a PhraseQuery.
> > >
> > >
> > >>>> Erick Erickson <erickerick...@gmail.com> 5/17/2019 4:27 PM >>>
> > > Three things:
> > >
> > > 1> WordDelimiterGraphFilterFactory requires FlattenGraphFilterFactory 
> > > after it in the index config
> > >
> > > 2> It is usually unnecessary to have the exact same parameters at both 
> > > query and index time for WDGFF. If you’ve split parts up at index time 
> > > then mashed them all back together, you usually only need to split them 
> > > up at query time.
> > >
> > > 3> Try adding &debug=query to the query and see what the results show for 
> > > the parsed query. That usually gives you a clue what is really happening 
> > > vs. what you think is happening.
> > >
> > > Best,
> > > Erick
> > >
> > >> On May 17, 2019, at 12:59 AM, Doris Peter <doris.pe...@bsb-muenchen.de> 
> > >> wrote:
> > >>
> > >> Hello,
> > >>
> > >> We use Solr 7.6.0 to build our index, and I have a question about
> > >> phrase queries:
> > >>
> > >> We use the following configuration in schema.xml:
> > >>
> > >>   <!-- Text Standard -->
> > >>   <fieldType name="text" class="solr.TextField"
> > >> positionIncrementGap="1000" sortMissingLast="true"
> > >> autoGeneratePhraseQueries="true">
> > >>     <analyzer type="index">
> > >>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >>       <charFilter class="solr.MappingCharFilterFactory"
> > >> mapping="mapping-FoldToASCII.txt"/>
> > >>       <filter class="solr.CJKBigramFilterFactory"/>
> > >>       <filter class="solr.WordDelimiterGraphFilterFactory"
> > >> protected="protectedword.txt"
> > >>            preserveOriginal="0" splitOnNumerics="1"
> > >> splitOnCaseChange="0"
> > >>            catenateWords="1" catenateNumbers="1" catenateAll="1"
> > >>            generateWordParts="1" generateNumberParts="1"
> > >> stemEnglishPossessive="1"
> > >>            types="wdfftypes.txt" />
> > >>       <filter class="solr.LengthFilterFactory" min="1"
> > >> max="2147483647"/>
> > >>       <filter class="solr.LowerCaseFilterFactory"/>
> > >>     </analyzer>
> > >>     <analyzer type="query">
> > >>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >>       <charFilter class="solr.MappingCharFilterFactory"
> > >> mapping="mapping-FoldToASCII.txt"/>
> > >>       <filter class="solr.CJKBigramFilterFactory"/>
> > >>       <filter class="solr.WordDelimiterGraphFilterFactory"
> > >> protected="protectedword.txt"
> > >>            preserveOriginal="0" splitOnNumerics="1"
> > >> splitOnCaseChange="0"
> > >>            catenateWords="1" catenateNumbers="1" catenateAll="1"
> > >>            generateWordParts="1" generateNumberParts="1"
> > >> stemEnglishPossessive="1"
> > >>            types="wdfftypes.txt" />
> > >>       <filter class="solr.LengthFilterFactory" min="1"
> > >> max="2147483647"/>
> > >>       <filter class="solr.LowerCaseFilterFactory"/>
> > >>     </analyzer>
> > >>   </fieldType>
> > >>
> > >>
> > >>   If we search for a phrase like "Moosburg a.d. Isar" we don't get a
> > >> match, though it's definitely in our Index.
> > >>   If we search for "Moosburg a. d. Isar" with a blank between "a."
> > >> and "d." we get a match.
> > >>
> > >>   This also happens for other non-word characters, like ' or , for
> > >> example.
> > >>
> > >>   The strange thing is that the Solr Analysis tool reports a match
> > >> for the first version, but when we send a Solr query, we get no
> > >> result documents.
> > >>
> > >>   Does anyone have an idea what could be causing this?
> > >>
> > >>   Thank you very much in advance,
> > >>
> > >>   Doris Peter
> > >
> > >
> >
