After further reflection, I think that upgrading to 8.1 (LUCENE-8730) would actually not help in this case. It doesn't matter whether "a.b." or "ab" would be indexed or evaluated first; they'd both have implied positionLength 1 (as read from the index at query time), and would both be evaluated before ("a" "b"), leaving the impression of a gap between tokens, causing the match to be missed.
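To make that concrete, here is a toy Python model of the situation. It is not Lucene code, only a sketch under stated assumptions: the positions in INDEX are what I would expect index-time WDGF with preserveOriginal=true to produce for "Moosburg a.d. Isar", and the names term, span_or, near_exhaustive and near_lazy are made up for illustration. near_lazy mimics the behavior described below (one attempt per position of the first subclause, taking the first candidate from each later subclause); near_exhaustive tries every combination.

```python
from itertools import product

# Assumed postings for the indexed text "Moosburg a.d. Isar" with
# preserveOriginal=true. Positions are illustrative; positionLength is not
# stored, so every term is read back as the span [pos, pos + 1).
INDEX = {"moosburg": [0], "a.d.": [1], "ad": [1], "a": [1], "d": [2], "isar": [3]}


def term(t):
    """Spans for a single term, each with an implied positionLength of 1."""
    return [(p, p + 1) for p in INDEX.get(t, [])]


def span_or(*clauses):
    """Union of spans, ordered by start and then end, so shorter spans come first."""
    return sorted(s for c in clauses for s in c)


def near_exhaustive(clauses, slop=0):
    """Ordered near match that tries every combination of subclause spans."""
    out = set()
    for combo in product(*clauses):
        gaps = [b[0] - a[1] for a, b in zip(combo, combo[1:])]
        if all(g >= 0 for g in gaps) and sum(gaps) <= slop:
            out.add((combo[0][0], combo[-1][1]))
    return sorted(out)


def near_lazy(clauses, slop=0):
    """Ordered near match with one attempt per span of the first subclause."""
    out = []
    for first in clauses[0]:
        combo = [first]
        for clause in clauses[1:]:
            # Take the first span starting at or after the previous end;
            # variant spans at the same start are never explored.
            nxt = next((s for s in clause if s[0] >= combo[-1][1]), None)
            if nxt is None:
                break
            combo.append(nxt)
        else:
            if sum(b[0] - a[1] for a, b in zip(combo, combo[1:])) <= slop:
                out.append((combo[0][0], combo[-1][1]))
    return out


A_D = near_exhaustive([term("a"), term("d")])  # the ("a" "b")-style subphrase
CLAUSES = [term("moosburg"), span_or(term("a.d."), term("ad"), A_D), term("isar")]

print(near_lazy(CLAUSES))        # []
print(near_exhaustive(CLAUSES))  # [(0, 4)]
```

In this model both "a.d." and "ad" come back as the span (1, 2), ahead of the two-token span (1, 3), so the lazy matcher sees a one-position gap before "isar" and gives up. Adding the preserved original to the "or" clause changes nothing about that ordering.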
On Fri, May 17, 2019 at 12:29 PM Michael Gibney <mich...@michaelgibney.net> wrote:
>
> The SpanNearQuery in association with "a.b." input and WDGF is
> expected behavior, since WDGF causes the query to search
> ("ab")|("a" "b"), as 1 or 2 tokens, respectively. The "a. b." input
> (whitespace-separated) is tokenized simply as "a" "b" (2 tokens) so
> sticks with the more straightforward PhraseQuery implementation.
>
> That said, the problem you're encountering is related to a couple of issues:
> https://issues.apache.org/jira/browse/LUCENE-7398
> https://issues.apache.org/jira/browse/LUCENE-4312
>
> For this case specifically, the problem is that NearSpansOrdered
> lazily returns one match per position *for the first subclause*. The
> or clause ("ab"|"a" "b"), because positionLength is not indexed, will
> always return "ab" first (implicit positionLength of 1). Again because
> "ab"'s actual positionLength of 2 from index-time WDGF is not stored
> in the index, the implicit positionLength of 1 at query-time gives the
> impression of a gap between "ab" and "isar", violating the "slop=0"
> constraint.
>
> Because NearSpansOrdered.nextStartPosition() always advances by
> calling nextStartPosition() on the first subclause (without exploring
> for variant matches in other subclauses), the top-level
> NearSpansOrdered advances after one attempt at matching, and the valid
> match is missed.
>
> Pending fixes to address the underlying issue (there is a candidate
> patch for LUCENE-7398 that incorporates a workaround for LUCENE-4312),
> you could mitigate the problem to some extent by either forcing slop>0
> (which as of 7.6 will be expanded into MultiPhraseQuery -- see
> https://issues.apache.org/jira/browse/LUCENE-8531), or you could set
> preserveOriginal=true on both index-time and query-time WDGF and
> upgrade to 8.1 (which would prevent the extreme case of an *exact*
> character-for-character matching query turning up no results -- see
> https://issues.apache.org/jira/browse/LUCENE-8730).
>
> On Fri, May 17, 2019 at 11:47 AM Erick Erickson <erickerick...@gmail.com> wrote:
> >
> > I’ll leave that explanation to someone who understands query parsers ;)
> >
> > > On May 17, 2019, at 7:57 AM, Doris Peter <doris.pe...@bsb-muenchen.de> wrote:
> > >
> > > Thanks a lot! I tried the debug parameter, which shows interesting
> > > differences:
> > >
> > > debug": {
> > >   "rawquerystring": "all_places_txt:\"Neuburg a. d. Donau\"",
> > >   "querystring": "all_places_txt:\"Neuburg a. d. Donau\"",
> > >   "parsedquery": "PhraseQuery(all_places_txt:\"neuburg a d donau\")",
> > >   "parsedquery_toString": "all_places_txt:\"neuburg a d donau\"",
> > >   "QParser": "LuceneQParser"
> > > }
> > >
> > > debug": {
> > >   "rawquerystring": "all_places_txt:\"Neuburg a.d. Donau\"",
> > >   "querystring": "all_places_txt:\"Neuburg a.d. Donau\"",
> > >   "parsedquery": "SpanNearQuery(spanNear([all_places_txt:neuburg, spanOr([all_places_txt:ad, spanNear([all_places_txt:a, all_places_txt:d], 0, true)]), all_places_txt:donau], 0, true))",
> > >   "parsedquery_toString": "spanNear([all_places_txt:neuburg, spanOr([all_places_txt:ad, spanNear([all_places_txt:a, all_places_txt:d], 0, true)]), all_places_txt:donau], 0, true)",
> > >   "QParser": "LuceneQParser"
> > > }
> > >
> > > Something seems to go wrong here, as the parsedquery contains the
> > > SpanNearQuery instead of a PhraseQuery.
> > >
> > > >>> Erick Erickson <erickerick...@gmail.com> 5/17/2019 4:27 PM >>>
> > > Three things:
> > >
> > > 1> WordDelimiterGraphFilterFactory requires FlattenGraphFilterFactory
> > > after it in the index config
> > >
> > > 2> It is usually unnecessary to have the exact same parameters at both
> > > query and index time for WDGFF. If you’ve split parts up at index time
> > > then mashed them all back together, you can usually only split them up at
> > > query time.
> > >
> > > 3> try adding &debug=query to the query and see what the results show for
> > > the parsed query. That usually gives you a clue what is really happening
> > > .vs. what you think is happening.
> > >
> > > Best,
> > > Erick
> > >
> > >> On May 17, 2019, at 12:59 AM, Doris Peter <doris.pe...@bsb-muenchen.de> wrote:
> > >>
> > >> Hello,
> > >>
> > >> We use Solr 7.6.0 to build our index, and I have got a Question about
> > >> Phrase Queries:
> > >>
> > >> We use the following configuration in schema.xml:
> > >>
> > >> <!-- Text Standard -->
> > >> <fieldType name="text" class="solr.TextField"
> > >>            positionIncrementGap="1000" sortMissingLast="true"
> > >>            autoGeneratePhraseQueries="true">
> > >>   <analyzer type="index">
> > >>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >>     <charFilter class="solr.MappingCharFilterFactory"
> > >>                 mapping="mapping-FoldToASCII.txt"/>
> > >>     <filter class="solr.CJKBigramFilterFactory"/>
> > >>     <filter class="solr.WordDelimiterGraphFilterFactory"
> > >>             protected="protectedword.txt"
> > >>             preserveOriginal="0" splitOnNumerics="1"
> > >>             splitOnCaseChange="0"
> > >>             catenateWords="1" catenateNumbers="1" catenateAll="1"
> > >>             generateWordParts="1" generateNumberParts="1"
> > >>             stemEnglishPossessive="1"
> > >>             types="wdfftypes.txt" />
> > >>     <filter class="solr.LengthFilterFactory" min="1"
> > >>             max="2147483647"/>
> > >>     <filter class="solr.LowerCaseFilterFactory"/>
> > >>   </analyzer>
> > >>   <analyzer type="query">
> > >>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >>     <charFilter class="solr.MappingCharFilterFactory"
> > >>                 mapping="mapping-FoldToASCII.txt"/>
> > >>     <filter class="solr.CJKBigramFilterFactory"/>
> > >>     <filter class="solr.WordDelimiterGraphFilterFactory"
> > >>             protected="protectedword.txt"
> > >>             preserveOriginal="0" splitOnNumerics="1"
> > >>             splitOnCaseChange="0"
> > >>             catenateWords="1" catenateNumbers="1" catenateAll="1"
> > >>             generateWordParts="1" generateNumberParts="1"
> > >>             stemEnglishPossessive="1"
> > >>             types="wdfftypes.txt" />
> > >>     <filter class="solr.LengthFilterFactory" min="1"
> > >>             max="2147483647"/>
> > >>     <filter class="solr.LowerCaseFilterFactory"/>
> > >>   </analyzer>
> > >> </fieldType>
> > >>
> > >> If we search for a phrase like "Moosburg a.d. Isar" we don't get a
> > >> match, though it's definitely in our Index.
> > >> If we search for "Moosburg a. d. Isar" with a blank between "a."
> > >> and "d." we get a match.
> > >>
> > >> This also happens for other non-word characters, like ' or , for
> > >> example.
> > >>
> > >> The strange thing about it is, that the Solr Analysis-Tool reports
> > >> a match for the first version, but when we send a Solr Query, we get no
> > >> result Documents.
> > >>
> > >> Has anyone got an idea, what this could be?
> > >>
> > >> Thank you very much in advance,
> > >>
> > >> Doris Peter