There are many layers to this, but for the config you posted (applying index-time WordDelimiterGraphFilter, WDGF, configured to both split and catenate tokens), the fundamental issue is that Lucene doesn't index positionLength, so the graph structure (and token adjacency information) of the token stream is lost when it's serialized to the index. Once the positionLength information is discarded, it's impossible to restore or leverage it at query time.
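To make that concrete, here is a rough toy model in Python (not Lucene code). The token positions and positionLength values are my assumption of what your index-time chain emits for the example document, and TOKENS, build_index, phrase_matches and graph_phrase_matches are hypothetical names made up for the illustration. It only sketches why dropping positionLength breaks phrases that mix a catenated token with its neighbours.

```python
from collections import defaultdict

# Assumed index-time token stream for
# "mr. i.n.i.t. firstsirname secondsirname": (term, position, positionLength).
# These values are an illustration, not output captured from Solr.
TOKENS = [
    ("mr.", 0, 1), ("mr", 0, 1),
    ("i.n.i.t.", 1, 4), ("init", 1, 4),
    ("i", 1, 1), ("n", 2, 1), ("i", 3, 1), ("t", 4, 1),
    ("firstsirname", 5, 1), ("secondsirname", 6, 1),
]


def build_index(tokens):
    """Keep only term -> positions; positionLength is thrown away here."""
    index = defaultdict(set)
    for term, position, _position_length in tokens:
        index[term].add(position)
    return index


def phrase_matches(index, phrase):
    """Position-only phrase match: term k must sit at start + k."""
    terms = phrase.lower().split()
    return any(
        all(start + k in index.get(term, ()) for k, term in enumerate(terms))
        for start in index.get(terms[0], ())
    )


def graph_phrase_matches(tokens, phrase):
    """Hypothetical graph-aware match, possible only if positionLength were
    kept: each term must start where the previous term ends."""
    terms = phrase.lower().split()
    ends = {pos + length for term, pos, length in tokens if term == terms[0]}
    for wanted in terms[1:]:
        ends = {
            pos + length
            for term, pos, length in tokens
            if term == wanted and pos in ends
        }
    return bool(ends)
```

With this model, phrase_matches(build_index(TOKENS), "mr. init") is True, but "init firstsirname" is False: "init" sits at position 1 and "firstsirname" at position 5, and nothing in the index says that "init" spans four positions. graph_phrase_matches(TOKENS, "init firstsirname") is True because it still has the positionLength to work with.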
For now, if you use WDGF (or any analysis component capable of generating "graph"-type output) at index-time, you'll have issues unless you configure it such that it won't in practice generate graph output. For WDGF this would mean either catenate output, or split output, but not both on a single analysis chain. If you need both, one option would be to index to (and search on) two fields: one for catenated analysis, one for split analysis.

Graph output *is* respected at query-time, so you have more options configuring WDGF on a query-time analyzer. But in that case, it's worth being aware of the potential for exponential query expansion (see discussion at https://issues.apache.org/jira/browse/SOLR-13336, which restores a safety valve for extreme instances of this case).

Some other potentially relevant issues/links:
https://issues.apache.org/jira/browse/LUCENE-4312
https://issues.apache.org/jira/browse/LUCENE-7398
https://www.elastic.co/blog/multitoken-synonyms-and-graph-queries-in-elasticsearch (Lucene, so applies also to Solr)
https://michaelgibney.net/lucene/graph/

On Wed, Feb 19, 2020 at 10:27 AM Jeroen Steggink | knowsy <jer...@knowsy.nl> wrote:
>
> Hi,
>
> I have a question regarding phrase search in combination with a
> WordDelimiterGraphFilter (Solr 8.4.1).
>
> Whenever I try to search using a phrase where token combination consists
> of delimited and non-delimited tokens, I don't get any matches.
>
> This is the configuration:
>
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.ASCIIFoldingFilterFactory"/>
>     <filter class="solr.WordDelimiterGraphFilterFactory"
>       generateWordParts="1"
>       generateNumberParts="1"
>       catenateWords="1"
>       catenateNumbers="0"
>       catenateAll="0"
>       splitOnCaseChange="1"
>       preserveOriginal="1"/>
>     <filter class="solr.FlattenGraphFilterFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.ASCIIFoldingFilterFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> <field name="text" type="text" indexed="true" stored="true"
>   omitTermFreqAndPositions="false" />
>
>
> Example document:
>
> {
>   id: '1',
>   text: 'mr. i.n.i.t. firstsirname secondsirname'
> }
>
> Queries and results:
>
> Query:
> "mr. i.n.i.t. firstsirname"
> -----
> No result
>
> Query:
> "mr. i.n.i.t."
> -----
> Result
>
> Query:
> "mr. i n i t"
> -----
> Result
>
> Query:
> "mr. init"
> -----
> Result
>
> Query:
> "mr init"
> -----
> Result
>
> Query:
> "i.n.i.t. firstsirname"
> -----
> No result
>
> Query:
> "init firstsirname"
> -----
> No result
>
> Query:
> "i.n.i.t. firstsirname secondsirname"
> -----
> No result
>
> Query:
> "init firstsirname secondsirname"
> -----
> No result
>
>
> I don't quite understand why this is. When looking at the results of the
> analyzers I don't understand why it's working with just delimited or
> non-delimited tokens. However, as soon as the mixed combination of
> delimited and non-delimited is searched, there is no match.
>
> Could someone explain? And is there a solution to make it work?
>
> Best regards,
>
> Jeroen
>
>