Thank you for the explanation. To close the loop: I tracked the problem down to the Lucene query parser in 5.2.1, which returned +body:"123 234 345 456" for the query string 123456.
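To illustrate why a phrase of trigrams behaves this way, here is a minimal sketch in plain Java. It does not use Lucene; the class and method names (TrigramPhraseSketch, ngrams, phraseMatch, anyTermMatch) are made up for illustration, and a phrase query is modelled simply as "the query trigrams appear consecutively in the document's trigrams", ignoring scoring and everything else a real query parser does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Lucene or Solr.
final class TrigramPhraseSketch {

    // Emit all n-grams of length min..max, ordered by start offset, then length.
    static List<String> ngrams(String text, int min, int max) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int len = min; len <= max && start + len <= text.length(); len++) {
                out.add(text.substring(start, start + len));
            }
        }
        return out;
    }

    // Phrase-style match: every query trigram must occur, in order and adjacent.
    static boolean phraseMatch(String doc, String query) {
        List<String> docGrams = ngrams(doc, 3, 3);
        List<String> queryGrams = ngrams(query, 3, 3);
        return !queryGrams.isEmpty()
                && Collections.indexOfSubList(docGrams, queryGrams) >= 0;
    }

    // Term-style match: any single query trigram found in the document is enough.
    static boolean anyTermMatch(String doc, String query) {
        List<String> docGrams = ngrams(doc, 3, 3);
        for (String gram : ngrams(query, 3, 3)) {
            if (docGrams.contains(gram)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("123456", 3, 3));           // [123, 234, 345, 456]
        System.out.println(phraseMatch("interface", "face")); // true
        System.out.println(phraseMatch("fac", "face"));       // false
        System.out.println(anyTermMatch("fac", "face"));      // true
    }
}
```

Under these assumptions, the phrase form matches interface, face and faceted for the query face, while the term form also matches documents that contain only fac or ace.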
It turned out that the same behavior can be obtained in Solr by turning on split on whitespace and autoGeneratePhraseQueries when using NGramTokenizerFactory.

On Mon, Jul 2, 2018 at 3:24 PM Alexandre Rafalovitch <arafa...@gmail.com> wrote:

> I am not familiar with the Lucene method of creating an analyzer. Perhaps it
> was already doing just the analysis phase. But here is what the NGram
> tokenizer would do to the string '123456' with just trigrams:
> 123
> 234
> 345
> 456
>
> So, if you only apply it on the index side, and your query is '2345',
> there is no such token in the index to match against.
>
> On the other hand, if you apply trigrams on the query side as well,
> the query '2349' will split into:
> 234
> 349
>
> And 234 would match. If it is OK for you that 2349 matches
> 123456, you are fine. But if you want any search string to be
> actually present in full, then you need an index-only NGram, and its
> maximum size needs to be the length of your longest possible search string.
>
> So with index-only min=3 and max=4, you will get:
> 123
> 1234
> 234
> 2345
> 345
> 3456
> 456
>
> Then 2349, not being ngrammed, will not match anything, but 2345 will.
>
> Again, the Admin UI will show that to you.
>
> Regards,
>    Alex.
>
> On 2 July 2018 at 14:33, Kudrettin Güleryüz <kudret...@gmail.com> wrote:
> >> 1) if you want face to match interface, you need max value to be at least
> >> 4.
> > Can you please explain this a bit more? I am not following this one. Values
> > are set to 3,3 and Solr already matches interface and interfaces when
> > searching for face. In addition to that, Solr matches the trigrams of face
> > (fac and ace) as well, which I find not as relevant as interface or faceted.
> >
> > The application I am moving to Solr 7.3.1 currently uses Lucene
> > API 5.3.1 and has a custom analyzer like the following:
> >
> > public class TrigramCaseAnalyzer extends SourceSearchAnalyzer {
> >     private int indexType;
> >
> >     public TrigramCaseAnalyzer() {
> >         indexType = 1;
> >     }
> >
> >     @Override
> >     public int getIndexType() {
> >         return this.indexType;
> >     }
> >
> >     @Override
> >     public void setIndexType(int type) {
> >         this.indexType = type;
> >     }
> >
> >     @Override
> >     protected TokenStreamComponents createComponents(String fieldName) {
> >         // minGram = 3, maxGram = 3: emit trigrams only
> >         Tokenizer st = new NGramTokenizer(3, 3);
> >         return new TokenStreamComponents(st);
> >     }
> > }
> >
> > This somehow behaves as I described (a search for face returns interface,
> > face and faceted, but not fac or ace).
> >
> > Has anything changed in Lucene since 5.3.1 regarding this behaviour? Or is the
> > difference in behaviour caused by Solr's implementation of the Lucene API?
> >
> > Thank you
> >
> > On Mon, Jul 2, 2018 at 2:00 PM Alexandre Rafalovitch <arafa...@gmail.com>
> > wrote:
> >
> >> Two things:
> >> 1) if you want face to match interface, you need max value to be at least
> >> 4.
> >> 2) you probably have the factory applied symmetrically or on the Query
> >> analyzer. You probably want it on the Index analyzer side only. Otherwise
> >> you are trying to match any 3-letter query substring against your index.
> >>
> >> Admin UI analysis screen will show that to you.
> >>
> >> Regards,
> >>     Alex
> >>
> >> On Mon, Jul 2, 2018, 11:01 AM Kudrettin Güleryüz, <kudret...@gmail.com>
> >> wrote:
> >>
> >> > Hi,
> >> >
> >> > When using NgramTokenizerFactory with settings min ngram size=3 and max
> >> > ngram size=3 I get the following behaviour.
> >> >
> >> > Assume that the search term is face.
> >> >
> >> > I expect the results to show documents with strings:
> >> > * interface or
> >> > * face or
> >> > * faceted
> >> >
> >> > but not
> >> > * ace or
> >> > * fac
> >> >
> >> > Why would I get matches with results ace or fac? Am I missing some
> >> > settings somewhere? What is the suggested way to change this
> >> > behaviour?
> >> >
> >> > Thank you,
> >> >
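
For readers who want to check the index-only versus query-side n-gram behaviour described in the thread without opening the Admin UI, the following is a rough sketch in plain Java. It does not call Lucene or Solr; the class and method names (NGramIndexOnlySketch, ngrams, matchesWholeQuery, matchesAnyQueryTrigram) are hypothetical, and matching is reduced to simple token lookup against the set of indexed n-grams.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene or Solr.
final class NGramIndexOnlySketch {

    // Emit all n-grams of length min..max, ordered by start offset, then length.
    static List<String> ngrams(String text, int min, int max) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int len = min; len <= max && start + len <= text.length(); len++) {
                out.add(text.substring(start, start + len));
            }
        }
        return out;
    }

    // Index-only n-grams: the query is looked up as one unsplit token.
    static boolean matchesWholeQuery(String indexed, String query, int min, int max) {
        Set<String> index = new HashSet<>(ngrams(indexed, min, max));
        return index.contains(query);
    }

    // N-grams on both sides: the query is split into trigrams, any hit counts.
    static boolean matchesAnyQueryTrigram(String indexed, String query) {
        Set<String> index = new HashSet<>(ngrams(indexed, 3, 3));
        for (String gram : ngrams(query, 3, 3)) {
            if (index.contains(gram)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // [123, 1234, 234, 2345, 345, 3456, 456]
        System.out.println(ngrams("123456", 3, 4));
        System.out.println(matchesWholeQuery("123456", "2345", 3, 4)); // true
        System.out.println(matchesWholeQuery("123456", "2349", 3, 4)); // false
        System.out.println(matchesAnyQueryTrigram("123456", "2349"));  // true, via 234
    }
}
```

With min=3 and max=3 on the index side only, matchesWholeQuery("123456", "2345", 3, 3) is false, which is the "no such token in the index" case from the thread.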