Re: NgramTokenizerFactory question

Kudrettin Güleryüz Thu, 05 Jul 2018 10:31:06 -0700

Thank you for the explanation.

To close the loop, I was able to track the problem down to the Lucene query
parser on 5.2.1, which returned +body:"123 234 345 456" for the query string
123456.


It turned out that it is possible to get the same behavior in Solr by turning
on split on whitespace and auto-generated phrase queries when using
NgramTokenizerFactory.
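
For the archives, here is a minimal sketch in plain Java (no Lucene or Solr
dependency) of why the phrase form of the query behaves differently from
matching on any single trigram. The class NGramSketch and its methods are
names I made up for illustration. It only mimics the token generation and the
two matching styles, assuming each trigram sits one position after the
previous one. It is not how Lucene implements either of them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative only: NGramSketch and its methods are hypothetical names,
// not part of the Lucene or Solr API.
public class NGramSketch {

    // All character n-grams of text with lengths min..max,
    // ordered by start offset, then by length.
    public static List<String> ngrams(String text, int min, int max) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int len = min; len <= max && start + len <= text.length(); len++) {
                out.add(text.substring(start, start + len));
            }
        }
        return out;
    }

    // "Any token" matching: true if at least one query trigram
    // is also a trigram of the indexed string.
    public static boolean matchesAny(String indexed, String query) {
        List<String> indexedGrams = ngrams(indexed, 3, 3);
        for (String gram : ngrams(query, 3, 3)) {
            if (indexedGrams.contains(gram)) {
                return true;
            }
        }
        return false;
    }

    // Phrase-style matching: all query trigrams must appear
    // consecutively and in order among the indexed trigrams.
    public static boolean matchesPhrase(String indexed, String query) {
        List<String> queryGrams = ngrams(query, 3, 3);
        if (queryGrams.isEmpty()) {
            return false;
        }
        return Collections.indexOfSubList(ngrams(indexed, 3, 3), queryGrams) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("123456", 3, 3));            // [123, 234, 345, 456]
        System.out.println(matchesAny("fac", "face"));         // true
        System.out.println(matchesPhrase("fac", "face"));      // false
        System.out.println(matchesPhrase("interface", "face")); // true
        System.out.println(matchesPhrase("123456", "2349"));   // false
    }
}
```

With "any token" matching, a document containing only fac matches the query
face, which is what I was seeing in Solr. With phrase-style matching it does
not, while interface still does, which is what the old application returned.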



On Mon, Jul 2, 2018 at 3:24 PM Alexandre Rafalovitch <arafa...@gmail.com>
wrote:

> I am not familiar with the Lucene method for creating an analyzer. Perhaps
> it was already doing just the analysis phase. But here is what the NGram
> tokenizer would do to a string of '123456' with just trigrams:
> 123
> 234
> 345
> 456
>
> So, if you only apply it on the index side, and your query is '2345' -
> there is no such token in the index to match against.
>
> On the other hand, if you apply trigram on the query side as well,
> against the query '2349', it will split into:
> 234
> 349
>
> And 234 would match. If it is OK for you that 2349 matches 123456, you
> are fine. But if you want the whole search string to be actually
> present, then you need an index-only NGram, and its max size needs to be
> at least the length of your longest possible search string.
>
> So with index-only min=3 and max=4, you will get:
> 123
> 1234
> 234
> 2345
> 345
> 3456
> 456
>
> Then 2349, not being ngrammed, will not match anything, but 2345 will.
>
> Again, Admin UI will show that to you.
>
> Regards,
>    Alex.
>
> On 2 July 2018 at 14:33, Kudrettin Güleryüz <kudret...@gmail.com> wrote:
> >> 1) if you want face to match interface, you need max value to be at
> >> least 4.
> > Can you please explain this a bit more? I am not following this one.
> > Values are set to 3,3 and Solr already matches interface and interfaces
> > when searched for face. In addition to that, Solr matches the trigrams of
> > face (fac and ace) as well, which I find not as relevant as interface or
> > faceted.
> >
> > The application I am moving to Solr 7.3.1 currently uses the Lucene
> > 5.3.1 API and has a custom analyzer like the following:
> >
> >
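> > import org.apache.lucene.analysis.Tokenizer;
> > import org.apache.lucene.analysis.ngram.NGramTokenizer;
> >
> > // SourceSearchAnalyzer is the application's own analyzer base class (not shown).
> > public class TrigramCaseAnalyzer extends SourceSearchAnalyzer {
> >     private int indexType;
> >
> >     public TrigramCaseAnalyzer() {
> >         indexType = 1;
> >     }
> >
> >     @Override
> >     public int getIndexType() {
> >         return this.indexType;
> >     }
> >
> >     @Override
> >     public void setIndexType(int type) {
> >         this.indexType = type;
> >     }
> >
> >     @Override
> >     protected TokenStreamComponents createComponents(String fieldName) {
> >         // Emit every 3-character n-gram (minGram = 3, maxGram = 3) of the input.
> >         Tokenizer st = new NGramTokenizer(3, 3);
> >         return new TokenStreamComponents(st);
> >     }
> > }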
> >
> > This somehow behaves as I described (a search for face returns interface,
> > face and faceted, but not fac or ace).
> >
> > Is there a change since 5.3.1 regarding this behaviour in Lucene? Or is
> > the difference in behaviour caused by Solr's implementation of the Lucene
> > API?
> >
> > Thank you
> >
> >
> > On Mon, Jul 2, 2018 at 2:00 PM Alexandre Rafalovitch <arafa...@gmail.com>
> > wrote:
> >
> >> Two things:
> >> 1) if you want face to match interface, you need max value to be at
> >> least 4.
> >> 2) you probably have the factory applied symmetrically, or on the Query
> >> analyzer. You probably want it on the Index analyzer side only. Otherwise
> >> you are trying to match any 3-letter query substring against your index.
> >>
> >> Admin UI analysis screen will show that to you.
> >>
> >> Regards,
> >>     Alex
> >>
> >> On Mon, Jul 2, 2018, 11:01 AM Kudrettin Güleryüz, <kudret...@gmail.com>
> >> wrote:
> >>
> >> > Hi,
> >> >
> >> > When using NgramTokenizerFactory with settings min ngram size=3 and
> >> > max ngram size=3 I get the following behaviour.
> >> >
> >> > Assume that the search term is face.
> >> >
> >> > I expect the results to show documents with strings:
> >> > * interface or
> >> > * face or
> >> > * faceted
> >> >
> >> > but not
> >> > * ace or
> >> > * fac
> >> >
> >> > Why would I get matches for ace or fac? Am I missing some
> >> > settings somewhere? What is the suggested way to change this
> >> > behaviour?
> >> >
> >> > Thank you,
> >> >
> >>
>
