Re: NgramTokenizerFactory question

Alexandre Rafalovitch Mon, 02 Jul 2018 12:24:50 -0700

I am not familiar with the Lucene method of creating an analyzer. Perhaps
it was already being applied only at index-time analysis. But here is what
the NGram would do to a string of '123456' with just trigrams:
123
234
345
456
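
To make the trigram split concrete, here is a minimal plain-Java sketch. It is not Lucene's NGramTokenizer, only an illustration of the same idea; the class and method names are made up for this example, and it works on Java chars rather than full Unicode code points.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a hand-rolled character n-gram splitter, not the Lucene class.
final class NGramSketch {

    // Returns every substring of text with a length between minGram and maxGram,
    // ordered by start position and then by length.
    static List<String> ngrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= text.length(); len++) {
                grams.add(text.substring(start, start + len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // min=3, max=3 gives trigrams only
        System.out.println(ngrams("123456", 3, 3)); // [123, 234, 345, 456]
    }
}
```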


So, if you only apply it on the index side, and your query is '2345' -
there is no such token in the index to match against.

On the other hand, if you apply trigram on the query side as well,
against the query '2349', it will split into:
234
349

And 234 would match. If it is acceptable to you that 2349 matches
123456, you are fine. But if you want the whole search string to
actually be present, then you need an index-only NGram whose maximum
gram size is at least the length of your longest possible search string.
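
The difference between the two setups can be sketched in plain Java. This is a toy model, not Solr: it assumes a query matches when any one of its tokens is present in the index (how multiple query tokens are actually combined depends on your query parser settings), and all names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of index-side vs. query-side n-gramming; not Solr or Lucene code.
final class SymmetricNGramSketch {

    // Same hand-rolled splitter as before: substrings of length minGram..maxGram.
    static List<String> ngrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= text.length(); len++) {
                grams.add(text.substring(start, start + len));
            }
        }
        return grams;
    }

    // Assumption: a single matching query token is enough for a hit.
    static boolean anyTokenMatches(Set<String> indexTokens, List<String> queryTokens) {
        for (String token : queryTokens) {
            if (indexTokens.contains(token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> index = new HashSet<>(ngrams("123456", 3, 3));

        // Trigrams on the query side too: 2349 becomes 234 and 349, and 234 is indexed.
        System.out.println(anyTokenMatches(index, ngrams("2349", 3, 3))); // true

        // Index-only trigrams: the query stays one token, 2345, which was never indexed.
        System.out.println(anyTokenMatches(index, Arrays.asList("2345"))); // false
    }
}
```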

So with index-only min=3 and max=4, you will get:
123
1234
234
2345
345
3456
456

Then 2349, not being ngrammed, will not match anything, but 2345 will.
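
As a rough check of the index-only min=3, max=4 case, the following plain-Java sketch builds the same token list and looks up whole query strings in it. It only mimics the idea and is not the Lucene implementation; the names are hypothetical. It also shows why the max has to cover your longest query: a 5-character substring is not found.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model of an index-only NGram field; the query is looked up as a single token.
final class IndexOnlyNGramSketch {

    // Hand-rolled splitter: substrings of length minGram..maxGram,
    // ordered by start position and then by length.
    static List<String> ngrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= text.length(); len++) {
                grams.add(text.substring(start, start + len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // LinkedHashSet keeps insertion order so the printout is easy to read.
        Set<String> index = new LinkedHashSet<>(ngrams("123456", 3, 4));
        System.out.println(index); // [123, 1234, 234, 2345, 345, 3456, 456]

        System.out.println(index.contains("2345"));  // true: present as a 4-gram
        System.out.println(index.contains("2349"));  // false: never occurs in 123456
        System.out.println(index.contains("23456")); // false: longer than max=4
    }
}
```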

Again, the Admin UI analysis screen will show that to you.

Regards,
   Alex.

On 2 July 2018 at 14:33, Kudrettin Güleryüz <kudret...@gmail.com> wrote:
>> 1) if you want face to match interface, you need max value to be at least
>> 4.
> Can you please explain this a bit more? I am not following this one. Values
> are set to 3,3 and Solr already matches interface and interfaces when
> searching for face. In addition to that, Solr matches the trigrams of face
> (fac and ace) as well, which I find less relevant than interface or faceted.
>
> The application I am moving to Solr 7.3.1 currently uses Lucene
> API 5.3.1 and has a custom analyzer like the following:
>
>
> public class TrigramCaseAnalyzer extends SourceSearchAnalyzer {
>     private int indexType;
>
>     public TrigramCaseAnalyzer() {
>         indexType = 1;
>     }
>
>     @Override
>     public int getIndexType() {
>         return this.indexType;
>     }
>
>     @Override
>     public void setIndexType(int type) {
>         this.indexType = type;
>     }
>
>     @Override
>     protected TokenStreamComponents createComponents(String fieldName) {
>         Tokenizer st;
>         st = new NGramTokenizer(3, 3);
>         return new TokenStreamComponents(st);
>     }
> }
>
> This somehow behaves as I described (a search for face returns interface,
> face, and faceted, but not fac or ace).
>
> Has this behaviour changed in Lucene since 5.3.1? Or is the
> difference in behaviour caused by Solr's implementation of the Lucene API?
>
> Thank you
>
>
> On Mon, Jul 2, 2018 at 2:00 PM Alexandre Rafalovitch <arafa...@gmail.com>
> wrote:
>
>> Two things:
>> 1) if you want face to match interface, you need max value to be at least
>> 4.
>> 2) you probably have the factory applied symmetrically or on the Query
>> analyzer. You probably want it on the Index analyzer side only. Otherwise
>> you are trying to match any 3-letter query substring against your index.
>>
>> Admin UI analysis screen will show that to you.
>>
>> Regards,
>>     Alex
>>
>> On Mon, Jul 2, 2018, 11:01 AM Kudrettin Güleryüz, <kudret...@gmail.com>
>> wrote:
>>
>> > Hi,
>> >
>> > When using NgramTokenizerFactory with settings min ngram size=3 and max
>> > ngram size=3 I get the following behaviour.
>> >
>> > Assume that the search term is face.
>> >
>> > I expect the results to show documents with strings:
>> > * interface or
>> > * face or
>> > * faceted
>> >
>> > but not
>> > * ace or
>> > * fac
>> >
>> > Why would I get matches for ace or fac? Am I missing some
>> > settings somewhere? What is the suggested way to change this
>> > behaviour?
>> >
>> > Thank you,
>> >
>>
