[ https://issues.apache.org/jira/browse/LUCENE-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356754#comment-14356754 ]

Shai Erera commented on LUCENE-6336:
------------------------------------

So I wrote these two simple tests:

+FuzzySuggester+
{code}
  public void testDuplicateInput() throws Exception {
    Input keys[] = new Input[] {
        new Input("duplicate", 8),
        new Input("duplicate", 12),
        new Input("duplicate", 12),
    };
    
    Analyzer analyzer = new MockAnalyzer(random(), MockTokenizer.WHITESPACE, true, MockTokenFilter.ENGLISH_STOPSET);
    FuzzySuggester suggester = new FuzzySuggester(analyzer, analyzer,
        AnalyzingSuggester.EXACT_FIRST | AnalyzingSuggester.PRESERVE_SEP, 256,
        -1, false, FuzzySuggester.DEFAULT_MAX_EDITS,
        FuzzySuggester.DEFAULT_TRANSPOSITIONS,
        FuzzySuggester.DEFAULT_NON_FUZZY_PREFIX,
        FuzzySuggester.DEFAULT_MIN_FUZZY_LENGTH,
        FuzzySuggester.DEFAULT_UNICODE_AWARE);
    suggester.build(new InputArrayIterator(keys));
    
    List<LookupResult> results = suggester.lookup(TestUtil.stringToCharSequence("dup", random()), false, 1);
    System.out.println(results);
   
    analyzer.close();
  }
{code}

This prints:

{code}
[duplicate/12]
{code}

+AnalyzingInfixSuggester+
{code}
  public void testDuplicateInput() throws Exception {
    Input keys[] = new Input[] {
        new Input("duplicate", 8),
        new Input("duplicate", 12),
        new Input("duplicate", 12),
    };
    
    Analyzer a = new MockAnalyzer(random(), MockTokenizer.WHITESPACE, false);
    AnalyzingInfixSuggester suggester = new AnalyzingInfixSuggester(newDirectory(), a, a, 3, false);
    suggester.build(new InputArrayIterator(keys));
    
    List<LookupResult> results = suggester.lookup(TestUtil.stringToCharSequence("dup", random()), 10, true, true);
    System.out.println(results);
    
    suggester.close();
  }
{code}

Prints:

{code}
[duplicate/12, duplicate/12, duplicate/8]
{code}

Both tests use an {{InputArrayIterator}} and the same {{.build()}} API -- the only difference is the suggester type. So if I think about a component in my software that receives a {{Lookup}}, populates it through the common API and then looks values up, it shouldn't have to care about the concrete type of the Lookup instance (right?).

It would be good if we could be consistent IMO, but I know that there is a fundamental difference between the two suggesters -- Fuzzy builds an FST, which I think is the component that resolves the duplicates, while AnalyzingInfixSuggester builds an index. Perhaps in its createResults method it could add the results to a Set (or to one in addition to the List) to resolve the duplicates at lookup time. Of course it would be better if it could detect the duplicates at build() or add() time and avoid indexing them in the first place.
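The lookup-time option could look roughly like this. A minimal self-contained sketch, not the actual suggester code: {{Result}} here is a hypothetical stand-in for Lucene's {{Lookup.LookupResult}} (key plus weight), and the de-duplication keeps only the highest-weight entry per key.

```java
import java.util.*;

public class DedupeResults {
  // Hypothetical stand-in for Lookup.LookupResult: a suggestion key and its weight.
  record Result(String key, long weight) {}

  // Collapse duplicate keys, keeping the highest weight per key,
  // and return the survivors sorted by descending weight.
  static List<Result> dedupe(List<Result> results) {
    Map<String, Result> best = new LinkedHashMap<>();
    for (Result r : results) {
      best.merge(r.key(), r, (a, b) -> a.weight() >= b.weight() ? a : b);
    }
    List<Result> out = new ArrayList<>(best.values());
    out.sort(Comparator.comparingLong(Result::weight).reversed());
    return out;
  }

  public static void main(String[] args) {
    // Mirrors the AnalyzingInfixSuggester output above: three hits, one key.
    List<Result> in = List.of(
        new Result("duplicate", 12),
        new Result("duplicate", 12),
        new Result("duplicate", 8));
    System.out.println(dedupe(in)); // a single entry with weight 12 survives
  }
}
```

The LinkedHashMap keeps the merge cheap (one pass) while the final sort restores weight order, which is why doing this inside createResults would add cost for every lookup, even when the input had no duplicates.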

Usually suggesters handle unique values, and the question is who should ensure that the input values are unique -- the Suggester or the user. That FuzzySuggester happens to use a data structure which resolves the duplicates is a side effect IMO. AnalyzingInfixSuggester also takes a context with each value to suggest, so the value "foo" input twice with different contexts isn't the same value. Therefore it's more involved with AnalyzingInfix than with Fuzzy ...

I'm also not sure that the Suggester is the one that should take care of uniqueness, because the added logic would be executed for all users, whether their input values are unique or not. But if, for example, we could have DocumentDictionary resolve duplicates, then we would let the suggester do what it should do -- suggest from a given list of values. I like that simplicity in responsibility.
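The build-time alternative would collapse the duplicates in the dictionary, before the suggester ever sees them. A hedged sketch under assumed names (this is not DocumentDictionary's API, just the shape of the idea): fold (term, weight) pairs so each term keeps only its highest weight.

```java
import java.util.*;

public class UniqueInputs {
  // Collapse (term, weight) pairs so each term survives once, with its
  // maximum weight. A dictionary could apply this while iterating its
  // source documents, leaving the suggester's build() untouched.
  static Map<String, Long> collapse(List<Map.Entry<String, Long>> inputs) {
    Map<String, Long> unique = new HashMap<>();
    for (var e : inputs) {
      unique.merge(e.getKey(), e.getValue(), Math::max);
    }
    return unique;
  }

  public static void main(String[] args) {
    // The three inputs from the tests above reduce to one entry.
    Map<String, Long> unique = collapse(List.of(
        Map.entry("duplicate", 8L),
        Map.entry("duplicate", 12L),
        Map.entry("duplicate", 12L)));
    System.out.println(unique); // {duplicate=12}
  }
}
```

Note this sketch ignores contexts; as said above, once each value can carry a context, "same term" no longer implies "duplicate", and the collapse key would have to include the context too.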

> AnalyzingInfixSuggester needs duplicate handling
> ------------------------------------------------
>
>                 Key: LUCENE-6336
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6336
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 4.10.3, 5.0
>            Reporter: Jan Høydahl
>             Fix For: Trunk, 5.1
>
>         Attachments: LUCENE-6336.patch
>
>
> Spinoff from LUCENE-5833 but else unrelated.
> Using {{AnalyzingInfixSuggester}} which is backed by a Lucene index and 
> stores payload and score together with the suggest text.
> I did some testing with Solr, producing the DocumentDictionary from an index 
> with multiple documents containing the same text, but with random weights 
> between 0-100. Then I got duplicate identical suggestions sorted by weight:
> {code}
> {
>   "suggest":{"languages":{
>       "engl":{
>         "numFound":101,
>         "suggestions":[{
>             "term":"<b>Engl</b>ish",
>             "weight":100,
>             "payload":"0"},
>           {
>             "term":"<b>Engl</b>ish",
>             "weight":99,
>             "payload":"0"},
>           {
>             "term":"<b>Engl</b>ish",
>             "weight":98,
>             "payload":"0"},
> ---etc all the way down to 0---
> {code}
> I also reproduced the same behavior in AnalyzingInfixSuggester directly. So 
> there is a need for some duplicate removal here, either while building the 
> local suggest index or during lookup. Only the highest weight suggestion for 
> a given term should be returned.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
