[jira] [Commented] (LUCENE-6336) AnalyzingInfixSuggester needs duplicate handling

Mike Sokolov (JIRA) Sat, 17 Nov 2018 06:31:08 -0800


    [ https://issues.apache.org/jira/browse/LUCENE-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16690571#comment-16690571 ]


Mike Sokolov commented on LUCENE-6336:
--------------------------------------

I dug into this a bit. Lucene already provides {{SortedInputIterator}}, but 
{{DocumentExpressionDictionary}} and its factory do not use it. It seems to me 
that the dictionary and its factory could expose an option for de-duping. I 
wouldn't want to make it the default, since your dictionary might already be 
unique, and in that case you wouldn't want to pay the penalty for sorting. If 
we agree that is the solution, I think this issue should be moved over to 
Solr, and in that case the unit test in the patch isn't really pointing at the 
problem.

It's certainly possible to subclass {{DocumentExpressionDictionaryFactory}} and 
{{DocumentExpressionDictionary}}, overriding {{create(...)}} and 
{{getEntryIterator()}} respectively, to wrap the original iterator with 
{{SortedInputIterator}}, but this does require some Java programming.
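
To illustrate why sorting makes de-duping cheap at iteration time (and why it 
costs something up front), here is a minimal, self-contained sketch in plain 
Java. It deliberately does not use any Lucene or Solr classes: the {{Entry}} 
type, the class name and the {{dedupKeepHighestWeight}} method are all 
hypothetical names made up for this example, and it says nothing about how 
{{SortedInputIterator}} behaves internally. It only shows the idea that once 
entries are ordered by term and then by descending weight, keeping the first 
entry of each run yields the highest-weight suggestion per term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; not part of Lucene or Solr.
public class SortedDedupSketch {

    // Hypothetical stand-in for one suggestion entry (term plus weight).
    static final class Entry {
        final String term;
        final long weight;

        Entry(String term, long weight) {
            this.term = term;
            this.weight = weight;
        }

        @Override
        public String toString() {
            return term + ":" + weight;
        }
    }

    // Sorts a copy of the input by term, then by descending weight, and keeps
    // only the first entry of each run of equal terms. The sort is the extra
    // cost that an already-unique dictionary would pay for nothing.
    static List<Entry> dedupKeepHighestWeight(List<Entry> input) {
        List<Entry> sorted = new ArrayList<>(input);
        sorted.sort(Comparator.comparing((Entry e) -> e.term)
                .thenComparing(Comparator.comparingLong((Entry e) -> e.weight).reversed()));

        List<Entry> unique = new ArrayList<>();
        String previousTerm = null;
        for (Entry e : sorted) {
            if (!e.term.equals(previousTerm)) {
                unique.add(e); // highest weight for this term comes first
                previousTerm = e.term;
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<Entry> input = Arrays.asList(
                new Entry("English", 98),
                new Entry("French", 7),
                new Entry("English", 100),
                new Entry("English", 99));
        System.out.println(dedupKeepHighestWeight(input));
    }
}
```

In a real dictionary the same pass would presumably run over the sorted 
iterator rather than an in-memory list, and payloads would need a tie-breaking 
rule of their own; both points are left out of this sketch.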


> AnalyzingInfixSuggester needs duplicate handling
> ------------------------------------------------
>
>                 Key: LUCENE-6336
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6336
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 4.10.3, 5.0
>            Reporter: Jan Høydahl
>            Assignee: Jan Høydahl
>            Priority: Major
>              Labels: lookup, suggester
>         Attachments: LUCENE-6336.patch
>
>
> Spinoff from LUCENE-5833 but otherwise unrelated.
> This concerns {{AnalyzingInfixSuggester}}, which is backed by a Lucene index and 
> stores payload and score together with the suggest text.
> I did some testing with Solr, producing the DocumentDictionary from an index 
> with multiple documents containing the same text, but with random weights 
> between 0 and 100. I then got duplicate, identical suggestions sorted by weight:
> {code}
> {
>   "suggest":{"languages":{
>       "engl":{
>         "numFound":101,
>         "suggestions":[{
>             "term":"<b>Engl</b>ish",
>             "weight":100,
>             "payload":"0"},
>           {
>             "term":"<b>Engl</b>ish",
>             "weight":99,
>             "payload":"0"},
>           {
>             "term":"<b>Engl</b>ish",
>             "weight":98,
>             "payload":"0"},
> ---etc all the way down to 0---
> {code}
> I also reproduced the same behavior in AnalyzingInfixSuggester directly. So 
> there is a need for some duplicate removal here, either while building the 
> local suggest index or during lookup. Only the highest weight suggestion for 
> a given term should be returned.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
