[
https://issues.apache.org/jira/browse/LUCENE-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356754#comment-14356754
]
Shai Erera commented on LUCENE-6336:
------------------------------------
So I wrote these two simple tests:
+FuzzySuggester+
{code}
public void testDuplicateInput() throws Exception {
  Input keys[] = new Input[] {
      new Input("duplicate", 8),
      new Input("duplicate", 12),
      new Input("duplicate", 12),
  };
  Analyzer analyzer = new MockAnalyzer(random(), MockTokenizer.WHITESPACE,
      true, MockTokenFilter.ENGLISH_STOPSET);
  FuzzySuggester suggester = new FuzzySuggester(analyzer, analyzer,
      AnalyzingSuggester.EXACT_FIRST | AnalyzingSuggester.PRESERVE_SEP, 256,
      -1, false, FuzzySuggester.DEFAULT_MAX_EDITS,
      FuzzySuggester.DEFAULT_TRANSPOSITIONS,
      FuzzySuggester.DEFAULT_NON_FUZZY_PREFIX,
      FuzzySuggester.DEFAULT_MIN_FUZZY_LENGTH,
      FuzzySuggester.DEFAULT_UNICODE_AWARE);
  suggester.build(new InputArrayIterator(keys));
  List<LookupResult> results =
      suggester.lookup(TestUtil.stringToCharSequence("dup", random()), false, 1);
  System.out.println(results);
  analyzer.close();
}
{code}
This prints:
{code}
[duplicate/12]
{code}
+AnalyzingInfixSuggester+
{code}
public void testDuplicateInput() throws Exception {
  Input keys[] = new Input[] {
      new Input("duplicate", 8),
      new Input("duplicate", 12),
      new Input("duplicate", 12),
  };
  Analyzer a = new MockAnalyzer(random(), MockTokenizer.WHITESPACE, false);
  AnalyzingInfixSuggester suggester =
      new AnalyzingInfixSuggester(newDirectory(), a, a, 3, false);
  suggester.build(new InputArrayIterator(keys));
  List<LookupResult> results =
      suggester.lookup(TestUtil.stringToCharSequence("dup", random()), 10, true, true);
  System.out.println(results);
  suggester.close();
}
{code}
Prints:
{code}
[duplicate/12, duplicate/12, duplicate/8]
{code}
Both tests use an {{InputArrayIterator}} and the same {{.build()}} API - the
only thing that differs is the suggester type. So if I think about a
component in my software that gets a {{Lookup}}, populates values into it
through the common API and looks them up, it shouldn't have to care about the
concrete type of the Lookup instance (right?).
It would be good if we could be consistent, IMO, but I know that there is a
fundamental difference between the two suggesters -- Fuzzy builds an FST, which
I think is the component that resolves the duplicates, while
AnalyzingInfixSuggester builds an index. Perhaps in its createResults method it
could add the results to a Set (or to one in addition to the List) to resolve
the duplicates at lookup time. Of course it would be better if it could detect
the duplicates at build() or add() time and avoid indexing them in the first
place.
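If the lookup-time route were taken, the collapsing step could look roughly
like the sketch below. {{Result}} is a hypothetical stand-in for Lucene's
{{LookupResult}}, and {{dedupe}} is just the collapsing logic, not the actual
createResults implementation -- it keeps only the highest-weight result per
suggestion key:
{code}
import java.util.*;

public class DedupLookup {
  // Hypothetical stand-in for Lucene's LookupResult: a suggestion key plus weight.
  static final class Result {
    final String key;
    final long value;
    Result(String key, long value) { this.key = key; this.value = value; }
    @Override public String toString() { return key + "/" + value; }
  }

  // Collapse duplicate keys, keeping only the highest-weight result per key,
  // then sort by weight descending -- the order suggesters return results in.
  static List<Result> dedupe(List<Result> results) {
    Map<String, Result> best = new LinkedHashMap<>();
    for (Result r : results) {
      Result prev = best.get(r.key);
      if (prev == null || r.value > prev.value) {
        best.put(r.key, r);
      }
    }
    List<Result> out = new ArrayList<>(best.values());
    out.sort((a, b) -> Long.compare(b.value, a.value));
    return out;
  }

  public static void main(String[] args) {
    // The duplicates AnalyzingInfixSuggester returned in the test above...
    List<Result> raw = Arrays.asList(
        new Result("duplicate", 12),
        new Result("duplicate", 12),
        new Result("duplicate", 8));
    // ...collapse to the single result FuzzySuggester printed: [duplicate/12]
    System.out.println(dedupe(raw));
  }
}
{code}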
Usually suggesters handle unique values, and the question is who should ensure
that the input values are unique -- the suggester or the user. That
FuzzySuggester happens to use a data structure which resolves the duplicates is
a side effect, IMO. Also, AnalyzingInfixSuggester takes a context with each
suggested value, so the value "foo" input twice with different contexts is not
the same value. Therefore it's more involved, IMO, with AnalyzingInfix than
with Fuzzy ...
I'm also not sure that the suggester is the one that should take care of
uniqueness, because the added logic would be executed for all users, whether
their input values are unique or not. But if, for example, we could have
DocumentDictionary resolve the duplicates, then we would let the suggester do
what it should do -- suggest from a given list of values. I like that
simplicity of responsibility.
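Deduplicating on the input side could be sketched like this; plain key/weight
pairs stand in for Lucene's {{Input}} objects here, and keeping the maximum
weight per key is an assumption about the desired semantics (the issue asks
that only the highest-weight suggestion survive):
{code}
import java.util.*;

public class DedupInputs {
  // Collapse duplicate input keys before handing them to a suggester's
  // build(): keep the maximum weight seen for each key. This is a sketch of
  // the idea, not the actual DocumentDictionary implementation.
  static Map<String, Long> dedupeInputs(List<Map.Entry<String, Long>> inputs) {
    Map<String, Long> unique = new LinkedHashMap<>();
    for (Map.Entry<String, Long> e : inputs) {
      unique.merge(e.getKey(), e.getValue(), Math::max);
    }
    return unique;
  }

  public static void main(String[] args) {
    // The same three inputs as in the tests above.
    List<Map.Entry<String, Long>> keys = Arrays.asList(
        Map.entry("duplicate", 8L),
        Map.entry("duplicate", 12L),
        Map.entry("duplicate", 12L));
    // A single entry survives, with the highest weight -- matching what the
    // FST-backed FuzzySuggester happens to do implicitly.
    System.out.println(dedupeInputs(keys));
  }
}
{code}
With the duplicates gone at this stage, any downstream {{Lookup}} behaves the
same regardless of its internal data structure.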
> AnalyzingInfixSuggester needs duplicate handling
> ------------------------------------------------
>
> Key: LUCENE-6336
> URL: https://issues.apache.org/jira/browse/LUCENE-6336
> Project: Lucene - Core
> Issue Type: Bug
> Affects Versions: 4.10.3, 5.0
> Reporter: Jan Høydahl
> Fix For: Trunk, 5.1
>
> Attachments: LUCENE-6336.patch
>
>
> Spinoff from LUCENE-5833 but else unrelated.
> I'm using {{AnalyzingInfixSuggester}}, which is backed by a Lucene index and
> stores the payload and score together with the suggest text.
> I did some testing with Solr, producing the DocumentDictionary from an index
> with multiple documents containing the same text, but with random weights
> between 0-100. Then I got duplicate identical suggestions sorted by weight:
> {code}
> {
>   "suggest":{"languages":{
>     "engl":{
>       "numFound":101,
>       "suggestions":[{
>           "term":"<b>Engl</b>ish",
>           "weight":100,
>           "payload":"0"},
>         {
>           "term":"<b>Engl</b>ish",
>           "weight":99,
>           "payload":"0"},
>         {
>           "term":"<b>Engl</b>ish",
>           "weight":98,
>           "payload":"0"},
> ---etc all the way down to 0---
> {code}
> I also reproduced the same behavior in AnalyzingInfixSuggester directly. So
> there is a need for some duplicate removal here, either while building the
> local suggest index or during lookup. Only the highest weight suggestion for
> a given term should be returned.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]