[jira] [Updated] (LUCENE-4481) AnalyzingSuggester may fail to return correct topN suggestions

Michael McCandless (JIRA) Fri, 19 Oct 2012 12:19:14 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-4481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Michael McCandless updated LUCENE-4481:
---------------------------------------

    Attachment: LUCENE-4481.patch

New patch, with a test for the third bug and the simplest fix (an
unbounded queue).  All tests pass ... so this is the starting point, and
we can now (separately) try to add back the optimizations.

Here's LookupBenchmarkTest with the patch:

{noformat}
[junit4:junit4] Suite: org.apache.lucene.search.suggest.LookupBenchmarkTest
[junit4:junit4]   2> -- construction time
[junit4:junit4]   2> JaspellLookup   input: 50001, time[ms]: 23 [+- 4.29]
[junit4:junit4]   2> TSTLookup       input: 50001, time[ms]: 71 [+- 5.10]
[junit4:junit4]   2> FSTCompletionLookup input: 50001, time[ms]: 121 [+- 7.23]
[junit4:junit4]   2> WFSTCompletionLookup input: 50001, time[ms]: 89 [+- 4.90]
[junit4:junit4]   2> AnalyzingSuggester input: 50001, time[ms]: 237 [+- 27.55]
[junit4:junit4] OK      11.9s | LookupBenchmarkTest.testConstructionTime
[junit4:junit4]   2> -- prefixes: 2-4, num: 7, onlyMorePopular: true
[junit4:junit4]   2> JaspellLookup   queries: 50001, time[ms]: 271 [+- 3.83], ~kQPS: 185
[junit4:junit4]   2> TSTLookup       queries: 50001, time[ms]: 732 [+- 8.97], ~kQPS: 68
[junit4:junit4]   2> FSTCompletionLookup queries: 50001, time[ms]: 121 [+- 4.34], ~kQPS: 413
[junit4:junit4]   2> WFSTCompletionLookup queries: 50001, time[ms]: 338 [+- 4.64], ~kQPS: 148
[junit4:junit4]   2> AnalyzingSuggester queries: 50001, time[ms]: 791 [+- 7.66], ~kQPS: 63
[junit4:junit4] OK      46.1s | LookupBenchmarkTest.testPerformanceOnPrefixes2_4
[junit4:junit4]   2> -- prefixes: 6-9, num: 7, onlyMorePopular: true
[junit4:junit4]   2> JaspellLookup   queries: 50001, time[ms]: 101 [+- 3.26], ~kQPS: 496
[junit4:junit4]   2> TSTLookup       queries: 50001, time[ms]: 88 [+- 2.78], ~kQPS: 568
[junit4:junit4]   2> FSTCompletionLookup queries: 50001, time[ms]: 151 [+- 3.36], ~kQPS: 332
[junit4:junit4]   2> WFSTCompletionLookup queries: 50001, time[ms]: 77 [+- 3.45], ~kQPS: 646
[junit4:junit4]   2> AnalyzingSuggester queries: 50001, time[ms]: 272 [+- 3.87], ~kQPS: 184
[junit4:junit4] OK      14.4s | LookupBenchmarkTest.testPerformanceOnPrefixes6_9
[junit4:junit4]   2> -- RAM consumption
[junit4:junit4]   2> JaspellLookup   size[B]:    9,815,152
[junit4:junit4]   2> TSTLookup       size[B]:    9,858,792
[junit4:junit4]   2> FSTCompletionLookup size[B]:      466,520
[junit4:junit4]   2> WFSTCompletionLookup size[B]:      507,640
[junit4:junit4]   2> AnalyzingSuggester size[B]:      889,138
[junit4:junit4] OK      0.74s | LookupBenchmarkTest.testStorageNeeds
[junit4:junit4]   2> -- prefixes: 100-200, num: 7, onlyMorePopular: true
[junit4:junit4]   2> JaspellLookup   queries: 50001, time[ms]: 71 [+- 3.14], ~kQPS: 702
[junit4:junit4]   2> TSTLookup       queries: 50001, time[ms]: 32 [+- 0.74], ~kQPS: 1561
[junit4:junit4]   2> FSTCompletionLookup queries: 50001, time[ms]: 145 [+- 3.60], ~kQPS: 344
[junit4:junit4]   2> WFSTCompletionLookup queries: 50001, time[ms]: 49 [+- 4.97], ~kQPS: 1029
[junit4:junit4]   2> AnalyzingSuggester queries: 50001, time[ms]: 235 [+- 3.52], ~kQPS: 212
[junit4:junit4] OK      11.3s | LookupBenchmarkTest.testPerformanceOnFullHits
[junit4:junit4] Completed in 84.81s, 5 tests
{noformat}

And on trunk:

{noformat}
[junit4:junit4] <JUnit4> says olá! Master seed: 827F8DD5C0F3472D
[junit4:junit4] Executing 1 suite with 1 JVM.
[junit4:junit4] 
[junit4:junit4] Suite: org.apache.lucene.search.suggest.LookupBenchmarkTest
[junit4:junit4]   2> -- prefixes: 6-9, num: 7, onlyMorePopular: true
[junit4:junit4]   2> JaspellLookup   queries: 50001, time[ms]: 114 [+- 2.83], ~kQPS: 439
[junit4:junit4]   2> TSTLookup       queries: 50001, time[ms]: 66 [+- 2.06], ~kQPS: 762
[junit4:junit4]   2> FSTCompletionLookup queries: 50001, time[ms]: 138 [+- 2.13], ~kQPS: 362
[junit4:junit4]   2> WFSTCompletionLookup queries: 50001, time[ms]: 69 [+- 4.75], ~kQPS: 725
[junit4:junit4]   2> AnalyzingSuggester queries: 50001, time[ms]: 260 [+- 5.00], ~kQPS: 192
[junit4:junit4] OK      15.0s | LookupBenchmarkTest.testPerformanceOnPrefixes6_9
[junit4:junit4]   2> -- construction time
[junit4:junit4]   2> JaspellLookup   input: 50001, time[ms]: 22 [+- 3.19]
[junit4:junit4]   2> TSTLookup       input: 50001, time[ms]: 64 [+- 2.32]
[junit4:junit4]   2> FSTCompletionLookup input: 50001, time[ms]: 120 [+- 2.99]
[junit4:junit4]   2> WFSTCompletionLookup input: 50001, time[ms]: 86 [+- 1.40]
[junit4:junit4]   2> AnalyzingSuggester input: 50001, time[ms]: 232 [+- 3.98]
[junit4:junit4] OK      10.7s | LookupBenchmarkTest.testConstructionTime
[junit4:junit4]   2> -- prefixes: 100-200, num: 7, onlyMorePopular: true
[junit4:junit4]   2> JaspellLookup   queries: 50001, time[ms]: 72 [+- 2.92], ~kQPS: 694
[junit4:junit4]   2> TSTLookup       queries: 50001, time[ms]: 32 [+- 3.12], ~kQPS: 1556
[junit4:junit4]   2> FSTCompletionLookup queries: 50001, time[ms]: 140 [+- 1.21], ~kQPS: 356
[junit4:junit4]   2> WFSTCompletionLookup queries: 50001, time[ms]: 45 [+- 1.74], ~kQPS: 1102
[junit4:junit4]   2> AnalyzingSuggester queries: 50001, time[ms]: 233 [+- 9.29], ~kQPS: 215
[junit4:junit4] OK      11.0s | LookupBenchmarkTest.testPerformanceOnFullHits
[junit4:junit4]   2> -- prefixes: 2-4, num: 7, onlyMorePopular: true
[junit4:junit4]   2> JaspellLookup   queries: 50001, time[ms]: 257 [+- 3.21], ~kQPS: 194
[junit4:junit4]   2> TSTLookup       queries: 50001, time[ms]: 510 [+- 5.35], ~kQPS: 98
[junit4:junit4]   2> FSTCompletionLookup queries: 50001, time[ms]: 119 [+- 3.17], ~kQPS: 421
[junit4:junit4]   2> WFSTCompletionLookup queries: 50001, time[ms]: 240 [+- 5.40], ~kQPS: 208
[junit4:junit4]   2> AnalyzingSuggester queries: 50001, time[ms]: 595 [+- 8.07], ~kQPS: 84
[junit4:junit4] OK      35.1s | LookupBenchmarkTest.testPerformanceOnPrefixes2_4
[junit4:junit4]   2> -- RAM consumption
[junit4:junit4]   2> JaspellLookup   size[B]:    9,815,152
[junit4:junit4]   2> TSTLookup       size[B]:    9,858,792
[junit4:junit4]   2> FSTCompletionLookup size[B]:      466,520
[junit4:junit4]   2> WFSTCompletionLookup size[B]:      507,640
[junit4:junit4]   2> AnalyzingSuggester size[B]:      889,138
[junit4:junit4] OK      0.86s | LookupBenchmarkTest.testStorageNeeds
[junit4:junit4] Completed in 72.97s, 5 tests
{noformat}

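So the lookup is definitely slower with the patch, most visibly on the
short (2-4) prefixes: WFSTCompletionLookup is most heavily affected
(240 ms on trunk vs. 338 ms with the patch), and AnalyzingSuggester
goes from 595 ms to 791 ms.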

                
> AnalyzingSuggester may fail to return correct topN suggestions
> --------------------------------------------------------------
>
>                 Key: LUCENE-4481
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4481
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.1, 5.0
>
>         Attachments: LUCENE-4481.patch, LUCENE-4481.patch, LUCENE-4481.patch
>
>
> I hit this when working on LUCENE-4480.
> Because AnalyzingSuggester may prune some of the topN paths found by FST's 
> Util.TopNSearcher, the queue size limit of topN makes the overall 
> search inadmissible, i.e. it may incorrectly prune paths that would have led 
> to a competitive path.
> However, such pruning is rare: it happens only for graph token streams, and 
> even then only when competitive analyzed forms share the same surface forms.
> The simplest way to fix this is to make the queue unbounded, but this is 
> likely a sizable performance hit ... I haven't tested yet.  It's even 
> possible the way the dups happen (always at the "end" of the suggestion, 
> because we tack on a 0 byte followed by an ord dedup byte) prevents this bug 
> from even occurring, and so this could all be a false alarm!  I have to try 
> to make a test case showing it ...
> A cop-out solution would be to expose a separate queueSize or queueMultiplier 
> (over the topN) so that if users are affected by this they could crank up the 
> queue size or multiplier.
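
To make the failure mode in the description above concrete, here is a
minimal sketch in Java.  It is a toy model, not Lucene's code: the class
BoundedQueueDemo, its Path type, the sample data and the queueSize
parameter are all hypothetical, and the real Util.TopNSearcher works on
partial FST paths rather than on a flat list.  The sketch only shows
that a queue capped at topN, followed by dropping duplicate surface
forms, can return fewer than topN suggestions, while an unbounded queue
or a larger queue (the queueMultiplier idea) does not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

// Toy model only: none of these names exist in Lucene.
final class BoundedQueueDemo {

    /** A completed path: the surface form it maps to and its cost (lower is better). */
    static final class Path {
        final String surface;
        final int cost;

        Path(String surface, int cost) {
            this.surface = surface;
            this.cost = cost;
        }
    }

    /** Hypothetical input: two analyzed forms share the surface form "foo". */
    static List<Path> samplePaths() {
        return Arrays.asList(
            new Path("foo", 1),
            new Path("foo", 2),
            new Path("bar", 3),
            new Path("baz", 4));
    }

    /** Keeps the queueSize cheapest paths, then returns up to topN distinct surface forms. */
    static List<String> suggest(List<Path> paths, int topN, int queueSize) {
        // Worst-first heap, so the most expensive path is evicted when the queue is full.
        PriorityQueue<Path> queue = new PriorityQueue<>(
            Comparator.comparingInt((Path p) -> p.cost).reversed());
        for (Path p : paths) {
            queue.add(p);
            if (queue.size() > queueSize) {
                queue.poll(); // prune the current worst path
            }
        }

        // Visit the surviving paths best-first and drop duplicate surface forms.
        List<Path> survivors = new ArrayList<>(queue);
        survivors.sort(Comparator.comparingInt((Path p) -> p.cost));
        Set<String> results = new LinkedHashSet<>();
        for (Path p : survivors) {
            if (results.size() == topN) {
                break;
            }
            results.add(p.surface);
        }
        return new ArrayList<>(results);
    }

    public static void main(String[] args) {
        List<Path> paths = samplePaths();
        int topN = 2;
        System.out.println(suggest(paths, topN, topN));              // [foo]
        System.out.println(suggest(paths, topN, Integer.MAX_VALUE)); // [foo, bar]
        System.out.println(suggest(paths, topN, topN * 2));          // [foo, bar]
    }
}
```

In this model the duplicate "foo" occupies a queue slot that "bar"
needed, so the topN-sized queue loses a competitive suggestion.  A
multiplier only helps while the number of duplicates stays below the
extra capacity, which is why the description calls it a cop-out.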

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
