[ https://issues.apache.org/jira/browse/LUCENE-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3714:
--------------------------------

    Attachment: LUCENE-3714.patch

Patch with a prototype suggester.

There are many nocommits: in particular, the encoding of weights is very 
inefficient and slows down/bloats the FST (but it was easy for a prototype).

We should write a better Outputs class with the algebra we need (e.g. max), and 
maybe rethink floats in this suggester API. Judging from API names like 
TermFreq, these values are really frequencies, so maybe we should define them 
as ints? I don't like floats because of all the hassles like NaN/Inf.
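
As a strawman for the weight question, here is a minimal sketch (not what the 
attached patch does; the class and method names are made up for illustration) 
of mapping integer weights onto non-negative costs, so that the FST's shortest 
(minimum-output) path corresponds to the highest-weighted suggestion:

{code:java}
/** Illustration only: invert weights into costs for a min-path FST. */
public final class WeightToCost {
  private WeightToCost() {}

  /** Higher weight => smaller cost => shorter path. */
  public static long weightToCost(long weight, long maxWeight) {
    if (weight < 0 || weight > maxWeight) {
      throw new IllegalArgumentException("weight out of range: " + weight);
    }
    return maxWeight - weight; // always >= 0, so it fits a positive-only output
  }

  /** Recover the original weight from a stored cost. */
  public static long costToWeight(long cost, long maxWeight) {
    return maxWeight - cost;
  }
}
{code}

An encoding like this keeps a positive-only outputs implementation while still 
preserving exact weights, at the price of knowing (or bounding) maxWeight up 
front; an Outputs class with a real max/min algebra would avoid that.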

But even with the inefficient encoding, the performance seems promising:

{noformat}
    [junit] ------------- Standard Error -----------------
    [junit] -- prefixes: 100-200, num: 7, onlyMorePopular: true
    [junit] JaspellLookup   queries: 50001, time[ms]: 91 [+- 6.59], ~qps: 547
    [junit] TSTLookup       queries: 50001, time[ms]: 33 [+- 1.55], ~qps: 1532
    [junit] FSTCompletionLookup queries: 50001, time[ms]: 163 [+- 13.04], ~qps: 307
    [junit] WFSTCompletionLookup queries: 50001, time[ms]: 55 [+- 1.50], ~qps: 910
    [junit] -- construction time
    [junit] JaspellLookup   input: 50001, time[ms]: 23 [+- 0.93]
    [junit] TSTLookup       input: 50001, time[ms]: 31 [+- 1.25]
    [junit] FSTCompletionLookup input: 50001, time[ms]: 72 [+- 1.73]
    [junit] WFSTCompletionLookup input: 50001, time[ms]: 69 [+- 2.50]
    [junit] -- RAM consumption
    [junit] JaspellLookup   size[B]:    7,869,415
    [junit] TSTLookup       size[B]:    7,914,524
    [junit] FSTCompletionLookup size[B]:      466,051
    [junit] WFSTCompletionLookup size[B]:      506,662
    [junit] -- prefixes: 6-9, num: 7, onlyMorePopular: true
    [junit] JaspellLookup   queries: 50001, time[ms]: 129 [+- 1.37], ~qps: 389
    [junit] TSTLookup       queries: 50001, time[ms]: 139 [+- 2.30], ~qps: 361
    [junit] FSTCompletionLookup queries: 50001, time[ms]: 177 [+- 1.62], ~qps: 282
    [junit] WFSTCompletionLookup queries: 50001, time[ms]: 85 [+- 12.74], ~qps: 592
    [junit] -- prefixes: 2-4, num: 7, onlyMorePopular: true
    [junit] JaspellLookup   queries: 50001, time[ms]: 363 [+- 16.75], ~qps: 138
    [junit] TSTLookup       queries: 50001, time[ms]: 1112 [+- 22.57], ~qps: 45
    [junit] FSTCompletionLookup queries: 50001, time[ms]: 140 [+- 3.16], ~qps: 356
    [junit] WFSTCompletionLookup queries: 50001, time[ms]: 242 [+- 4.01], ~qps: 207
    [junit] ------------- ---------------- ---------------
{noformat}
                
> add suggester that uses shortest path/wFST instead of buckets
> -------------------------------------------------------------
>
>                 Key: LUCENE-3714
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3714
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/spellchecker
>            Reporter: Robert Muir
>         Attachments: LUCENE-3714.patch, LUCENE-3714.patch, LUCENE-3714.patch, 
> TestMe.java, out.png
>
>
> Currently the FST suggester (really an FSA) quantizes weights into buckets 
> (e.g. single byte) and puts them in front of the word.
> This makes it fast, but you lose granularity in your suggestions.
> Lately the question was raised: if you build Lucene's FST with 
> PositiveIntOutputs, does it behave the same as a tropical-semiring wFST?
> In other words, after completing the word, we instead traverse min(output) at 
> each node to find the 'shortest path' to the best suggestion (the one with 
> the highest score).
> This means we wouldn't need to quantize weights at all, and it might make 
> some operations (e.g. adding fuzzy matching) a lot easier.
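
To make the quoted idea concrete, here is a minimal sketch in plain Java (not 
Lucene's FST API; the Node/Arc classes are made up for illustration) of 
following the minimum output at each node to reach the lowest-cost completion. 
It assumes costs have already been pushed toward the root, the way Lucene's 
FST builder shares outputs, so the locally cheapest arc lies on the globally 
cheapest path, and for simplicity it assumes final nodes have no outgoing arcs:

{code:java}
import java.util.ArrayList;
import java.util.List;

final class MinCostCompletion {

  static final class Node {
    final boolean isFinal;
    final List<Arc> arcs = new ArrayList<Arc>();
    Node(boolean isFinal) { this.isFinal = isFinal; }
  }

  static final class Arc {
    final char label;
    final long cost;   // the 'output' carried by this transition
    final Node target;
    Arc(char label, long cost, Node target) {
      this.label = label; this.cost = cost; this.target = target;
    }
  }

  /** Match the prefix exactly, then greedily follow the cheapest arc. */
  static String completeBest(Node root, String prefix) {
    Node node = root;
    StringBuilder result = new StringBuilder(prefix);
    // 1. consume the prefix
    for (int i = 0; i < prefix.length(); i++) {
      Arc match = null;
      for (Arc a : node.arcs) {
        if (a.label == prefix.charAt(i)) { match = a; break; }
      }
      if (match == null) return null; // prefix not in the automaton
      node = match.target;
    }
    // 2. descend along the minimum-cost arc until a final node is reached
    while (!node.isFinal) {
      Arc best = null;
      for (Arc a : node.arcs) {
        if (best == null || a.cost < best.cost) best = a;
      }
      if (best == null) return null; // dead end (malformed automaton)
      result.append(best.label);
      node = best.target;
    }
    return result.toString();
  }
}
{code}

With exact costs on the arcs there is no bucketing step at all, which is the 
granularity win described above; extending this from the single best 
completion to the top-N would mean a small priority queue over partial paths 
instead of a single greedy walk.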
