[jira] [Commented] (LUCENE-3714) add suggester that uses shortest path/wFST instead of buckets

Michael McCandless (Commented) (JIRA) Sat, 21 Jan 2012 13:46:03 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190536#comment-13190536
 ]


Michael McCandless commented on LUCENE-3714:
--------------------------------------------

Dawid, by problematic example, you mean you think this approach is functionally 
correct but may not perform very well...?

That is definitely the worst-case performance (for either top-1 or top-K on a 
wFST with simple PQ), but this (number of non-competitive arcs you have to scan 
and discard) is a constant factor on the overall complexity right?

I think we should at least test the simple PQ on PositiveIntsOutputs wFST and 
see how it performs in practice.  If indeed having everything "in one bucket" 
is too slow, we could combine the two approaches: still use buckets, but within 
each bucket we have a wFST (ie, use the "true" score), so we don't actually do 
any quantizing in the end results.  Then bucketing is purely an optimization...

Or, maybe, we could keep one bucket but sort each node's arcs by their output 
instead of by label.  This'd mean the initial lookup-by-prefix gets slower 
(linear scan instead a bin search, assuming those nodes had array'd arcs), but 
then producing the top-N is very fast (no wasted arcs need to be scanned).  
Maybe we could keep the by-label sort for nodes within depth N, and then sort 
by output beyond that...

Or we could change the outputs algebra so that more "lookahead" is stored in 
each output so we have more guidance on which arcs are worth pursuing...

                
> add suggester that uses shortest path/wFST instead of buckets
> -------------------------------------------------------------
>
>                 Key: LUCENE-3714
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3714
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/spellchecker
>            Reporter: Robert Muir
>         Attachments: LUCENE-3714.patch, out.png
>
>
> Currently the FST suggester (really an FSA) quantizes weights into buckets 
> (e.g. single byte) and puts them in front of the word.
> This makes it fast, but you lose granularity in your suggestions.
> Lately the question was raised, if you build lucene's FST with 
> positiveintoutputs, does it behave the same as a tropical semiring wFST?
> In other words, after completing the word, we instead traverse min(output) at 
> each node to find the 'shortest path' to the 
> best suggestion (with the highest score).
> This means we wouldnt need to quantize weights at all and it might make some 
> operations (e.g. adding fuzzy matching etc) a lot easier.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (LUCENE-3714) add suggester that uses shortest path/wFST instead of buckets

Reply via email to