[jira] [Commented] (LUCENE-3714) add suggester that uses shortest path/wFST instead of buckets

Dawid Weiss (Commented) (JIRA) Thu, 26 Jan 2012 06:11:38 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13193855#comment-13193855
 ]


Dawid Weiss commented on LUCENE-3714:
-------------------------------------

Looked through the patch. Some comments:

+        // nocommit: why does the benchmark test supply duplicates? what 
should we do in this case?

ignore duplicates. I think it is allowed for the iterator to allow duplicates; 
I'm not sure but this may even be used when bucketing input -- the same entry 
with a different score (before bucketing) may end up identical after sorting.

+    // nocommit: shouldn't we have an option to
+    // match exactly even if its a crappy suggestion? we would just look
+    // for arc.isFinal here etc...

Yes, this should definitely be an option because if it's an exact match then 
you'll probably want it on top of the suggestions list, no matter what.

You could also add a generator of the "bad case" that I attached inside 
TestMe.java -- this creates the case when following greedily doesn't yield 
correct output (requires a pq).

I also checked the benchmark and yes, it uses exactMatchFirst promotion. This 
clarifies the speed difference for longer prefixes -- not enough results are 
collected in any of the buckets so no early termination occurs and _all_ 
buckets must be traversed.

I like WFSTCompletionLookup very much, clean and simple.
                
> add suggester that uses shortest path/wFST instead of buckets
> -------------------------------------------------------------
>
>                 Key: LUCENE-3714
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3714
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/spellchecker
>            Reporter: Robert Muir
>         Attachments: LUCENE-3714.patch, LUCENE-3714.patch, LUCENE-3714.patch, 
> TestMe.java, out.png
>
>
> Currently the FST suggester (really an FSA) quantizes weights into buckets 
> (e.g. single byte) and puts them in front of the word.
> This makes it fast, but you lose granularity in your suggestions.
> Lately the question was raised, if you build lucene's FST with 
> positiveintoutputs, does it behave the same as a tropical semiring wFST?
> In other words, after completing the word, we instead traverse min(output) at 
> each node to find the 'shortest path' to the 
> best suggestion (with the highest score).
> This means we wouldnt need to quantize weights at all and it might make some 
> operations (e.g. adding fuzzy matching etc) a lot easier.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (LUCENE-3714) add suggester that uses shortest path/wFST instead of buckets

Reply via email to