[jira] [Commented] (LUCENE-3842) Analyzing Suggester

Sudarshan Gaikaiwari (JIRA) Wed, 30 May 2012 18:27:27 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286249#comment-13286249
 ]


Sudarshan Gaikaiwari commented on LUCENE-3842:
----------------------------------------------

Hi Michael,

Thanks a lot for opening up Util.shortestPaths. Now that I can seed the queue 
with the initial nodes using addStartPaths, the performance of the 
GeoSpatialSuggest that I presented at Lucene Revolution has improved by 2x.

While migrating my code to use this patch, I noticed that I would hit the 
following assertion in addIfCompetitive.

{code}

          path.input.length--;
          assert cmp != 0;
          if (cmp < 0) {
{code}

This assert fires when the path that we are trying to add to the queue cannot 
be distinguished from the bottom of the queue. This happens because the 
different paths that lead to FST nodes during the automaton/FST intersection 
are not stored, so the input path used to differentiate paths contains only 
the characters that have been consumed since one of the initial FST nodes.
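
To make the tie concrete, here is a minimal Java sketch. It is not Lucene's 
actual code: SketchPath, suffixOnly, withPrefix and compare are hypothetical 
names, and the ordering (cost first, then input) is an assumption modeled on 
the description above. It shows two equal-cost paths that start from different 
intersection nodes (reached via "fast g" and "quick g") becoming 
indistinguishable once the prefix that led to each start node is dropped.

```java
// Hypothetical stand-in for a search path: accumulated cost plus the input
// consumed so far. This is not Lucene's own path class.
final class SketchPath {
  final long cost;
  final String input;

  SketchPath(long cost, String input) {
    this.cost = cost;
    this.input = input;
  }
}

final class TieBreakSketch {
  // Assumed ordering: lower cost wins, equal costs are broken by input.
  static int compare(SketchPath a, SketchPath b) {
    int cmp = Long.compare(a.cost, b.cost);
    return cmp != 0 ? cmp : a.input.compareTo(b.input);
  }

  // Models the situation described above: only the characters consumed
  // after the start node are recorded, the prefix is thrown away.
  static SketchPath suffixOnly(String prefix, String consumed, long cost) {
    return new SketchPath(cost, consumed);
  }

  // Models a path that also remembers how its start node was reached.
  static SketchPath withPrefix(String prefix, String consumed, long cost) {
    return new SketchPath(cost, prefix + consumed);
  }

  public static void main(String[] args) {
    SketchPath a = suffixOnly("fast g", "h", 50);
    SketchPath b = suffixOnly("quick g", "h", 50);
    System.out.println("tie without prefix: " + (compare(a, b) == 0));

    SketchPath c = withPrefix("fast g", "h", 50);
    SketchPath d = withPrefix("quick g", "h", 50);
    System.out.println("tie with prefix: " + (compare(c, d) == 0));
  }
}
```

In the first case the comparison returns 0, which is the condition that 
"assert cmp != 0" rejects. In the second case the stored prefix breaks the tie.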

From your comments on the addStartPaths method, I think that you have foreseen 
this problem.

{code}
    // nocommit this should also take the starting
    // weight...?

    /** Adds all leaving arcs, including 'finished' arc, if
     *  the node is final, from this node into the queue.  */
    public void addStartPaths(FST.Arc<T> node, T startOutput, boolean allowEmptyString) throws IOException {
{code}

Here is a unit test that causes the assert to be triggered.

{code}
  public void testInputPathRequired() throws Exception {
    TermFreq keys[] = new TermFreq[] {
        new TermFreq("fast ghost", 50),
        new TermFreq("quick gazelle", 50),
        new TermFreq("fast ghoul", 50),
        new TermFreq("fast gizzard", 50),
    };

    SynonymMap.Builder b = new SynonymMap.Builder(false);
    b.add(new CharsRef("fast"), new CharsRef("quick"), true);
    final SynonymMap map = b.build();

    final Analyzer analyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer tokenizer = new MockTokenizer(reader, MockTokenizer.SIMPLE, true);
        TokenStream stream = new SynonymFilter(tokenizer, map, true);
        return new TokenStreamComponents(tokenizer, new RemoveDuplicatesTokenFilter(stream));
      }
    };
    AnalyzingCompletionLookup suggester = new AnalyzingCompletionLookup(analyzer);
    suggester.build(new TermFreqArrayIterator(keys));
    List<LookupResult> results = suggester.lookup("fast g", false, 2);
  }
{code}

Please let me know if the above analysis looks correct to you and I will start 
trying to fix this by storing paths during the FST automata intersection.
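
As a rough illustration of what storing paths during the intersection could 
look like, here is a toy Java sketch under loud assumptions: a plain character 
trie stands in for the FST, a list of accepted prefix strings stands in for 
the analyzed-query automaton, and every class and method name (ToyNode, 
ToyStartPath, PrefixTrackingSketch, intersect, completions) is hypothetical. 
It is not a patch against Util.shortestPaths, only the idea of pairing each 
start node with the input that reached it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy trie node standing in for an FST node (hypothetical, not Lucene's FST).
final class ToyNode {
  final Map<Character, ToyNode> children = new TreeMap<>();
  boolean isFinal;
}

// A start node together with the input that was consumed to reach it.
final class ToyStartPath {
  final ToyNode node;
  final String prefix;

  ToyStartPath(ToyNode node, String prefix) {
    this.node = node;
    this.prefix = prefix;
  }
}

final class PrefixTrackingSketch {
  static void add(ToyNode root, String key) {
    ToyNode n = root;
    for (char c : key.toCharArray()) {
      n = n.children.computeIfAbsent(c, k -> new ToyNode());
    }
    n.isFinal = true;
  }

  // Stand-in for the intersection: walk the trie along each accepted prefix
  // and remember the prefix next to the node it ends on.
  static List<ToyStartPath> intersect(ToyNode root, List<String> acceptedPrefixes) {
    List<ToyStartPath> starts = new ArrayList<>();
    for (String prefix : acceptedPrefixes) {
      ToyNode n = root;
      for (int i = 0; i < prefix.length() && n != null; i++) {
        n = n.children.get(prefix.charAt(i));
      }
      if (n != null) {
        starts.add(new ToyStartPath(n, prefix));
      }
    }
    return starts;
  }

  // Depth-first walk that carries the full input, stored prefix included.
  static void collect(ToyNode node, String inputSoFar, List<String> out) {
    if (node.isFinal) {
      out.add(inputSoFar);
    }
    for (Map.Entry<Character, ToyNode> e : node.children.entrySet()) {
      collect(e.getValue(), inputSoFar + e.getKey(), out);
    }
  }

  static List<String> completions(ToyNode root, List<String> acceptedPrefixes) {
    List<String> out = new ArrayList<>();
    for (ToyStartPath start : intersect(root, acceptedPrefixes)) {
      collect(start.node, start.prefix, out);
    }
    return out;
  }

  public static void main(String[] args) {
    ToyNode root = new ToyNode();
    add(root, "fast ghost");
    add(root, "quick ghost");
    add(root, "quick gazelle");
    List<String> prefixes = new ArrayList<>();
    prefixes.add("fast g");
    prefixes.add("quick g");
    System.out.println(completions(root, prefixes));
  }
}
```

Because each start node carries its own prefix, "fast ghost" and "quick ghost" 
remain distinct inputs even though both continue with "host" after the start 
node.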
                
> Analyzing Suggester
> -------------------
>
>                 Key: LUCENE-3842
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3842
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/spellchecker
>    Affects Versions: 3.6, 4.0
>            Reporter: Robert Muir
>         Attachments: LUCENE-3842-TokenStream_to_Automaton.patch, 
> LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
> LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch
>
>
> Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
> comparator in LUCENE-3801,
> I think we should look at implementing suggesters that have more capabilities 
> than just basic prefix matching.
> In particular I think the most flexible approach is to integrate with 
> Analyzer at both build and query time,
> such that we build a wFST with:
> input: analyzed text such as ghost0christmas0past <-- byte 0 here is an 
> optional token separator
> output: surface form such as "the ghost of christmas past"
> weight: the weight of the suggestion
> we make an FST with PairOutputs<weight,output>, but only do the shortest path 
> operation on the weight side (like
> the test in LUCENE-3801), at the same time accumulating the output (surface 
> form), which will be the actual suggestion.
> This allows a lot of flexibility:
> * Using even standardanalyzer means you can offer suggestions that ignore 
> stopwords, e.g. if you type in "ghost of chr...",
>   it will suggest "the ghost of christmas past"
> * we can add support for synonyms/wdf/etc at both index and query time (there 
> are tradeoffs here, and this is not implemented!)
> * this is a basis for more complicated suggesters such as Japanese 
> suggesters, where the analyzed form is in fact the reading,
>   so we would add a TokenFilter that copies ReadingAttribute into term text 
> to support that...
> * other general things like offering suggestions that are more "fuzzy" like 
> using a plural stemmer or ignoring accents or whatever.
> According to my benchmarks, suggestions are still very fast with the 
> prototype (e.g. ~ 100,000 QPS), and the FST size does not
> explode (it's short of twice that of a regular wFST, but this is still far 
> smaller than TST or JaSpell, etc).
