[ 
https://issues.apache.org/jira/browse/LUCENE-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13769459#comment-13769459
 ] 

Michael McCandless commented on LUCENE-5214:
--------------------------------------------

The build method basically just runs all incoming text through the
indexAnalyzer, appending ShingleFilter on the end to generate the
ngrams.  To "aggregate" the ngrams it simply writes them to the
offline sorter; this is nice and simple, and it works (thanks Rob)!
It is somewhat inefficient, though, in how much transient disk and
CPU it needs to sort all the ngrams.  It may be better to have an
in-memory hash that holds the frequent ngrams and periodically
flushes the "long tail" to free up RAM.  But this gets more
complex... the current code is very simple.
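
To make the build step concrete, here is a minimal plain-Java sketch of
"tokens in, shingles out, then sort".  It is not the patch's code and
uses no Lucene classes: ShingleSketch, shingles and sortedGrams are
hypothetical names, the token lists stand in for the output of the
indexAnalyzer, and an in-memory sort stands in for the offline sorter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for "analyzer + ShingleFilter + offline sort".
// None of these names are Lucene APIs.
class ShingleSketch {

  // Emit every gram of 1..maxGram consecutive tokens, joined by a space.
  static List<String> shingles(List<String> tokens, int maxGram) {
    List<String> out = new ArrayList<>();
    for (int start = 0; start < tokens.size(); start++) {
      StringBuilder gram = new StringBuilder();
      for (int len = 1; len <= maxGram && start + len <= tokens.size(); len++) {
        if (len > 1) {
          gram.append(' ');
        }
        gram.append(tokens.get(start + len - 1));
        out.add(gram.toString());
      }
    }
    return out;
  }

  // Assumption: everything fits in RAM, so an in-memory sort replaces the
  // offline sorter.  After sorting, equal grams sit next to each other.
  static List<String> sortedGrams(List<List<String>> docs, int maxGram) {
    List<String> all = new ArrayList<>();
    for (List<String> doc : docs) {
      all.addAll(shingles(doc, maxGram));
    }
    Collections.sort(all);
    return all;
  }

  public static void main(String[] args) {
    // prints [wizard, wizard of, of, of oz, oz]
    System.out.println(shingles(Arrays.asList("wizard", "of", "oz"), 2));
  }
}
```

The point of the sort is only to bring identical ngrams together; any
ordering that does so would serve.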

After sorting the ngrams, it walks them, counting up how many times
each gram occurred and then adding that to the FST.  Currently, I do
nothing with the surface form, i.e. the suggester only suggests the
analyzed forms, which may be too ... weird?  Though in playing around,
I think the analysis you generally want to do should be very "light",
so maybe this is OK.
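
The counting walk over the sorted ngrams can be sketched as below.
Again this is plain Java under stated assumptions, not the patch
itself: GramCounter and countSorted are hypothetical names, and a
LinkedHashMap stands in for the FST that the real code adds each
(gram, count) pair to.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: one pass over an already-sorted list of ngrams.
class GramCounter {

  // Equal grams are adjacent in sorted input, so a running count is enough.
  static Map<String, Long> countSorted(List<String> sorted) {
    Map<String, Long> counts = new LinkedHashMap<>();  // stand-in for the FST
    String current = null;
    long run = 0;
    for (String gram : sorted) {
      if (gram.equals(current)) {
        run++;
      } else {
        if (current != null) {
          counts.put(current, run);  // the real code would add to the FST here
        }
        current = gram;
        run = 1;
      }
    }
    if (current != null) {
      counts.put(current, run);  // flush the final run
    }
    return counts;
  }

  public static void main(String[] args) {
    System.out.println(countSorted(Arrays.asList("a", "a", "a b", "b", "b", "b")));
  }
}
```

Only the current gram and its count are held while walking, which is
what makes the sort-then-count approach simple.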

It can also save the surface form in the FST (I was doing that before;
it's commented out now), but ... how to disambiguate when several
surface forms map to the same analyzed form?  That code saves the
shortest one.  This also makes the FST even larger.

At lookup time I again just run the query through your analyzer +
ShingleFilter, and then first try to look up 3grams, failing that
2grams, etc.  I need to improve this to do some sort of smoothing like
"real" ngram language models do; it shouldn't be this "hard" backoff.
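
A rough sketch of the "hard" backoff described above, in plain Java.
It is an illustration under assumptions, not the suggester's code:
BackoffSketch and suggest are hypothetical names, a Map of gram counts
stands in for the FST, the query is assumed to be already analyzed into
tokens, and the sketch predicts the next whole token rather than
completing a partially typed one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of hard backoff over a map of ngram counts.
class BackoffSketch {

  // Try the longest context first; return as soon as any order has a hit.
  static List<String> suggest(Map<String, Long> counts, List<String> queryTokens,
                              int maxGram, int topN) {
    for (int order = maxGram; order >= 1; order--) {
      int ctxLen = order - 1;  // an order-n gram has n-1 context tokens
      if (ctxLen > queryTokens.size()) {
        continue;  // not enough query tokens for this order
      }
      List<String> ctx = queryTokens.subList(queryTokens.size() - ctxLen, queryTokens.size());
      String prefix = ctxLen == 0 ? "" : String.join(" ", ctx) + " ";

      List<Map.Entry<String, Long>> hits = new ArrayList<>();
      for (Map.Entry<String, Long> e : counts.entrySet()) {
        String gram = e.getKey();
        if (gram.startsWith(prefix) && gram.split(" ").length == order) {
          hits.add(e);
        }
      }
      if (hits.isEmpty()) {
        continue;  // hard backoff: drop to the next shorter context
      }
      // Highest count first; ties broken alphabetically for determinism.
      hits.sort((a, b) -> {
        int byCount = Long.compare(b.getValue(), a.getValue());
        return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
      });
      List<String> out = new ArrayList<>();
      for (int i = 0; i < hits.size() && i < topN; i++) {
        out.add(hits.get(i).getKey().substring(prefix.length()));  // predicted token
      }
      return out;
    }
    return new ArrayList<>();
  }

  public static void main(String[] args) {
    Map<String, Long> counts = new HashMap<>();
    counts.put("wizard of oz", 5L);
    counts.put("of oz", 7L);
    counts.put("of the", 9L);
    System.out.println(suggest(counts, Arrays.asList("king", "of"), 3, 2));
  }
}
```

Because the loop returns at the first order with any hit, a single rare
3gram hides all the 2gram evidence.  A smoothed variant would instead
mix scores from several orders rather than stopping at the first one.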

Anyway, it's great fun playing with the suggester live (using the simplistic
command-line tool in luceneutil, freedb/suggest.py) to "explore" the
ngram language model.  This is how I discovered LUCENE-5180.

                
> Add new FreeTextSuggester, to handle "long tail" suggestions
> ------------------------------------------------------------
>
>                 Key: LUCENE-5214
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5214
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/spellchecker
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 5.0, 4.6
>
>         Attachments: LUCENE-5214.patch
>
>
> The current suggesters are all based on a finite space of possible
> suggestions, i.e. the ones they were built on, so they can only
> suggest a full suggestion from that space.
> This means if the current query goes outside of that space then no
> suggestions will be found.
> The goal of FreeTextSuggester is to address this, by giving
> predictions based on an ngram language model, i.e. using the last few
> tokens from the user's query to predict the likely following token.
> I got the idea from this blog post about Google's suggest:
> http://googleblog.blogspot.com/2011/04/more-predictions-in-autocomplete.html
> This is very much still a work in progress, but it seems to be
> working.  I've tested it on the AOL query logs, using an interactive
> tool from luceneutil to show the suggestions, and it seems to work well.
> It's fun to use that tool to explore the word associations...
> I don't think this suggester would be used standalone; rather, I think
> it'd be a fallback for times when the primary suggester fails to find
> anything.  You can see this behavior on google.com, if you type "the
> fast and the ", you see entire queries being suggested, but then if
> the next word you type is "burning" then suddenly you see the
> suggestions are only based on the last word, not the entire query.
> It uses ShingleFilter under the hood to generate the token ngrams,
> and then stores the ngrams in an FST.  Once LUCENE-5180 is in, it
> will be able to properly handle a user query that ends with
> stop-words (e.g. "wizard of ").
