On 2010-03-31 06:14, Andy wrote:


--- On Tue, 3/30/10, Andrzej Bialecki<a...@getopt.org>  wrote:

From: Andrzej Bialecki<a...@getopt.org>
Subject: Re: SOLR-1316 How To Implement this autosuggest component ???
To: solr-user@lucene.apache.org
Date: Tuesday, March 30, 2010, 9:59 AM
On 2010-03-30 15:42, Robert Muir
wrote:
On Mon, Mar 29, 2010 at 11:34 PM, Andy<angelf...@yahoo.com>
wrote:

Reading through this thread and SOLR-1316, there
seems to be a lot of
different ways to implement auto-complete in Solr.
I've seen the mentions
of:

EdgeNGrams
TermsComponent
Faceting
TST
Patricia Tries
RadixTree
DAWG



Another idea is you can use the Automaton support in
the lucene flexible
indexing branch: to query the index directly with a
DFA that represents
whatever terms you want back.
The idea is that there really isn't much gain in
building a separate Pat,
Radix Tree, or DFA to do this when you can efficiently
intersect a DFA with
the existing terms dictionary.

I don't really understand what autosuggest needs to
do, but if you are doing
things like looking for mispellings you can easily
build a DFA that
recognizes terms within some short edit distance with
the support thats
there (the LevenshteinAutomata class), to quickly get
back candidates.

You can intersect/concatenate/union these DFAs with
prefix or suffix DFAs if
you want too, don't really understand what the
algorithm should do, but I'm
happy to try to help.


The problem is a bit more complicated. There are two
issues:

* simple term-level completion often produces wrong results
for
multi-term queries (which are usually rewritten as "weak"
phrase queries),

* the weights of suggestions should not correspond directly
to IDF in
the index - much better results can be obtained when they
correspond to
the frequency of terms/phrases in the query logs ...

TermsComponent and EdgeNGrams, while simple to use, suffer
from both issues.


Thanks.

I actually have 2 use cases for autosuggest:

1) The "normal" one - I want to suggest search terms to users after they've 
typed a few letters. Just like Google suggest. Looks like for this use case SOLR-1316 is 
the best option. Right?

Hopefully, yes - it depends on how you intend to populate the TST. If you populate it from the main index, then (unless you have indexed phrases) there won't be any benefit over the TermsComponent. It may be faster, but it will take more RAM. If you populate it from a list of top-N queries, then SOLR-1316 is the way to go.

2) I have a field "city" with values that are entered by users. When a user is 
entering his city, I want to make suggestion based on what cities have already been 
entered so far by other users -- in order to reduce chances of duplication. What method 
would you recommend for this use case?

If the "city" field is not analyzed then TermsComponent is easiest to use. If it is analyzed, but vast majority of cities are single terms, then TermsComponent is ok too. If you want to assign different priorities to suggestions (other than a simple IDF based priority), or have many city names consisting of multiple tokens, then use SOLR-1316.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Reply via email to