Hi,

Solr suggester is wonderful. We have been testing the built-in dictionary 
implementations on some large-ish datasets (36m, 132m), and getting response 
times from single digits to the teens of milliseconds with 9 dictionaries per 
request. Most of the resulting dictionaries have millions of entries too. 
I'm also intrigued by the "finite-state machines" in the prefix/fuzzy 
suggesters. Can’t wait to load test this properly.

Now I have some questions:

1. Term frequency, weight/count
The suggesters derive suggestions from a field in the index. What’s the 
feasibility of creating a custom dictionary that can automatically populate the 
weight/count field using term frequency (tf) during build time?

Autosuggest is, in most cases, ranked by popularity, e.g. “Apache Solr: 3” (say 
the term occurred 3 times in a field). Why must we specify a weight field 
explicitly for popularity ranking when tf data is readily available from the 
index?

In our tests, we had to index the datasets twice. In the second pass, the tf is 
looked up per doc via Solr’s terms component and stored in a weight field (for 
the suggester). The lookup is also needed for each dictionary field; we used 
4 fields for 9 dictionaries. This really provides extra “incentive” to do 
things in parallel!
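
To make the idea concrete, here is a rough sketch (plain Python, not Solr code) 
of what "populate the weight from tf at build time" would mean: count how often 
each value occurs and emit the count as the weight. The function names are made 
up for illustration, and the tab-separated "term<TAB>weight" output is only my 
assumption of a convenient file-based dictionary layout.

```python
from collections import Counter


def build_weighted_entries(field_values):
    """Count occurrences of each non-empty value; the count acts as the weight."""
    counts = Counter(v.strip() for v in field_values if v.strip())
    # Highest count first, then alphabetical, so the output order is stable.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def to_dictionary_lines(entries, delimiter="\t"):
    """Render (term, weight) pairs as 'term<delimiter>weight' lines (assumed layout)."""
    return [f"{term}{delimiter}{weight}" for term, weight in entries]


# Hypothetical field values standing in for one dictionary field.
values = ["Apache Solr", "Apache Solr", "SolrCloud", "Apache Solr", " "]
entries = build_weighted_entries(values)
lines = to_dictionary_lines(entries)
```

With something like this inside the dictionary build, the second indexing pass 
and the per-doc tf lookups would not be needed.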

2. “mm” lookup
The lookup logic for multi-term queries is currently “AND boolean”, i.e. the 
suggestion must match all terms in suggest.q. However, in our use case, we 
need “OR boolean” for one dictionary. This is a bit like “suggest.q infix”, 
e.g.:

suggest.q:
Apache Sol
suggestions:
Apache Solr
Solr
SolrCloud
We love Apache

I couldn’t get any of the existing lookup implementations to find the last 
three suggestions. Perhaps it’s time to dig into the code to see if a “minimum 
should match” mm (50% in the above case) feature is a possibility?
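
The behaviour I have in mind can be sketched as follows (again plain Python, 
not Solr code; the function name and the rule that each query term is treated 
as a case-insensitive prefix of a suggestion token are my own assumptions):

```python
import math


def matches_mm(query, suggestion, mm=0.5):
    """True if at least a fraction `mm` of the query terms match the suggestion.

    A query term matches when some whitespace-separated token of the
    suggestion starts with it (case-insensitive).
    """
    q_terms = query.lower().split()
    s_tokens = suggestion.lower().split()
    if not q_terms:
        return False
    hits = sum(1 for q in q_terms if any(t.startswith(q) for t in s_tokens))
    # Round up, and always require at least one matching term.
    required = max(1, math.ceil(mm * len(q_terms)))
    return hits >= required


candidates = ["Apache Solr", "Solr", "SolrCloud", "We love Apache", "Lucene"]
or_like = [c for c in candidates if matches_mm("Apache Sol", c, mm=0.5)]
and_like = [c for c in candidates if matches_mm("Apache Sol", c, mm=1.0)]
```

With mm at 50%, all four suggestions from the example are kept and only 
"Lucene" is dropped; with mm at 100% this reduces to the current AND behaviour 
and only "Apache Solr" is left.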

Thanks,

Boon


-----
Boon Low
Search Engineer / Lead Big Data
DCT Family History
http://uk.linkedin.com/in/boonlow/
