Hi,

Over the last two weeks I have been trying to come up with an
optimal solution for auto suggest in the project I am currently working on.
In the application we have names of people and companies. The companies can
have German, English, Italian or French names. People have an additional
firstname field. We also want to do auto suggest on the street and city names
as well as on emails and telephone numbers, so we are treating phone numbers
as text.

We do have the option for the user to use phonetic searches or to split words
(especially the compound German words), but I guess we will leave that out of
the auto suggest.
We do expect that some users will type in properly cased strings, while some 
may just type in all lowercase.
We are using the dismax defType for our normal queries.

There will probably be less than 20M entities.

As such I guess the best approach is to copy all of the above mentioned fields 
(name, firstname, city, street, email, telefon) into a new field called "all".
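
A rough sketch of how that copy could look in schema.xml, assuming the source
fields are named exactly as listed above and that "all" uses the "textplain"
type defined further down. The indexed/stored/multiValued settings are my
assumptions, not something we have settled on:

    <!-- Catch-all field for auto suggest; multiValued because several
         source fields are copied into it (assumed settings) -->
    <field name="all" type="textplain" indexed="true" stored="false"
           multiValued="true"/>

    <!-- One copyField per source field -->
    <copyField source="name" dest="all"/>
    <copyField source="firstname" dest="all"/>
    <copyField source="city" dest="all"/>
    <copyField source="street" dest="all"/>
    <copyField source="email" dest="all"/>
    <copyField source="telefon" dest="all"/>
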
It seems the best approach is to use facet.prefix for our requirements. We will
therefore split off the last term in the query and pass it in as the
"facet.prefix" while the rest is passed in as the "q" parameter.

Since facets are driven by the indexed terms, we will use the following type
definition for this "all" field:
    <fieldType name="textplain" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                generateNumberParts="1" catenateWords="1" catenateNumbers="1"
                catenateAll="0" splitOnCaseChange="0"/>
      </analyzer>
    </fieldType>

So essentially the idea is to just split on whitespace, remove stop words and 
word delimiters.

The query would then look something like the following if the user entered
"Kaltenreider Ver":
http://localhost:8983/solr/core0/select?defType=dismax&qf=all&q=Kaltenreider&indent=on&facet=on&facet.limit=10&facet.mincount=1&facet.field=all&rows=0&facet.prefix=Ver

Does this approach make sense so far?
Do you expect this to perform decently on a dual quad core machine with 16 GB of
RAM, albeit all of that will be shared with Apache, a MySQL slave and a PHP app?
Questions like that are impossible to answer precisely, so I am just asking
whether you expect this to be really heavy. I noticed that in my initial testing
with 2M on my laptop facets seemed to be fine, though the first request was slow
and the memory use spiked to 300 MB. But I presume it's just loading stuff into
cache and concurrent requests shouldn't cause the memory use to go up linearly.

I am still a bit unsure how to handle both the lowercased and the case 
preserved version:

So here are some examples:
UBS => ubs|UBS
Kreuzstrasse => kreuzstrasse|Kreuzstrasse

So when I type "Kreu" I would get a suggestion of "Kreuzstrasse" and with 
"kreu" I would get "kreuzstrasse".
Since I do not expect any words to start with a lowercase letter and still 
contain some upper case letter we should be fine with this approach.

That is, I doubt there would be terms like "fooBar", which would lead to
suggesting both "foobar" and "fooBar".

How can I achieve this?
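
One direction I could imagine, purely as an untested sketch: instead of putting
both variants into the same field, keep "all" case-preserved and add a second,
lowercased field, then choose the facet.field depending on whether the typed
prefix contains an uppercase letter. The names "textplain_lc" and "all_lc" are
hypothetical, and as far as I know copyField does not chain from one copy
target to another, so the sources are listed again:

    <!-- Hypothetical lowercased twin of "textplain" -->
    <fieldType name="textplain_lc" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <!-- Hypothetical lowercased catch-all field (assumed settings) -->
    <field name="all_lc" type="textplain_lc" indexed="true" stored="false"
           multiValued="true"/>

    <copyField source="name" dest="all_lc"/>
    <copyField source="firstname" dest="all_lc"/>
    <copyField source="city" dest="all_lc"/>
    <copyField source="street" dest="all_lc"/>
    <copyField source="email" dest="all_lc"/>
    <copyField source="telefon" dest="all_lc"/>

With this, "Kreu" would facet on "all" and "kreu" on "all_lc", at the cost of
indexing the terms twice.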

regards,
Lukas Kahwe Smith
m...@pooteeweet.org
