autosuggest via solr.EdgeNGramFilterFactory (was: Re: wildcards in stopword list)

Lukas Kahwe Smith Wed, 03 Feb 2010 12:50:56 -0800

Hi Ahmet,

Well, after some more testing I am now convinced that you rock :)
I like the solution because it's obviously way less hacky and, more importantly,
I expect it to be a lot faster and less memory intensive: instead of a facet
prefix or terms search, I am doing an "equality" comparison on tokens (albeit a
fair number of them, but each much smaller). I also have more control over the
ordering of the results, and I can make full use of the stopword filter, which
again should improve the sort order (e.g. if I have a stopword "ag" and a word
starts with "ag", it will not be "overpowered" by tons of strings containing
"ag" as a single word). Obviously there is one limitation: search terms longer
than 20 characters will not match, but I think I can safely ignore this case.
Even with long German words, 15 letters should be enough to find what the user
is looking for. And if a word needs more characters, then it's probably a
meaningless suffix like "versicherungsgesellschaft", which just means
"insurance company", and the user is just being stupid.
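
To make the "equality comparison on tokens" idea concrete, here is a minimal
Python sketch of what the two analyzer chains roughly do. It is only an
illustration, not Solr's actual code: the function names and the STOPWORDS set
are made up, the word delimiter filter is left out, and requiring every query
term to match is an assumption.

```python
# Illustrative sketch only: mimics the idea of the analyzer chains, not Solr itself.
STOPWORDS = {"ag"}            # hypothetical stand-in for stopwords.txt
MIN_GRAM, MAX_GRAM = 1, 20    # same values as minGramSize / maxGramSize


def base_tokens(text):
    """Split on whitespace, lowercase, drop stopwords (index and query side)."""
    return [t for t in text.lower().split() if t not in STOPWORDS]


def index_tokens(text):
    """Index side: expand every remaining token into its leading n-grams."""
    grams = set()
    for token in base_tokens(text):
        for size in range(MIN_GRAM, min(len(token), MAX_GRAM) + 1):
            grams.add(token[:size])
    return grams


def matches(query, indexed_text):
    """Query side: no n-grams, so each query term must equal an indexed gram.

    Assumption: all query terms have to match.
    """
    grams = index_tokens(indexed_text)
    terms = base_tokens(query)
    return bool(terms) and all(term in grams for term in terms)
```

With this sketch, matches("agen", "Agentur Muster") is true, while
matches("agen", "Muster AG") is false because the stopword never reaches the
n-gram step. A 25-letter query such as "versicherungsgesellschaft" does not
match its own indexed form, since only the first 20 letters are stored as grams.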


I do lose the nice numbers telling the user how often a given term matched,
which has some merit for street/city names, less so for the names of people and
close to none for company names. There is also a minor niggle with how the data
is returned, which I discuss at the end of the email.

I am using the following in my schema.xml:

    <fieldType name="prefix_token" class="solr.TextField"
               positionIncrementGap="1">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                generateNumberParts="1" catenateWords="0" catenateNumbers="0"
                catenateAll="0" splitOnCaseChange="1" />
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
                maxGramSize="20" />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                generateNumberParts="1" catenateWords="0" catenateNumbers="0"
                catenateAll="0" splitOnCaseChange="1" />
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true" />
      </analyzer>
    </fieldType>

   <field name="name" type="prefix_token" indexed="true" stored="true" />
   <field name="firstname" type="prefix_token" indexed="true" stored="true" />
   <field name="email" type="prefix_token" indexed="true" stored="true" />
   <field name="city" type="prefix_token" indexed="true" stored="true" />
   <field name="street" type="prefix_token" indexed="true" stored="true" />
   <field name="telefon" type="prefix_token" indexed="true" stored="true" />
   <field name="id" type="string" indexed="true" stored="true" required="true" />

And finally the following in my solrconfig.xml:

  <requestHandler name="auto" class="solr.SearchHandler" default="true">
    <lst name="defaults">
     <str name="defType">dismax</str>
     <str name="echoParams">explicit</str>
     <int name="rows">10</int>
     <str name="qf">name firstname email^0.5 telefon^0.5 city^0.6 street^0.6</str>
     <str name="fl">name,firstname,telefon,email,city,street</str>
    </lst>
  </requestHandler>
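
As a rough illustration of how the qf boosts influence the ordering, here is a
simplified Python sketch of how dismax combines the per-field scores for a
single query term: the best boosted field wins, and Solr's "tie" parameter
(not set in the config above, so it stays at its default of 0.0) controls how
much the other fields add. The function name and the raw per-field scores are
made up; real scores come from Lucene's ranking.

```python
# Boosts copied from the qf parameter; fields without ^boost default to 1.0.
QF_BOOSTS = {
    "name": 1.0, "firstname": 1.0,
    "email": 0.5, "telefon": 0.5,
    "city": 0.6, "street": 0.6,
}


def dismax_score(field_scores, tie=0.0):
    """Combine hypothetical raw per-field scores for one query term.

    Result: best boosted field score plus tie times the sum of the others.
    """
    boosted = [score * QF_BOOSTS[field] for field, score in field_scores.items()]
    if not boosted:
        return 0.0
    best = max(boosted)
    return best + tie * (sum(boosted) - best)
```

For example, a raw score of 1.0 on name beats a raw score of 1.5 on email,
because the email score is halved by its boost before the comparison.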

This all works well. There is just one minor ugliness, which might still be
solvable inside Solr, but which I fixed in the PHP frontend logic. The issue is
that I obviously get all the fields returned for each document, and I need to
figure out which of them actually matched so that I can present it in the
autosuggest. Is there some Solr magic that will do this work for me?

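        $query = new SolrQuery($searchstring);
        $response = $this->solrClientAuto->query($query);
        $numFound = empty($response->response->numFound)
            ? 0
            : $response->response->numFound;
        $data = array('results' => array(), 'numFound' => $numFound);

        if (!empty($response->response->docs)) {
            // Use the part of the search string after the first space (or the
            // whole string if there is none), without quotes and padding.
            $pos = strpos($searchstring, ' ');
            $p = trim(str_replace('"', '', substr($searchstring, (int) $pos)));
            foreach ($response->response->docs as $doc) {
                foreach ((array) $doc as $value) {
                    // Skip an empty prefix and multi-valued or non-string fields.
                    if ($p === '' || !is_string($value)) {
                        continue;
                    }
                    // Keep values that start with the prefix or contain a
                    // word starting with it; the array keys deduplicate.
                    if (stripos($value, $p) === 0
                        || stripos($value, ' '.$p) !== false) {
                        $data['results'][$value] = 1;
                    }
                }
            }
        }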

Then again, I have to review with the UI guys whether we will always just show
the name anyway and replace the entire user-entered term with the name, which
should be sufficiently unique in most cases to get a small enough result set.

regards,
Lukas
