Re: Problem with german hyphenated words not being found

Upayavira Thu, 11 Jun 2015 03:03:30 -0700

Have you used the analysis tab in the admin UI? You can type in
sentences for both index and query time and see how they would be
analysed by various fields/field types.


Once you have got index time and query time to result in the same tokens
at the end of the analysis chain, you should start seeing matches in
your queries.
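
The same comparison can also be scripted rather than clicked through in the UI. The following is only a minimal sketch: it builds a request URL for Solr's field analysis handler (/analysis/field), assuming that handler is enabled on your core. The host, port and the core name "products" are placeholders, not taken from your setup, and the helper function name is made up for illustration.

```python
from urllib.parse import urlencode


def field_analysis_url(base_url, core, field_type, index_text, query_text):
    """Build a URL for Solr's field analysis handler (/analysis/field).

    base_url and core are placeholders for your own installation.
    """
    params = {
        "analysis.fieldtype": field_type,    # field type to analyse, e.g. "text"
        "analysis.fieldvalue": index_text,   # text run through the index-time chain
        "analysis.query": query_text,        # text run through the query-time chain
        "analysis.showmatch": "true",        # flag index tokens that match query tokens
        "wt": "json",
    }
    return f"{base_url}/{core}/analysis/field?{urlencode(params)}"


# Hypothetical host and core name; adjust to your installation.
url = field_analysis_url(
    "http://localhost:8983/solr",
    "products",
    "text",
    "Industrieanhänger",
    "industrie-anhänger",
)
print(url)
```

Fetching that URL should return the tokens produced after each stage of both analyzers, so you can compare the final stage of the index chain against the final stage of the query chain.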

Upayavira

On Thu, Jun 11, 2015, at 10:26 AM, Thomas Michael Engelke wrote:
>  Hey,
> 
> In German, you can string most nouns together by using hyphens, like
> this:
> 
> Industrie = industry
> Anhänger = trailer
> 
> Industrie-Anhänger = trailer for industrial use
> 
> Here [1], you can see me querying "Industrieanhänger" from the "name"
> field (name:Industrieanhänger), to make sure the index actually contains
> the word. Our data is structured so that products are listed without
> the hyphen.
> 
> Now, customers can come around and use the hyphenated version as a
> search term (e.g. "industrie-anhänger"), and of course we want them to
> find what they are looking for. I've set it up so that the
> WordDelimiterFilterFactory uses catenateWords="1", so that these words
> are catenated. An analysis of "Industrieanhänger" as index and
> "industrie-anhänger" as query can be seen here [2].
> 
> You can see that both word parts are found. However, querying for
> "industrie-anhänger" yields no results; results only appear when the
> hyphen is removed, as you can see here [3]. I'm not sure how to proceed,
> as the results of the analysis have so far always lined up with what I
> could see when querying. Here's the schema definition for "text", the
> field type for the "name" field:
> 
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100"
> autoGeneratePhraseQueries="true">
>  <analyzer type="index">
>  <tokenizer class="solr.StandardTokenizerFactory"/>
>  <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1"
> splitOnNumerics="1" generateWordParts="1" generateNumberParts="1"
> catenateWords="1" catenateNumbers="0" catenateAll="0"
> preserveOriginal="1"/>
>  <filter class="solr.LowerCaseFilterFactory"/>
>  <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
> dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3"
> maxSubwordSize="30" onlyLongestMatch="false"/>
>  <filter class="solr.StopFilterFactory" words="stopwords.txt"
> ignoreCase="true" enablePositionIncrements="true" format="snowball"/>
>  <filter class="solr.GermanNormalizationFilterFactory"/>
>  <filter class="solr.SnowballPorterFilterFactory" language="German2"
> protected="protwords.txt"/>
>  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>  </analyzer>
>  <analyzer type="query">
>  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>  <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1"
> splitOnNumerics="1" generateWordParts="1" generateNumberParts="1"
> catenateWords="1" catenateNumbers="0" catenateAll="0"
> preserveOriginal="1"/>
>  <filter class="solr.LowerCaseFilterFactory"/>
>  <!-- <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
> dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3"
> maxSubwordSize="30" onlyLongestMatch="false"/> -->
>  <filter class="solr.StopFilterFactory" words="stopwords.txt"
> ignoreCase="true" enablePositionIncrements="true" format="snowball"/>
>  <filter class="solr.GermanNormalizationFilterFactory"/>
>  <filter class="solr.SnowballPorterFilterFactory" language="German2"
> protected="protwords.txt"/>
>  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>  </analyzer>
> </fieldType>
> 
> I've also thought it might be a problem with URL encoding not encoding
> the hyphen, but replacing it with %2D didn't change the outcome (and was
> probably wrong anyway).
> 
> Any help is greatly appreciated. 
> 
> Links:
> ------
> [1] http://imgur.com/2oEC5vz
> [2] http://i.imgur.com/H0AhEsF.png
> [3] http://imgur.com/dzmMe7t
