Re: Problem with german hyphenated words not being found

Upayavira Thu, 11 Jun 2015 03:43:22 -0700

The next thing to do is add debugQuery=true to your URL (or enable it in
the query pane of the admin UI). Then look for the parsed query info.


On the standard text_en field, which includes an English stop word
filter, I ran a query for "Jack and Jill's House", which produced
this output:

    "rawquerystring": "text_en:(Jack and Jill's House)",
    "querystring": "text_en:(Jack and Jill's House)",
    "parsedquery": "text_en:jack text_en:jill text_en:hous",
    "parsedquery_toString": "text_en:jack text_en:jill text_en:hous",

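You can see that the parsed query is formed *after* analysis: the stop
word "and" has been dropped, the possessive has been stripped, and
"House" has been lowercased and stemmed to "hous". So you can see
exactly what is being queried for.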
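
If you would rather script this check than click through the admin UI,
something like the following should work. It is only a minimal sketch:
the host, port and core name ("collection1") are assumptions, and the
sample response is just the fragment quoted above wrapped in the
"debug" section that Solr returns with wt=json.

```python
from urllib.parse import urlencode

# Assumed base URL and core name; adjust to your installation.
SOLR_SELECT = "http://localhost:8983/solr/collection1/select"


def debug_query_url(query, base=SOLR_SELECT):
    """Build a select URL that asks Solr to include debug information."""
    params = {"q": query, "wt": "json", "debugQuery": "true", "rows": 0}
    return base + "?" + urlencode(params)


def parsed_query(response):
    """Return the post-analysis query from a decoded JSON response."""
    return response["debug"]["parsedquery"]


# Response fragment taken from the debug output quoted above.
sample = {
    "debug": {
        "rawquerystring": "text_en:(Jack and Jill's House)",
        "parsedquery": "text_en:jack text_en:jill text_en:hous",
    }
}

# The URL can be opened in a browser or fetched with urllib.request.
print(debug_query_url("name:industrie-anhänger"))
print(parsed_query(sample))
```

Comparing the parsed query for the hyphenated and the unhyphenated
search term side by side should show where the two diverge.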

Also, as a corollary to this, you can use the schema browser (or
faceting for that matter) to view what terms are being indexed, to see
if they should match.
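
For the faceting route, a rough sketch could look like this. Again the
URL and core name are assumptions, and the sample response is invented
purely to show the shape of the default JSON facet output (a flat list
of term, count, term, count); your real terms and counts will differ.

```python
from urllib.parse import urlencode

# Assumed base URL and core name; adjust to your installation.
SOLR_SELECT = "http://localhost:8983/solr/collection1/select"


def indexed_terms_url(field, prefix, limit=20, base=SOLR_SELECT):
    """Build a facet request listing indexed terms that start with prefix."""
    params = [
        ("q", "*:*"),
        ("rows", 0),
        ("wt", "json"),
        ("facet", "true"),
        ("facet.field", field),
        ("facet.prefix", prefix),
        ("facet.limit", limit),
        ("facet.mincount", 1),
    ]
    return base + "?" + urlencode(params)


def terms_from_facet(response, field):
    """Turn the flat [term, count, term, count, ...] list into a dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


# Hypothetical response, for illustration only (not real index data).
sample = {"facet_counts": {"facet_fields": {"name": ["term_a", 3, "term_b", 1]}}}

print(indexed_terms_url("name", "industrie"))
print(terms_from_facet(sample, "name"))
```

Facet values are the indexed terms, i.e. what comes out of the index
time analysis chain, so this shows what a query term has to match.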

HTH

Upayavira

> On 11.06.2015 at 12:00, Upayavira wrote:
>
>> Have you used the analysis tab in the admin UI? You can type in
>> sentences for both index and query time and see how they would be
>> analysed by various fields/field types.
>>
>> Once you have got index time and query time to result in the same
>> tokens at the end of the analysis chain, you should start seeing
>> matches in your queries.
>>
>> Upayavira
>>
>> On Thu, Jun 11, 2015, at 10:26 AM, Thomas Michael Engelke wrote:
>>> Hey,
>>>
>>> in German, you can string most nouns together by using hyphens,
>>> like this:
>>>
>>> Industrie = industry
>>> Anhänger = trailer
>>> Industrie-Anhänger = trailer for industrial use
>>>
>>> Here [1], you can see me querying "Industrieanhänger" from the
>>> "name" field (name:Industrieanhänger), to make sure the index
>>> actually contains the word. Our data is structured so that products
>>> are listed without the hyphen. Now, customers can come around and
>>> use the hyphenated version as a search term (e.g.
>>> "industrie-anhänger"), and of course we want them to find what they
>>> are looking for.
>>>
>>> I've set it up so that the WordDelimiterFilterFactory uses
>>> catenateWords="1", so that these words are catenated. An analysis of
>>> "Industrieanhänger" as index and "industrie-anhänger" as query can
>>> be seen here [2]. You can see that both word parts are found.
>>> However, querying for "industrie-anhänger" does not yield results,
>>> only when the hyphen is removed, as you can see here [3]. I'm not
>>> sure how to proceed from here, as the results of the analysis have
>>> so far always lined up with what I could see when querying.
>>>
>>> Here's the schema definition for "text", the field type for the
>>> "name" field:
>>>
>>> <fieldType name="text" class="solr.TextField"
>>>            positionIncrementGap="100"
>>>            autoGeneratePhraseQueries="true">
>>>   <analyzer type="index">
>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>     <filter class="solr.WordDelimiterFilterFactory"
>>>             splitOnCaseChange="1" splitOnNumerics="1"
>>>             generateWordParts="1" generateNumberParts="1"
>>>             catenateWords="1" catenateNumbers="0" catenateAll="0"
>>>             preserveOriginal="1"/>
>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>     <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
>>>             dictionary="dictionary.txt" minWordSize="5"
>>>             minSubwordSize="3" maxSubwordSize="30"
>>>             onlyLongestMatch="false"/>
>>>     <filter class="solr.StopFilterFactory" words="stopwords.txt"
>>>             ignoreCase="true" enablePositionIncrements="true"
>>>             format="snowball"/>
>>>     <filter class="solr.GermanNormalizationFilterFactory"/>
>>>     <filter class="solr.SnowballPorterFilterFactory"
>>>             language="German2" protected="protwords.txt"/>
>>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>   </analyzer>
>>>   <analyzer type="query">
>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>     <filter class="solr.WordDelimiterFilterFactory"
>>>             splitOnCaseChange="1" splitOnNumerics="1"
>>>             generateWordParts="1" generateNumberParts="1"
>>>             catenateWords="1" catenateNumbers="0" catenateAll="0"
>>>             preserveOriginal="1"/>
>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>     <!-- <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
>>>             dictionary="dictionary.txt" minWordSize="5"
>>>             minSubwordSize="3" maxSubwordSize="30"
>>>             onlyLongestMatch="false"/> -->
>>>     <filter class="solr.StopFilterFactory" words="stopwords.txt"
>>>             ignoreCase="true" enablePositionIncrements="true"
>>>             format="snowball"/>
>>>     <filter class="solr.GermanNormalizationFilterFactory"/>
>>>     <filter class="solr.SnowballPorterFilterFactory"
>>>             language="German2" protected="protwords.txt"/>
>>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>   </analyzer>
>>> </fieldType>
>>>
>>> I've also thought it might be a problem with URL encoding not
>>> encoding the hyphen, but replacing it with %2D didn't change the
>>> outcome (and was probably wrong anyway).
>>>
>>> Any help is greatly appreciated.
>>>
>>> Links:
>>> ------
>>> [1] http://imgur.com/2oEC5vz
>>> [2] http://i.imgur.com/H0AhEsF.png
>>> [3] http://imgur.com/dzmMe7t



