Re: False Positives?

Walter Underwood Fri, 11 Jun 2010 08:52:43 -0700

This filter chain takes a word, stems it, then converts the stem to a phonetic 
representation. Two suggestions:

1. Only do one transformation for each field, such as stemming or phonetic encoding.
2. Stemming isn't useful for names, so leave it out of name fields.

You are also removing stopwords, which can be a problem for names.
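
A minimal sketch of how those points could look in schema.xml, assuming Solr 1.4 
and the factories already used in the posted schema. The type names text_name and 
text_name_phonetic and the field name_title_phonetic are hypothetical, made up for 
this illustration. Each field gets one kind of transformation, and neither one 
stems or removes stopwords.

```xml
<!-- Hypothetical type: plain name matching. No stopwords, no stemming. -->
<fieldType name="text_name" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical type: phonetic encoding only, applied to the unstemmed word. -->
<fieldType name="text_name_phonetic" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="false"/>
  </analyzer>
</fieldType>

<field name="name_title" type="text_name" indexed="true" stored="true" multiValued="false"/>
<!-- Hypothetical second field. copyField fills it from name_title at index time. -->
<field name="name_title_phonetic" type="text_name_phonetic" indexed="true" stored="false" multiValued="false"/>
<copyField source="name_title" dest="name_title_phonetic"/>
```

One way to use this is to search both fields and weight name_title higher, so 
that exact matches rank above sound-alike matches. A schema change like this 
requires reindexing.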

Here is an example of what that chain is doing. You should be able to see this 
with the analysis page in the admin UI.

"The Cars"
"cars"  (remove stopwords, lower case)
"car" (stem)
"KR" (phonetic)

There are some other problems here, like using synonyms at query time. That 
results in unexpected scoring, because the synonyms will have different IDFs. 
The variant that is most rare in the index will win. If they are applied at 
index time, all variants will have the same IDF.
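
As a rough sketch of index-time synonyms, the field type below moves 
SynonymFilterFactory into the index analyzer and drops it from the query 
analyzer. The file name index_synonyms.txt comes from the commented-out line in 
the posted schema. Setting expand="true" is an assumption here (the commented-out 
line uses expand="false"), and the other filters are omitted to keep the example 
short.

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Expand synonyms while indexing, so every variant is indexed
         in each document that contains any one of them. -->
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- No synonym filter at query time. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because every variant then appears in the same set of documents, the variants 
share one document frequency and so one IDF. Existing documents need to be 
reindexed for the change to take effect.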

wunder

On Jun 11, 2010, at 2:12 AM, jglazner wrote:

> 
> Chantal,
> 
> Thanks for the quick response:
> 
> Here is the field def from the schema for the field and the field type:
> 
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>      <analyzer type="index">
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>        <!-- in this example, we will only use synonyms at query time
>        <filter class="solr.SynonymFilterFactory"
> synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>        -->
>        <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt"/>
>        <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.EnglishPorterFilterFactory"
> protected="protwords.txt"/>
>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>        <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone"
> inject="false"/>
>      </analyzer>
>      <analyzer type="query">
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="true"/>
>        <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt"/>
>        <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.EnglishPorterFilterFactory"
> protected="protwords.txt"/>
>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>        <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone"
> inject="false"/>
>      </analyzer>
>    </fieldType>
> 
> <field name="name_title" type="text" indexed="true" stored="true"
> multiValued="false" />
> 
> Jed.
> 
> On Fri, Jun 11, 2010 at 3:05 AM, Chantal Ackermann [via Lucene]
> <ml-node+888051-797127135-9...@n3.nabble.com> wrote:
> 
>> Hi Jed,
>> 
>> please paste the complete field definition of "name_title" from your
>> schema.xml.
>> 
>> You are using an analyzer that reduces your text in an undesired way, on
>> both index and query side. You probably want "String" for names, or
>> similar.
>> 
>> "Spoorenberg" or "Saprano" are analyzed in the same way as
>> "springsteen", obviously. And the result is "SPRN" for all of them.
>> 
>> Chantal
>> 
>> 
>>> <str name="rawquerystring">springsteen</str>
>>> <str name="querystring">springsteen</str>
>>> <str name="parsedquery">name_title:SPRN</str>
>>> <str name="parsedquery_toString">name_title:SPRN</str>
>>> <lst name="explain">
>>> <str name="artist.artist.3106">
>>> 4.42386 = (MATCH) fieldWeight(name_title:SPRN in 3105), product of:
>>>  1.4142135 = tf(termFreq(name_title:SPRN)=2)
>>>  6.2562833 = idf(docFreq=704, maxDocs=135196)
>>>  0.5 = fieldNorm(field=name_title, doc=3105)
>>> </str>
>>> ..........
>>> This goes on for all the results. So as near as I could tell, it took the
>>> term springsteen and truncated it to sprn? But even so, how does sprn match
>>> "Spoorenberg" or "Saprano"?
>>> 
>>> I'm using solr 1.4
>>> 
>>> Thanks for any input you can give me.
>>> 
>>> Jed.
>>> 
> 
> -- 
> View this message in context: 
> http://lucene.472066.n3.nabble.com/False-Positives-tp888027p888077.html
> Sent from the Solr - User mailing list archive at Nabble.com.

--
Walter Underwood
Venture ASM, Troop 14, Palo Alto
