RE: Doub't in the way lucene works

Stu Hood Wed, 13 Feb 2008 15:36:11 -0800

Hello Roopesh,

What you are seeing is called 'stemming'. Stemming reduces each token to a 
language-specific root form (its stem), mostly by stripping suffixes. So, for 
instance, when you search for 'attach' you also get the word 'attachment', 
because both reduce to the same English stem.


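Newsletter is an interesting example: you will never get a match when you 
search for 'letter', because stemming only strips suffixes and never splits a 
word to find a term buried inside it. You don't get a match for 'news' for 
much the same reason. The stemmer is rule-based and knows nothing about 
compound words, so it does not reduce 'newsletter' to 'news', and the two end 
up as different terms in the index. The whitespace tokenizer and the word 
delimiter filter in your field type do not help either, since 'newsletter' 
contains no whitespace, delimiter, or case change to split on. (In the 
attach/attachment case, 'ment' is a suffix that the stemming rules do strip.)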
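
A minimal sketch of how stemmed matching behaves, in Python. This is a toy 
suffix stripper, not the Porter algorithm that EnglishPorterFilterFactory 
applies. The suffix list and the names toy_stem and matches are made up for 
illustration, so treat it as a picture of the idea rather than of Solr's 
actual output.

```python
# Toy suffix-stripping stemmer. NOT the Porter algorithm: the suffix list
# below is a hypothetical, hand-picked subset used only for illustration.
SUFFIXES = ("ments", "ment", "s")  # checked in order, longest first


def toy_stem(token: str) -> str:
    """Lowercase the token and strip the first matching suffix."""
    token = token.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are not emptied.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def matches(query: str, indexed_word: str) -> bool:
    """A term query matches only when both words reduce to the same stem."""
    return toy_stem(query) == toy_stem(indexed_word)


if __name__ == "__main__":
    for query, word in [
        ("attach", "attachments"),
        ("attachment", "attach"),
        ("news", "newsletter"),
        ("letter", "newsletter"),
    ]:
        print(query, word, matches(query, word))
```

The stem is compared as a whole term, so a query only matches when the 
entire stem is identical. Nothing in this scheme looks inside 'newsletter' 
for 'news' or 'letter'.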

I can't find any good Solr-specific stemming links, but check out the 
Wikipedia page: http://en.wikipedia.org/wiki/Stemming

Thanks,
Stu


-----Original Message-----
From: Roopesh P Raj <[EMAIL PROTECTED]>
Sent: Wednesday, February 13, 2008 1:43am
To: solr-dev@lucene.apache.org
Subject: Doub't in the way lucene works

Hi,

I am using Solr in my project. My schema is almost identical to the one in 
the example folder that ships with the Solr download. Most of the fields 
that I use are of type "text", and the rest are of type "string".

Some of the search results are as follows:

When I search with the query "attach", documents containing "attach", 
"attachment", or "attachments" come back as results.
When the search string is "attachment", documents containing "attach", 
"attachment", or "attachments" also come back as results.

When I search for "newsletter", documents containing the keyword 
"newsletter" are returned.
But when I search for "news", no results appear.
When I search for "letter", there are no results either.

Why does this happen?
Why does Lucene not return documents containing "newsletter" when the 
search string is "letter" or "news"?

I am also pasting the "text" fieldType declaration below. Please help me.

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory"
                synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="1"
                catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
                protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory"
                synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="0"
                catenateNumbers="0" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
                protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

Regards
Roopesh


------------------
DigitalGlue, India
