RE: Common Terms

Vanderdray, Jacob Thu, 30 Mar 2006 14:54:55 -0800

        Thanks.  That gives me the full list.  The odd thing to me is
that none of those words will end up being effective in a search, so why
not strip them all out during indexing?


Thanks again,
Jake.

-----Original Message-----
From: Rajesh Munavalli [mailto:[EMAIL PROTECTED] 
Sent: Thursday, March 30, 2006 5:24 PM
To: nutch-user@lucene.apache.org
Subject: Re: Common Terms

There is a list of stop words in NutchAnalysis class 
(org.apache.nutch.analysis). I guess that's where the common terms are 
removed during analysis.
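
To illustrate the idea, here is a minimal sketch of an analysis step that drops terms found in a fixed, in-code stop list. If the list lives in code like this, editing a separate file such as common-terms.utf8 would not change which words are stripped. The class name, method name, and stop list below are hypothetical and are not taken from the Nutch source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopWordSketch {
    // Hypothetical stop list for illustration only; NOT the actual list
    // defined in org.apache.nutch.analysis.NutchAnalysis.
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "the", "of", "to"));

    /** Splits a query on whitespace, lowercases each term, and drops stop words. */
    public static List<String> stripStopWords(String query) {
        List<String> kept = new ArrayList<>();
        for (String term : query.trim().split("\\s+")) {
            String lower = term.toLowerCase(Locale.ROOT);
            // Skip empty tokens (from an empty query) and anything in the stop list.
            if (!lower.isEmpty() && !STOP_WORDS.contains(lower)) {
                kept.add(lower);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Prints [history, search, engine]
        System.out.println(stripStopWords("the history of a search engine"));
    }
}
```

Because the filter runs before any per-field query clauses are built, a stripped word would be missing from every field (url, content, anchor, etc.), which matches the behavior described in the original question.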

--Rajesh Munavalli
Blog: http://mathsearch.blogspot.com

Vanderdray, Jacob wrote:
>       I've added some code to query-basic to log the query after it
> has run both addTerms and addPhrases.  This helps me to better
> understand what's going on.  I've noticed that when my search contains
> words like "the" or "a", those don't appear in the actual query.
>
>       It looks to me like the common-terms.utf8 file is supposed to be
> used to strip common words like "the" out of queries for specific
> fields, but that doesn't seem to be what's happening.  The term "the"
> ends up getting stripped out of the query for all fields (url,
> content, anchor, etc.).  I even tried removing "the" from the
> common-terms.utf8
> file, but didn't see any change in behavior.
>
>       Does this file only get used when indexing?  If so what
> determines which words get stripped out of searches?
>
> Thanks,
> Jake.
>
>
