Thanks. That gives me the full list. The odd thing to me is that none of those words will end up being effective in a search, so why not strip them all out during indexing?
Thanks again,
Jake.

-----Original Message-----
From: Rajesh Munavalli [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 30, 2006 5:24 PM
To: nutch-user@lucene.apache.org
Subject: Re: Common Terms

There is a list of stop words in the NutchAnalysis class (org.apache.nutch.analysis). I guess that's where the common terms are removed during analysis.

--Rajesh Munavalli
Blog: http://mathsearch.blogspot.com

Vanderdray, Jacob wrote:
> I've added some code to query-basic to log the query after it
> has run both addTerms and addPhrases. This helps me to better
> understand what's going on. I've noticed that when my search contains
> words like "the" or "a", those don't appear in the actual query.
>
> It looks to me like the common-terms.utf8 file is supposed to be
> used to strip common words like "the" out of queries for specific
> fields, but that doesn't seem to be what's happening. The term "the"
> ends up getting stripped out of the query for all fields (url, content,
> anchor, etc.). I even tried removing "the" from the common-terms.utf8
> file, but didn't see any change in behavior.
>
> Does this file only get used when indexing? If so what
> determines which words get stripped out of searches?
>
> Thanks,
> Jake.
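
For anyone following the thread, here is a rough sketch of the behavior described above. It assumes, as Rajesh guesses, that the analyzer carries its own stop list, separate from common-terms.utf8. Every name in it (ANALYZER_STOP_WORDS, COMMON_TERMS, analyze, build_query) and the contents of both word lists are made up for illustration. None of this is Nutch's actual code or API.

```python
# Hypothetical stand-ins: these are NOT Nutch's real word lists or APIs.
# Assumed to be built into the analyzer itself.
ANALYZER_STOP_WORDS = {"the", "a", "an", "of", "and"}
# Assumed to be loaded from a separate common-terms file.
COMMON_TERMS = {"the", "http", "www"}

# Fields named in the thread.
FIELDS = ("url", "content", "anchor")


def analyze(query):
    """Lowercase, split on whitespace, and drop the analyzer's stop words."""
    return [t for t in query.lower().split() if t not in ANALYZER_STOP_WORDS]


def build_query(query):
    """Expand the surviving terms across every field.

    Stop words are already gone by this point, so no per-field logic
    ever sees them. COMMON_TERMS is deliberately not consulted here.
    """
    terms = analyze(query)
    return {field: list(terms) for field in FIELDS}


if __name__ == "__main__":
    print(build_query("the nutch crawler"))
    # Editing the separate list changes nothing, because the analyzer
    # in this sketch never reads it.
    COMMON_TERMS.discard("the")
    print(build_query("the nutch crawler"))
```

Under that assumption, "the" disappears from every field before any per-field handling runs, and editing the separate list has no effect, which would match what Jake saw.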