Right, but we should not encourage users to significantly degrade overall relevance for all movies because of a few movies and a band (very special cases, as I said).
In English, not using stopwords doesn't really degrade relevance that much, so it's a reasonable decision to make. This is not true in other languages!

Instead, systems that worry about all-stopword queries should use CommonGrams. It will work better for these cases, without taking away from overall relevance.

On Wed, Jan 13, 2010 at 1:08 AM, Walter Underwood <wun...@wunderwood.org> wrote:
> There is a band named "The The". And a producer named "Don Was". For a list
> of all-stopword movie titles at Netflix, see this post:
>
> http://wunderwood.org/most_casual_observer/2007/05/invisible_titles.html
>
> My favorite is "To Be and To Have (Être et Avoir)", which is all stopwords in
> two languages. And a very good movie.
>
> wunder
>
> On Jan 12, 2010, at 6:55 PM, Robert Muir wrote:
>
>> sorry, i forgot to include this 2009 paper comparing what stopwords do
>> across 3 languages:
>>
>> http://doc.rero.ch/lm.php?url=1000,43,4,20091218142456-GY/Dolamic_Ljiljana_-_When_Stopword_Lists_Make_the_Difference_20091218.pdf
>>
>> in my opinion, if stopwords annoy your users for very special cases
>> like 'the the' then, instead consider using commongrams +
>> defaultsimilarity.discountOverlaps = true so that you still get the
>> benefits.
>>
>> as you can see from the above paper, they can be extremely important
>> depending on the language, they just don't matter so much for English.
>>
>> On Tue, Jan 12, 2010 at 9:20 PM, Lance Norskog <goks...@gmail.com> wrote:
>>> There are a lot of projects that don't use stopwords any more. You
>>> might consider dropping them altogether.
>>>
>>> On Mon, Jan 11, 2010 at 2:25 PM, Don Werve <d...@madwombat.com> wrote:
>>>> This is the way I've implemented multilingual search as well.
>>>>
>>>> 2010/1/11 Markus Jelsma <mar...@buyways.nl>
>>>>
>>>>> Hello,
>>>>>
>>>>> We have implemented language specific search in Solr using language
>>>>> specific fields and field types. For instance, an en_text field type can
>>>>> use an English stemmer, and list of stopwords and synonyms. We, however
>>>>> did not use specific stopwords, instead we used one list shared by both
>>>>> languages.
>>>>>
>>>>> So you would have a field type like:
>>>>> <fieldType name="en_text" class="solr.TextField" ...
>>>>> <analyzer type="">
>>>>> <filter class="solr.StopFilterFactory" words="stopwords.en.txt"/>
>>>>> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.en.txt"/>
>>>>>
>>>>> etc etc.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> -
>>>>> Markus Jelsma          Buyways B.V.
>>>>> Technisch Architect    Friesestraatweg 215c
>>>>> http://www.buyways.nl  9743 AD Groningen
>>>>>
>>>>> Alg. 050-853 6600      KvK 01074105
>>>>> Tel. 050-853 6620      Fax. 050-3118124
>>>>> Mob. 06-5025 8350      In: http://www.linkedin.com/in/markus17
>>>>>
>>>>> On Mon, 2010-01-11 at 13:45 +0100, Daniel Persson wrote:
>>>>>
>>>>>> Hi Solr users.
>>>>>>
>>>>>> I'm trying to set up a site with Solr search integrated. And I use the
>>>>>> SolJava API to feed the index with search documents. At the moment I
>>>>>> have only activated search on the English portion of the site. I'm
>>>>>> interested in using as many features of solr as possible. Synonyms,
>>>>>> Stopwords and stems all sound quite interesting and useful but how do
>>>>>> I set up this in a good way for a multilingual site?
>>>>>>
>>>>>> The site doesn't have a huge text mass so performance issues don't
>>>>>> really bother me but still I'd like to hear your suggestions before I
>>>>>> try to implement a solution.
>>>>>>
>>>>>> Best regards
>>>>>>
>>>>>> Daniel
>>>>>
>>>>
>>>
>>> --
>>> Lance Norskog
>>> goks...@gmail.com
>>>
>>
>> --
>> Robert Muir
>> rcm...@gmail.com
>>
>

--
Robert Muir
rcm...@gmail.com
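
P.S. For anyone who wants to try the CommonGrams route suggested at the top of this message, a minimal schema.xml sketch might look roughly like the one below. It is only an illustration under assumptions: the field type name en_text_cg is made up, the tokenizer choice is arbitrary, and stopwords.en.txt is borrowed from the example quoted above. The idea is to swap StopFilterFactory for the CommonGrams filters, so that common words are kept and additionally paired with their neighbours (a query like "the the" then has a bigram token to match on).

```xml
<!-- Sketch only: "en_text_cg" is a hypothetical name, not from the thread. -->
<fieldType name="en_text_cg" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Keeps single words and also emits word pairs involving listed common words -->
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.en.txt" ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Query-side counterpart of the index-time filter above -->
    <filter class="solr.CommonGramsQueryFilterFactory" words="stopwords.en.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```

The discountOverlaps part is a setting on Lucene's DefaultSimilarity. How it gets wired into Solr depends on the version in use, so it is left out of this sketch.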