Right, but we should not encourage users to significantly degrade
overall relevance for all movies because of a few movies and a band
(very special cases, as I said).

In English, not using stopwords doesn't really degrade relevance
that much, so it's a reasonable decision to make. This is not
true in other languages!

Instead, systems that worry about all-stopword queries should use
CommonGrams. It will work better for these cases, without taking away
from overall relevance.
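
For reference, a minimal sketch of what a CommonGrams setup could look
like in schema.xml, reusing the stopwords.en.txt file from Markus's
example below as the common-words list. The field type name en_text_cg
and the choice of tokenizer and lowercase filter are assumptions made
for illustration, not taken from a tested config, so adapt them to your
existing analysis chain:

```xml
<!-- Hypothetical field type name; everything except the two CommonGrams
     factories is an assumed, minimal analysis chain. -->
<fieldType name="en_text_cg" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Index time: keep the single words and also emit a bigram for each
         pair of adjacent tokens that includes a common word. -->
    <filter class="solr.CommonGramsFilterFactory"
            words="stopwords.en.txt" ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Query time: prefer the bigrams over the single common words. -->
    <filter class="solr.CommonGramsQueryFilterFactory"
            words="stopwords.en.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```

The intent is that a query such as "the the" is turned into a bigram
term rather than being emptied by stopword removal. Whether to also keep
a StopFilter or a stemmer in the chain is left open here.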

On Wed, Jan 13, 2010 at 1:08 AM, Walter Underwood <wun...@wunderwood.org> wrote:
> There is a band named "The The". And a producer named "Don Was". For a list 
> of all-stopword movie titles at Netflix, see this post:
>
> http://wunderwood.org/most_casual_observer/2007/05/invisible_titles.html
>
> My favorite is "To Be and To Have (Être et Avoir)", which is all stopwords in 
> two languages. And a very good movie.
>
> wunder
>
> On Jan 12, 2010, at 6:55 PM, Robert Muir wrote:
>
>> Sorry, I forgot to include this 2009 paper comparing what stopwords do
>> across 3 languages:
>>
>> http://doc.rero.ch/lm.php?url=1000,43,4,20091218142456-GY/Dolamic_Ljiljana_-_When_Stopword_Lists_Make_the_Difference_20091218.pdf
>>
>> In my opinion, if stopwords annoy your users for very special cases
>> like 'the the', then instead consider using CommonGrams +
>> DefaultSimilarity.discountOverlaps = true so that you still get the
>> benefits.
>>
>> As you can see from the above paper, stopwords can be extremely important
>> depending on the language; they just don't matter so much for English.
>>
>> On Tue, Jan 12, 2010 at 9:20 PM, Lance Norskog <goks...@gmail.com> wrote:
>>> There are a lot of projects that don't use stopwords any more. You
>>> might consider dropping them altogether.
>>>
>>> On Mon, Jan 11, 2010 at 2:25 PM, Don Werve <d...@madwombat.com> wrote:
>>>> This is the way I've implemented multilingual search as well.
>>>>
>>>> 2010/1/11 Markus Jelsma <mar...@buyways.nl>
>>>>
>>>>> Hello,
>>>>>
>>>>>
>>>>> We have implemented language-specific search in Solr using
>>>>> language-specific fields and field types. For instance, an en_text field
>>>>> type can use an English stemmer and a list of stopwords and synonyms. We,
>>>>> however, did not use language-specific stopwords; instead we used one
>>>>> list shared by both languages.
>>>>>
>>>>> So you would have a field type like:
>>>>> <fieldType name="en_text" class="solr.TextField" ...
>>>>>  <analyzer type="">
>>>>>  <filter class="solr.StopFilterFactory" words="stopwords.en.txt"/>
>>>>>  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.en.txt"/>
>>>>>
>>>>> etc etc.
>>>>>
>>>>>
>>>>>
>>>>> Cheers,
>>>>>
>>>>> -
>>>>> Markus Jelsma          Buyways B.V.
>>>>> Technisch Architect    Friesestraatweg 215c
>>>>> http://www.buyways.nl  9743 AD Groningen
>>>>>
>>>>>
>>>>> Alg. 050-853 6600      KvK  01074105
>>>>> Tel. 050-853 6620      Fax. 050-3118124
>>>>> Mob. 06-5025 8350      In: http://www.linkedin.com/in/markus17
>>>>>
>>>>>
>>>>> On Mon, 2010-01-11 at 13:45 +0100, Daniel Persson wrote:
>>>>>
>>>>>> Hi Solr users.
>>>>>>
>>>>>> I'm trying to set up a site with Solr search integrated, and I use the
>>>>>> SolrJ API to feed the index with search documents. At the moment I
>>>>>> have only activated search on the English portion of the site. I'm
>>>>>> interested in using as many features of Solr as possible. Synonyms,
>>>>>> stopwords and stemming all sound quite interesting and useful, but how
>>>>>> do I set this up in a good way for a multilingual site?
>>>>>>
>>>>>> The site doesn't have a huge mass of text, so performance issues don't
>>>>>> really bother me, but I'd still like to hear your suggestions before I
>>>>>> try to implement a solution.
>>>>>>
>>>>>> Best regards
>>>>>>
>>>>>> Daniel
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Lance Norskog
>>> goks...@gmail.com
>>>
>>
>>
>>
>> --
>> Robert Muir
>> rcm...@gmail.com
>>
>
>



-- 
Robert Muir
rcm...@gmail.com
