I have studied some Russian. The textbooks gave me the impression that all the exceptions had already been 'found' and were listed in the book.
I do know that languages are living, changing organisms, but I would think Russian has got to be more regular than English, even WITH all six cases and three genders.

Dennis Gearon

--- On Tue, 7/27/10, Robert Muir <rcm...@gmail.com> wrote:

> From: Robert Muir <rcm...@gmail.com>
> Subject: Re: Russian stemmer
> To: solr-user@lucene.apache.org
> Date: Tuesday, July 27, 2010, 7:12 AM
>
> right, but your problem is this is the current output:
>
> Ковров -> Ковр
> Коврову -> Ковров
> Ковровом -> Ковров
> Коврове -> Ковров
>
> so, if Ковров was simply left alone, all your forms would match...
>
> 2010/7/27 Oleg Burlaca <o...@burlaca.com>
>
> > Thanks Robert for all your help,
> >
> > The idea of [A-Z].* stopwords is ideal for the english language,
> > although in russian nouns are inflected: Борис, Борису, Бориса, Борисом
> >
> > I'll try the RussianLightStemFilterFactory (the article in the PDF
> > mentioned it's more accurate).
> >
> > Once again thanks,
> > Oleg Burlaca
> >
> > On Tue, Jul 27, 2010 at 12:07 PM, Robert Muir <rcm...@gmail.com> wrote:
> >
> > > 2010/7/27 Oleg Burlaca <o...@burlaca.com>
> > >
> > > > Actually the situation with Немцов is ok,
> > > > I've just checked how Yandex works with Немцов and Немцова:
> > > > http://nano.yandex.ru/project/inflect/
> > > >
> > > > I think there are two solutions:
> > > > a) manually search for both Немцов and then Немцова
> > > > b) use wildcard query: Немцов*
> > > >
> > >
> > > Well, here is one idea of a more general solution.
> > > The problem with "protected words" is you must have a complete list.
> > >
> > > One idea would be to add a filter that protects any words from stemming
> > > that match a regular expression:
> > > In english maybe someone wants to avoid any capitalized words to reduce
> > > trouble: [A-Z].*
> > > in your case then some pattern like [A-Я].*ов might prevent problems.
> > >
> > > > Robert, thanks for the RussianLightStemFilterFactory info,
> > > > I've found this page
> > > > http://www.mail-archive.com/solr-comm...@lucene.apache.org/msg06857.html
> > > > that somehow describes it. Where can I read more about
> > > > RussianLightStemFilterFactory ?
> > > >
> > >
> > > Here is the link:
> > >
> > > http://doc.rero.ch/lm.php?url=1000,43,4,20091209094227-CA/Dolamic_Ljiljana_-_Indexing_and_Searching_Strategies_for_the_Russian_20091209.pdf
> > >
> > > > Regards,
> > > > Oleg
> > > >
> > > > 2010/7/27 Oleg Burlaca <o...@burlaca.com>
> > > >
> > > > > A similar word is Немцов.
> > > > > The strange thing is that searching for "Немцова" will not find
> > > > > documents containing "Немцов"
> > > > >
> > > > > Немцова: 14 articles
> > > > >
> > > > > http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2%D0%B0
> > > > >
> > > > > Немцов: 74 articles
> > > > >
> > > > > http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2
> > > > >
> > >
> > > --
> > > Robert Muir
> > > rcm...@gmail.com
> >
>
> --
> Robert Muir
> rcm...@gmail.com
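
For what it's worth, Robert's idea in the quoted thread (skip stemming for any word that matches a regular expression) can be sketched in a few lines of Python. This is only an illustration under assumptions: toy_stem and its suffix list are made up to reproduce the Ковров -> Ковр output quoted above, and they are not the Snowball or light Russian stemmer, nor any Solr API. The pattern spells out the Cyrillic capital range explicitly, because a Latin "A" in [A-Я] would give a different (and much wider) range.

```python
import re

# Pattern in the spirit of the thread: a capitalized Cyrillic word ending in "ов".
# \u0410-\u042F is Cyrillic А..Я, written as escapes to avoid a Latin "A" slipping in.
PROTECTED = re.compile(r"[\u0410-\u042F]\w*ов")

# Hypothetical suffix list, for illustration only (not a real Russian stemmer).
TOY_SUFFIXES = ("ом", "ов", "у", "е", "а")


def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def stem_unless_protected(word):
    """Leave the word alone if it fully matches PROTECTED, otherwise stem it."""
    if PROTECTED.fullmatch(word):
        return word
    return toy_stem(word)


if __name__ == "__main__":
    for w in ["Ковров", "Коврову", "Ковровом", "Коврове"]:
        print(w, "->", toy_stem(w), "|", stem_unless_protected(w))
```

With the pattern in place the base form is left alone, so all four forms reduce to the same term. The obvious trade-off is that every capitalized word ending in "ов" is protected, including a sentence-initial genitive plural, so the pattern would need tuning against real data.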