Re: LucidWorks Solr
On Wed, Apr 21, 2010 at 1:38 PM, Shashi Kant wrote:

> Why do these approaches have to be mutually exclusive?
> Do a dictionary lookup, if no satisfactory match found use an
> algorithmic stemmer. Would probably save a few CPU cycles by
> algorithmic stemming iff necessary.

By the way, if you want to do this, you can do it easily in Solr trunk.

Just put a StemmerOverrideFilterFactory in front of your stemmer, containing tab-separated dictionary-word stem mappings. In the test-files directory is an example of this (stemdict.txt):

    # test that we can override the stemming algorithm with our own mappings
    # these must be tab-separated
    monkeys	monkey
    otters	otter
    # some crazy ones that a stemmer would never do
    dogs	cat

You can use this factory, or the new KeywordMarkerFilterFactory, which is similar but simply takes a text file like protwords.txt, for the stemmer to ignore.

Both of these filters set a special attribute for this token in the tokenstream that all stemmers respect, and they won't do any stemming on this token.

--
Robert Muir
rcm...@gmail.com
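The "dictionary first, algorithm second" idea from this message can be sketched outside Solr in a few lines of Python. This is only an illustration, not how the Solr filters are implemented: `toy_suffix_stemmer` is a hypothetical stand-in for a real algorithmic stemmer, and the `protected` set plays the role of a protwords.txt-style list. The mapping data mirrors the stemdict.txt example above.

```python
import io


def load_stem_overrides(lines):
    """Parse tab-separated 'word<TAB>stem' lines, skipping blanks and '#' comments."""
    overrides = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        word, stem = line.split("\t")
        overrides[word] = stem
    return overrides


def toy_suffix_stemmer(token):
    """Hypothetical stand-in for an algorithmic stemmer: strips a plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def stem(token, overrides, protected=frozenset()):
    """Protected words are left alone, dictionary mappings win, the algorithm is the fallback."""
    if token in protected:
        return token
    if token in overrides:
        return overrides[token]
    return toy_suffix_stemmer(token)


# Same mappings as the stemdict.txt example in the message.
STEMDICT = (
    "# these must be tab-separated\n"
    "monkeys\tmonkey\n"
    "otters\totter\n"
    "dogs\tcat\n"
)
overrides = load_stem_overrides(io.StringIO(STEMDICT))
```

With this, `stem("dogs", overrides)` takes the dictionary mapping, while a word that is not listed, such as "cats", falls through to the suffix rule.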
Re: LucidWorks Solr
I like this discussion very much. It is a really complex topic.

I want to add another example. In English, you say "it is a red dress". In German that is "es ist ein rotes Kleid" (the words can be translated in the same order). However, the base form of "rotes" is "rot". If your users search for "rotes Kleid", they may also expect matching documents that contain "Kleid rot" or something like that.

To draw a conclusion from the discussion so far, I want to quote Mark:

Mark Miller-3 wrote:
> are you going to
> want documents that had run and water when you searched for running
> water?

In most cases, everyone disagrees.

Let's take an abstract view of this topic: when should an application stem a word, and when not?

In my opinion, it always makes sense to stem adjectives, and in most languages one can decide whether a word is an adjective or not, even if it is not known to any dictionary.

But what about verbs? Is a stemmed verb worth less than a stemmed adjective? In the case of titles, which are short, I think yes. In the case of longer texts like articles and descriptions, I am not sure.

The same applies to lemmatization. Yes, reducing the word will work fine in a lot of cases where the word is known. However, does it really make sense all the time?

I want to emphasize that this discussion only makes sense if we are talking about search applications that are tolerant and built for highly relevant search results without exact matching.

Kind regards
- Mitch
Re: LucidWorks Solr
On Wed, Apr 21, 2010 at 3:29 PM, Mark Miller wrote:

> Stemming/lemmatization will pretty much always improve recall at the cost of
> precision - that's nothing new. If you stem instead, are you going to want
> documents that had run and water when you searched for running water? I just
> don't see this point as an argument against lemmatization and in favor of
> stemming.

It's not really supposed to be an argument in favor of stemming. I just don't think lemmatization/dictionary resources are any better.

Here's a test that seems to agree: http://www.clef-campaign.org/2003/WN_web/19.pdf
(for languages with compound word forms, the lexical approach helps, obviously, but for languages like English or Italian, it doesn't)

--
Robert Muir
rcm...@gmail.com
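To make the recall-versus-precision point concrete, here is a small worked example in Python. Everything in it is made up for illustration: the four documents, the relevance judgement, and the tiny `STEMS` table that stands in for a real stemmer. It only shows the direction of the trade-off for the "running water" query, not measured behaviour of any Solr analyzer.

```python
# Hypothetical corpus: doc id -> text
DOCS = {
    1: "running water from the tap",
    2: "hot running water",
    3: "a run along the water",
    4: "water runs from the tap",
}
# Assumed relevance judgement for the query "running water".
RELEVANT = {1, 2, 4}

# Stand-in for a stemmer: conflates a few forms of "run".
STEMS = {"running": "run", "runs": "run"}


def normalize(text, use_stemming):
    tokens = text.lower().split()
    if use_stemming:
        tokens = [STEMS.get(t, t) for t in tokens]
    return set(tokens)


def search(query, use_stemming):
    """Return ids of docs containing every query term (simple AND matching)."""
    wanted = normalize(query, use_stemming)
    return {doc_id for doc_id, text in DOCS.items()
            if wanted <= normalize(text, use_stemming)}


def precision_recall(retrieved, relevant):
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


exact = search("running water", use_stemming=False)
stemmed = search("running water", use_stemming=True)
```

Without stemming, only documents 1 and 2 match, so precision is perfect but document 4 is missed. With stemming, document 4 is found (recall goes up) and document 3 comes along with it (precision goes down).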
Re: LucidWorks Solr
On 4/21/10 3:22 PM, Robert Muir wrote:
> On Wed, Apr 21, 2010 at 2:26 PM, Mark Miller wrote:
>
>> It's an orthogonal issue - running will have that problem no matter what. It
>> doesn't affect whether a user that types running may be just as interested
>> in a doc that matches all of their other terms but has ran instead of
>> running. It's also just a simple example.
>
> It's not orthogonal, e.g. "running water"

Stemming/lemmatization will pretty much always improve recall at the cost of precision - that's nothing new. If you stem instead, are you going to want documents that had run and water when you searched for running water? I just don't see this point as an argument against lemmatization and in favor of stemming.

- Mark
Re: LucidWorks Solr
On Wed, Apr 21, 2010 at 2:26 PM, Mark Miller wrote: > > Its an orthogonal issue - running will have that problem no matter what. It > doesn't affect whether a user that types running may be just as interested > in a doc that matches all of their other terms but has ran instead of > running. Its also just a simple example. > > Its not orthogonal, e.g. "running water" -- Robert Muir rcm...@gmail.com
Re: Stemming [was: LucidWorks Solr]
IMHO, a 'stemmer' (being a specific 'thing') is exactly that: an algorithm for stemming. A database or lexicon is not referred to as a 'stemmer'. One can perform "stemming" using a lexicon if that's their need. For me, it's more than just stemming, because some words have morphology totally separate from "stems" that needs to be known in searching and NLP (e.g. verb conjugations). Maybe some call that stemming too, but I never have personally.

On Wed, 2010-04-21 at 10:18 -0700, Chris Hostetter wrote:
> : Regarding stemmers, I ditched them altogether a long time ago in favor
> : of a dictionary of morphologies of all known words (for any given
> : language). A simple lookup of any word morphology thus produces the set,
> : including the correct stem.
>
> Strictly speaking: you haven't "ditched" stemmers altogether -- you've
> ditched *algorithmic* stemmers and moved to a *dictionary* based stemmer
> -- but it's still a stemmer.
>
> (i just don't want people reading this thread to be confused about
> terminology)
>
> -Hoss
Re: LucidWorks Solr
On 4/21/10 2:20 PM, Robert Muir wrote: On Wed, Apr 21, 2010 at 2:09 PM, Mark Miller wrote: Right - I agree they both have their strengths and weakness' - but you usually don't get things like running->ran with stemming. Like most things, its a tradeoff. There is always a hybrid approach as well. I think running/ran has more problems, the word is so ambiguous that whether or not your search engine stems it right isn't going to matter anyway (running for office has nothing to do with running shoes, etc) Its an orthogonal issue - running will have that problem no matter what. It doesn't affect whether a user that types running may be just as interested in a doc that matches all of their other terms but has ran instead of running. Its also just a simple example. - Mark
Re: LucidWorks Solr
On Wed, Apr 21, 2010 at 2:09 PM, Mark Miller wrote: > > Right - I agree they both have their strengths and weakness' - but you > usually don't get things like running->ran with stemming. Like most things, > its a tradeoff. There is always a hybrid approach as well. > > I think running/ran has more problems, the word is so ambiguous that whether or not your search engine stems it right isn't going to matter anyway (running for office has nothing to do with running shoes, etc) -- Robert Muir rcm...@gmail.com
Re: LucidWorks Solr
On 4/21/10 2:02 PM, Robert Muir wrote: On Wed, Apr 21, 2010 at 1:49 PM, Mark Miller wrote: I believe that's covered by morphology? The problem is typically a morphological analyzer emits multiple solutions, which include POS. So morphology can tell you that "building" has two solutions: the gerund form which you might stem to "build", or the noun form which you would stem to "building". But, you need more stuff (POS tagging, etc) to decide which to pick to arrive at a lemma... and if your users are entering very short queries you can see how this could be inaccurate, since there isn't much context. So what snowball does (simply stemming build, building, buildings all to "build") might seem silly at first, but you can see how it avoids this entire mess. Right - I agree they both have their strengths and weakness' - but you usually don't get things like running->ran with stemming. Like most things, its a tradeoff. There is always a hybrid approach as well. - Mark
Re: LucidWorks Solr
On Wed, Apr 21, 2010 at 1:49 PM, Mark Miller wrote: > > I believe that's covered by morphology? > > The problem is typically a morphological analyzer emits multiple solutions, which include POS. So morphology can tell you that "building" has two solutions: the gerund form which you might stem to "build", or the noun form which you would stem to "building". But, you need more stuff (POS tagging, etc) to decide which to pick to arrive at a lemma... and if your users are entering very short queries you can see how this could be inaccurate, since there isn't much context. So what snowball does (simply stemming build, building, buildings all to "build") might seem silly at first, but you can see how it avoids this entire mess. -- Robert Muir rcm...@gmail.com
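The "building" ambiguity can be illustrated with a short Python sketch. The `ANALYSES` table is a hypothetical three-entry lexicon, and `toy_stem` is a crude suffix stripper standing in for an aggressive algorithmic stemmer; it is not Snowball itself. The point is only that the lexicon lookup returns two candidates until a part-of-speech tag is supplied, while the stemmer sidesteps the question by collapsing everything to one key.

```python
# Hypothetical mini-lexicon: surface form -> list of (part_of_speech, lemma) analyses
ANALYSES = {
    "build": [("VERB", "build")],
    "building": [("VERB", "build"), ("NOUN", "building")],
    "buildings": [("NOUN", "building")],
}


def lemmatize(token, pos=None):
    """Return the candidate lemmas; ambiguous unless a POS tag narrows them down."""
    candidates = ANALYSES.get(token, [])
    if pos is not None:
        candidates = [c for c in candidates if c[0] == pos]
    return sorted({lemma for _, lemma in candidates})


def toy_stem(token):
    """Stand-in for an aggressive algorithmic stemmer (illustrative rules only)."""
    for suffix in ("ings", "ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


ambiguous = lemmatize("building")            # two candidates without context
as_noun = lemmatize("building", pos="NOUN")  # one candidate once POS is known
conflated = {toy_stem(t) for t in ("build", "building", "buildings")}
```

For a two-word query there is rarely enough context to produce the POS tag that `lemmatize` needs, which is the inaccuracy the message describes.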
Re: LucidWorks Solr
On 4/21/10 1:43 PM, Walter Underwood wrote: On Apr 21, 2010, at 10:30 AM, Mark Miller wrote: But they don't usually call 'non algorithmic' stemming 'stemming'. Stemming usually means using a simple heuristic process. When you use vocabulary and morphology, its usually called lemmatization rather than stemming. "stemmer" is jargon that does not have a precise definition. Usually, as the wikipedia article Robert linked to states, stemming is done without knowledge of the context of the word. With stemming you are not necessarily finding lemmas - just stems. Stems can be anything as long as the same word always stems to the same thing - lemmas are more than that. I don't think the definition is super precise, but I also wouldn't call it jargon. For example, the LinguistX morphological analyzers are called "stemmers" and they provide options that are dictionary-based inflectional, dictionary-based derivational, and algorithmic. You can also combine those, so you can get accurate dictionary-based stems, then use an algorithmic stemmer on words not in the dictionary. That just sounds like a mix of stemming and lemmatization. - Mark
Re: LucidWorks Solr
On 4/21/10 1:43 PM, Robert Muir wrote: On Wed, Apr 21, 2010 at 1:30 PM, Mark Miller wrote: But they don't usually call 'non algorithmic' stemming 'stemming'. Stemming usually means using a simple heuristic process. When you use vocabulary and morphology, its usually called lemmatization rather than stemming. Lemmatization usually requires part-of-speech, too. I was gonna use my build, building, buildings example but I see wikipedia already has a nice explained example (meeting) here: http://en.wikipedia.org/wiki/Lemmatisation I believe that's covered by morphology? - Mark
Re: LucidWorks Solr
On Wed, Apr 21, 2010 at 1:30 PM, Mark Miller wrote: > > But they don't usually call 'non algorithmic' stemming 'stemming'. > Stemming usually means using a simple heuristic process. When you use > vocabulary and morphology, its usually called lemmatization rather than > stemming. > > Lemmatization usually requires part-of-speech, too. I was gonna use my build, building, buildings example but I see wikipedia already has a nice explained example (meeting) here: http://en.wikipedia.org/wiki/Lemmatisation -- Robert Muir rcm...@gmail.com
Re: LucidWorks Solr
On Apr 21, 2010, at 10:30 AM, Mark Miller wrote:

> But they don't usually call 'non algorithmic' stemming 'stemming'. Stemming
> usually means using a simple heuristic process. When you use vocabulary and
> morphology, its usually called lemmatization rather than stemming.

"Stemmer" is jargon that does not have a precise definition. For example, the LinguistX morphological analyzers are called "stemmers" and they provide options that are dictionary-based inflectional, dictionary-based derivational, and algorithmic. You can also combine those, so you can get accurate dictionary-based stems, then use an algorithmic stemmer on words not in the dictionary.

Stemmers may convert the surface word to a dictionary form (inflectional), to a root dictionary form (derivational), or to a non-word key (the Porter algorithm). Arabic and Hebrew stemmers often choose an intermediate form with some vowel marks rather than the all-consonant "Semitic root". Language is complicated.

Maintaining a high-quality dictionary is expensive, so you probably won't find many free ones.

wunder
--
Walter Underwood
Lead Engineer, Mark Logic
Re: LucidWorks Solr
Why do these approaches have to be mutually exclusive?
Do a dictionary lookup; if no satisfactory match is found, use an algorithmic stemmer. That would probably save a few CPU cycles by running the algorithmic stemmer only if necessary.

On Wed, Apr 21, 2010 at 1:31 PM, Robert Muir wrote:
> While its easy to look at the "faults" of some algorithmic stemmer, in truth its
> only purpose is to cause related forms of the word to conflate to the same
> form, and hopefully avoiding unrelated terms from conflating to this form.
>
> A dictionary-based stemmer is out-of-date the day you put it into
> production: languages aren't static. For example, you can't expect a
> dictionary-based stemmer to properly deal with forms like "googling" or
> "tweets" that have recently slipped into English vocabulary, but an
> algorithmic stemmer will likely deal with these just fine.
Re: LucidWorks Solr
On Wed, Apr 21, 2010 at 1:18 PM, Chris Hostetter wrote: > > Strictly speaking: you haven't "ditched" stemmers altogether -- you've > ditched *algorithmic* stemmers and moved to a *dictionary* based stemmer > -- but it's still a stemmer. > > (i just don't want people reading this thread to be confused about > terminology) > > I agree, and dictionary-based stemming has its own set of problems. While its easy to look at the "faults" of some algorithmic stemmer, in truth its only purpose is to cause related forms of the word to conflate to the same form, and hopefully avoiding unrelated terms from conflating to this form. A dictionary-based stemmer is out-of-date the day you put it into production: languages aren't static. For example, you can't expect a dictionary-based stemmer to properly deal with forms like "googling" or "tweets" that have recently slipped into English vocabulary, but an algorithmic stemmer will likely deal with these just fine. -- Robert Muir rcm...@gmail.com
Re: LucidWorks Solr
On 4/21/10 1:18 PM, Chris Hostetter wrote: : Regarding stemmers, I ditched them altogether a long time ago in favor : of a dictionary of morphologies of all known words (for any given : language). A simple lookup of any word morphology thus produces the set, : including the correct stem. Strictly speaking: you haven't "ditched" stemmers altogether -- you've ditched *algorithmic* stemmers and moved to a *dictionary* based stemmer -- but it's still a stemmer. (i just don't want people reading this thread to be confused about terminology) -Hoss But they don't usually call 'non algorithmic' stemming 'stemming'. Stemming usually means using a simple heuristic process. When you use vocabulary and morphology, its usually called lemmatization rather than stemming. - Mark
Re: LucidWorks Solr
: Regarding stemmers, I ditched them altogether a long time ago in favor : of a dictionary of morphologies of all known words (for any given : language). A simple lookup of any word morphology thus produces the set, : including the correct stem. Strictly speaking: you haven't "ditched" stemmers altogether -- you've ditched *algorithmic* stemmers and moved to a *dictionary* based stemmer -- but it's still a stemmer. (i just don't want people reading this thread to be confused about terminology) -Hoss
Re: LucidWorks Solr
> Andy, > > This will help with smooth injection of your multilingual > documents into Solr (multilingual either in the sense of 1 > doc containing fields in multiple languages or 1 index > containing documents in different languages): > > http://sematext.com/products/multilingual-indexer/index.html Otis, Thanks for the info. Is multilingual indexer an open source project or a commercial product? That web page doesn't mention anything about either open source or a price, so it's hard to tell.
Re: LucidWorks Solr
Andy,

This will help with smooth injection of your multilingual documents into Solr (multilingual either in the sense of 1 doc containing fields in multiple languages or 1 index containing documents in different languages):

http://sematext.com/products/multilingual-indexer/index.html

Re your other question about open-source morpho dictionaries - I don't know of any. Last time I looked for dictionaries I learned that they cost money. That said, the market for datasets is starting to grow, so you may be able to find more and cheaper dictionaries now.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message -----
> From: Andy
> To: solr-user@lucene.apache.org
> Sent: Mon, April 19, 2010 8:45:40 AM
> Subject: Re: LucidWorks Solr
>
> Thanks for the explanation Mitch. You're right. There can't be universal
> stemmers. What about multi-language stemmers? I'm mostly interested in
> English, Spanish, German, French, Italian. Are there any stemmers that would
> handle those languages? If not, what's the recommended way to deal with
> documents in multiple languages?
>
> --- On Mon, 4/19/10, MitchK <mitc...@web.de> wrote:
>
> > From: MitchK <mitc...@web.de>
> > Subject: Re: LucidWorks Solr
> > To: solr-user@lucene.apache.org
> > Date: Monday, April 19, 2010, 4:36 AM
> >
> > Andy, I think it is important to know what a stemmer really is.
> >
> > It reduces words to their infinitves. Those infinitives do not refer to the
> > real infinitive everytime, but however: for the system, it is an infinitive,
> > since all its derivates could be reduced to the same form.
> > Thats a stemmer.
> >
> > According to this, there can't exist a stemmer for every language, because
> > every language has got its own rules of how to reduce a word to its
> > infinitive.
> >
> > If you apply a stemmer for english language on a german document, the
> > results might be unexpected. However, sometimes it still works good enough.
> >
> > Keep in mind that this is an algorithm. It is not important whether the
> > created infinitive is the real infinitive. It is only important that most of
> > the derivate forms can be reduced to the same basic form. Please ask, if
> > something is not clear.
> >
> > KStem:
> > The wiki[1] says that KStem is less aggressive as the standard stemmer.
> > I guess that this means that there are more rules for how to reduce a word
> > to its infinitive and according to this the results might be better.
> >
> > [1] http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters/Kstem
> >
> > Kind regards
> > - Mitch
Re: LucidWorks Solr
no big deal, just wanted to mention. On Mon, Apr 19, 2010 at 1:24 PM, wrote: > > This is a little bit of hijacking going on here, but > You are right. Accept my regrets. > > > > It's algorithmic. That is, there isn't a list of variants that > > stem to the same infinitive, and your statement > > "always the same infintive for any derivate of the word" > > isn't quite what happens. > > > > Stemmers will always produce the same infinitive for any given > > word, just the opposite of what you said. But it is NOT guaranteed > > that a stemmer will always produce the same infinitive for all > > derivatives. Rather it just does a pretty darn good job with some > > anomalies because the rules don't cover all the edge cases. > > > > Their *goal* is to do it perfectly, but we all know about unachievable > > goals... > > > > HTH > > Erick > > > > On Mon, Apr 19, 2010 at 12:28 PM, MitchK wrote: > > > >> > >> I am curious: > >> The idea behind a stemmer is not that he produces the correct infinitive > >> for > >> a given word. The idea is that he produces always the same infintive for > >> any > >> derivate of the word. > >> > >> What would be, if there is an unknown word? For example something like > >> slang? How does your solution works here? Does it scale? > >> > >> Thank you for sharing experiences. :) > >> > >> - Mitch > >> -- > >> View this message in context: > >> http://n3.nabble.com/LucidWorks-Solr-tp727341p730059.html > >> Sent from the Solr - User mailing list archive at Nabble.com. > >> > > > >
Re: LucidWorks Solr
> This is a little bit of hijacking going on here, but You are right. Accept my regrets. > It's algorithmic. That is, there isn't a list of variants that > stem to the same infinitive, and your statement > "always the same infintive for any derivate of the word" > isn't quite what happens. > > Stemmers will always produce the same infinitive for any given > word, just the opposite of what you said. But it is NOT guaranteed > that a stemmer will always produce the same infinitive for all > derivatives. Rather it just does a pretty darn good job with some > anomalies because the rules don't cover all the edge cases. > > Their *goal* is to do it perfectly, but we all know about unachievable > goals... > > HTH > Erick > > On Mon, Apr 19, 2010 at 12:28 PM, MitchK wrote: > >> >> I am curious: >> The idea behind a stemmer is not that he produces the correct infinitive >> for >> a given word. The idea is that he produces always the same infintive for >> any >> derivate of the word. >> >> What would be, if there is an unknown word? For example something like >> slang? How does your solution works here? Does it scale? >> >> Thank you for sharing experiences. :) >> >> - Mitch >> -- >> View this message in context: >> http://n3.nabble.com/LucidWorks-Solr-tp727341p730059.html >> Sent from the Solr - User mailing list archive at Nabble.com. >> >
Re: LucidWorks Solr
My use requires more correct processing of language than what you define as a stemmer. My experience with stemmers is that even for words that have no stem, they make up a new word. I consider those false positives. My approach is based on the need to recognize that walk, walked, walking all refer to the same lemma "walk", as is correct in grammar (not some stemmer algorithm's choice).

It scales fine. In fact, I use Lucene with an Instantiated in-memory index to perform the lookups, but one could easily use MySQL or something else.

Darren

> I am curious:
> The idea behind a stemmer is not that he produces the correct infinitive for
> a given word. The idea is that he produces always the same infintive for any
> derivate of the word.
>
> What would be, if there is an unknown word? For example something like
> slang? How does your solution works here? Does it scale?
>
> Thank you for sharing experiences. :)
>
> - Mitch
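A lookup of this kind can be sketched with a plain Python dictionary. The `MORPHOLOGY` table below is a hypothetical two-entry sample, not the dictionary described in the message, and a hash map stands in for the Lucene in-memory index or MySQL table mentioned there. It also shows what happens with the unknown-word case Mitch asked about: the lookup simply has no answer.

```python
# Hypothetical morphology table: lemma -> known inflected forms
MORPHOLOGY = {
    "walk": ["walk", "walks", "walked", "walking"],
    "run": ["run", "runs", "ran", "running"],
}

# Invert once so that each lookup is a single hash probe.
FORM_TO_LEMMA = {
    form: lemma
    for lemma, forms in MORPHOLOGY.items()
    for form in forms
}


def lemma_of(token):
    """Return the lemma for a known form, or None if the word is not in the dictionary."""
    return FORM_TO_LEMMA.get(token.lower())
```

Here `lemma_of("ran")` resolves to "run", which a suffix-stripping stemmer would not manage, while a word missing from the table returns `None` and needs some fallback policy (index as-is, or hand it to an algorithmic stemmer).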
Re: LucidWorks Solr
Yes, you are right, thank you Erick. I've lost this point and thought only of common cases, not of special ones. However, one can combine the mentioned solutions and different stem-filters in different fields, so that one can be quite (not absolutely) sure, that in most of all cases the application works as expected. - Mitch -- View this message in context: http://n3.nabble.com/LucidWorks-Solr-tp727341p730160.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: LucidWorks Solr
This is a little bit of hijacking going on here, but It's algorithmic. That is, there isn't a list of variants that stem to the same infinitive, and your statement "always the same infintive for any derivate of the word" isn't quite what happens. Stemmers will always produce the same infinitive for any given word, just the opposite of what you said. But it is NOT guaranteed that a stemmer will always produce the same infinitive for all derivatives. Rather it just does a pretty darn good job with some anomalies because the rules don't cover all the edge cases. Their *goal* is to do it perfectly, but we all know about unachievable goals... HTH Erick On Mon, Apr 19, 2010 at 12:28 PM, MitchK wrote: > > I am curious: > The idea behind a stemmer is not that he produces the correct infinitive > for > a given word. The idea is that he produces always the same infintive for > any > derivate of the word. > > What would be, if there is an unknown word? For example something like > slang? How does your solution works here? Does it scale? > > Thank you for sharing experiences. :) > > - Mitch > -- > View this message in context: > http://n3.nabble.com/LucidWorks-Solr-tp727341p730059.html > Sent from the Solr - User mailing list archive at Nabble.com. >
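Erick's distinction can be shown with a toy rule-based stemmer in Python. The three suffix rules in `toy_stem` are made up for this example and are not the rules of Porter, Snowball, or KStem. The function is deterministic, so a given word always gets the same key, yet irregular or doubled-consonant forms of the same word end up under different keys.

```python
def toy_stem(token):
    """Hypothetical rule-based stemmer with three suffix rules (illustration only)."""
    for suffix, replacement in (("ies", "y"), ("ing", ""), ("s", "")):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: len(token) - len(suffix)] + replacement
    return token


# Regular forms conflate to one key, as intended.
walk_keys = {toy_stem(w) for w in ("walk", "walks", "walking")}

# Forms of "run" do not: the rules miss the irregular past tense
# and leave the doubled consonant in place.
run_keys = {toy_stem(w) for w in ("run", "ran", "running")}
```

Real stemmers carry many more rules and handle cases like the doubled consonant, but the same kind of gap remains wherever the rules do not reach.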
Re: LucidWorks Solr
I am curious: The idea behind a stemmer is not that he produces the correct infinitive for a given word. The idea is that he produces always the same infintive for any derivate of the word. What would be, if there is an unknown word? For example something like slang? How does your solution works here? Does it scale? Thank you for sharing experiences. :) - Mitch -- View this message in context: http://n3.nabble.com/LucidWorks-Solr-tp727341p730059.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: LucidWorks Solr
There have been some open source ones. I don't have the links handy at this moment[1]. But I parsed through the electronic dictionary and generated a database of each word and its morphologies.

I got tired of lame stemmers that were wrong half the time. Computers are fast enough to do lookups on 150,000 words nowadays; there's no need for fuzzy algorithms here, IMO.

Good luck!

[1] Google will turn up some, I think.

> Thanks for the tip.
>
> Are there any publicly available dictionary of morphologies that I could
> use? Or did you build your own one?
>
> --- On Mon, 4/19/10, Darren Govoni wrote:
>
>> From: Darren Govoni
>> Subject: Re: LucidWorks Solr
>> To: solr-user@lucene.apache.org
>> Date: Monday, April 19, 2010, 7:39 AM
>>
>> Regarding stemmers, I ditched them altogether a long time ago in favor
>> of a dictionary of morphologies of all known words (for any given
>> language). A simple lookup of any word morphology thus produces the set,
>> including the correct stem.
>>
>> Works great. 100% of the time.
>>
>> Just a tip from me.
>>
>> On Mon, 2010-04-19 at 00:36 -0800, MitchK wrote:
>>
>> > Andy, I think it is important to know what a stemmer really is.
>> >
>> > It reduces words to their infinitves. Those infinitives do not refer to the
>> > real infinitive everytime, but however: for the system, it is an infinitive,
>> > since all its derivates could be reduced to the same form.
>> > Thats a stemmer.
>> >
>> > According to this, there can't exist a stemmer for every language, because
>> > every language has got its own rules of how to reduce a word to its
>> > infinitive.
>> >
>> > If you apply a stemmer for english language on a german document, the
>> > results might be unexpected. However, sometimes it still works good enough.
>> >
>> > Keep in mind that this is an algorithm. It is not important whether the
>> > created infinitive is the real infinitive. It is only important that most of
>> > the derivate forms can be reduced to the same basic form. Please ask, if
>> > something is not clear.
>> >
>> > KStem:
>> > The wiki[1] says that KStem is less aggressive as the standard stemmer.
>> > I guess that this means that there are more rules for how to reduce a word
>> > to its infinitive and according to this the results might be better.
>> >
>> > [1] http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters/Kstem
>> >
>> > Kind regards
>> > - Mitch
Re: LucidWorks Solr
Thanks for the tip. Are there any publicly available dictionary of morphologies that I could use? Or did you build your own one? --- On Mon, 4/19/10, Darren Govoni wrote: > From: Darren Govoni > Subject: Re: LucidWorks Solr > To: solr-user@lucene.apache.org > Date: Monday, April 19, 2010, 7:39 AM > Regarding stemmers, I ditched them > altogether a long time ago in favor > of a dictionary of morphologies of all known words (for any > given > language). A simple lookup of any word morphology thus > produces the set, > including the correct stem. > > Works great. 100% of the time. > > Just a tip from me. > > > On Mon, 2010-04-19 at 00:36 -0800, MitchK wrote: > > > Andy, I think it is important to know what a stemmer > really is. > > > > It reduces words to their infinitves. Those > infinitives do not refer to the > > real infinitive everytime, but however: for the > system, it is an infinitive, > > since all its derivates could be reduced to the same > form. > > Thats a stemmer. > > > > According to this, there can't exist a stemmer for > every language, because > > every language has got its own rules of how to reduce > a word to its > > infinitive. > > > > If you apply a stemmer for english language on a > german document, the > > results might be unexpected. However, sometimes it > still works good enough. > > > > Keep in mind that this is an algorithm. It is not > important whether the > > created infinitive is the real infinitive. It is only > important that most of > > the derivate forms can be reduced to the same basic > form. Please ask, if > > something is not clear. > > > > KStem: > > The wiki[1] says that KStem is less aggressive as the > standard stemmer. > > I guess that this means that there are more rules for > how to reduce a word > > to its infinitive and according to this the results > might be better. > > > > > > [1] http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters/Kstem > > > > Kind regards > > - Mitch > > >
Re: LucidWorks Solr
Thanks for the explanation Mitch. You're right. There can't be universal stemmers. What about multi-language stemmers? I'm mostly interested in English, Spanish, German, French, Italian. Are there any stemmers that would handle those languages? If not, what's the recommended way to deal with documents in multiple languages? --- On Mon, 4/19/10, MitchK wrote: > From: MitchK > Subject: Re: LucidWorks Solr > To: solr-user@lucene.apache.org > Date: Monday, April 19, 2010, 4:36 AM > > Andy, I think it is important to know what a stemmer really > is. > > It reduces words to their infinitves. Those infinitives do > not refer to the > real infinitive everytime, but however: for the system, it > is an infinitive, > since all its derivates could be reduced to the same form. > Thats a stemmer. > > According to this, there can't exist a stemmer for every > language, because > every language has got its own rules of how to reduce a > word to its > infinitive. > > If you apply a stemmer for english language on a german > document, the > results might be unexpected. However, sometimes it still > works good enough. > > Keep in mind that this is an algorithm. It is not important > whether the > created infinitive is the real infinitive. It is only > important that most of > the derivate forms can be reduced to the same basic form. > Please ask, if > something is not clear. > > KStem: > The wiki[1] says that KStem is less aggressive as the > standard stemmer. > I guess that this means that there are more rules for how > to reduce a word > to its infinitive and according to this the results might > be better. > > > [1] http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters/Kstem > > Kind regards > - Mitch > -- > View this message in context: > http://n3.nabble.com/LucidWorks-Solr-tp727341p729110.html > Sent from the Solr - User mailing list archive at > Nabble.com. >
Re: LucidWorks Solr
Regarding stemmers, I ditched them altogether a long time ago in favor of a dictionary of morphologies of all known words (for any given language). A simple lookup of any word morphology thus produces the set, including the correct stem. Works great. 100% of the time. Just a tip from me. On Mon, 2010-04-19 at 00:36 -0800, MitchK wrote: > Andy, I think it is important to know what a stemmer really is. > > It reduces words to their infinitves. Those infinitives do not refer to the > real infinitive everytime, but however: for the system, it is an infinitive, > since all its derivates could be reduced to the same form. > Thats a stemmer. > > According to this, there can't exist a stemmer for every language, because > every language has got its own rules of how to reduce a word to its > infinitive. > > If you apply a stemmer for english language on a german document, the > results might be unexpected. However, sometimes it still works good enough. > > Keep in mind that this is an algorithm. It is not important whether the > created infinitive is the real infinitive. It is only important that most of > the derivate forms can be reduced to the same basic form. Please ask, if > something is not clear. > > KStem: > The wiki[1] says that KStem is less aggressive as the standard stemmer. > I guess that this means that there are more rules for how to reduce a word > to its infinitive and according to this the results might be better. > > > [1] http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters/Kstem > > Kind regards > - Mitch
Re: LucidWorks Solr
Andy, I think it is important to know what a stemmer really is.

It reduces words to their infinitives. These are not always the real infinitives, but for the system they serve as one, since all derived forms can be reduced to the same form. That's a stemmer.

Consequently, no single stemmer can cover every language, because every language has its own rules for reducing a word to its infinitive.

If you apply an English stemmer to a German document, the results may be unexpected. Sometimes, however, it still works well enough.

Keep in mind that this is an algorithm. It does not matter whether the generated infinitive is the real infinitive. It only matters that most of the derived forms are reduced to the same base form. Please ask if something is not clear.

KStem:
The wiki[1] says that KStem is less aggressive than the standard stemmer. I guess this means it has more rules for reducing a word to its infinitive, so the results might be better.

[1] http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters/Kstem

Kind regards
- Mitch
--
View this message in context: http://n3.nabble.com/LucidWorks-Solr-tp727341p729110.html
Sent from the Solr - User mailing list archive at Nabble.com.
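The point that a stem only has to be shared, not correct, can be seen in a toy suffix stripper. This is a deliberately naive sketch: the `SUFFIXES` list and the `toy_stem` function are made up for illustration and are much simpler than the Porter or KStem algorithms discussed in this thread.

```python
# Toy suffix-stripping rules, checked in this order (illustration only).
SUFFIXES = ["ing", "ed", "es", "s"]


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered


if __name__ == "__main__":
    # All derived forms collapse to one shared base form.
    for w in ["searching", "searched", "searches", "search"]:
        print(w, "->", toy_stem(w))
    # The base form does not have to be a real word.
    print("running ->", toy_stem("running"))
    # English rules on German words: one happens to work, one does not.
    print("rotes ->", toy_stem("rotes"))
    print("Kleider ->", toy_stem("Kleider"))
```

With these rules "running" becomes "runn", which is not a word but is still usable as long as queries are stemmed the same way. The German examples show both outcomes mentioned above: "rotes" happens to reduce to "rot", while "Kleider" is left untouched and so never meets "Kleid".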
Re: LucidWorks Solr
--- On Sun, 4/18/10, Grant Ingersoll wrote:

> Sure, but I'm biased. ;-) Hopefully, you will find it useful, but choose the one that best fits your needs (and let me know if you need help assessing that.)

Thanks for the explanation, Grant.

What is the advantage of KStem over the standard Solr stemmer? Your website mentions that KStem only works for English. What would happen if some of my documents are in other languages?

What about the standard Solr stemmer -- does it also work on English only? Is there a stemmer that is more or less "universal" and works on multiple languages?
Re: LucidWorks Solr
On Apr 18, 2010, at 3:53 AM, Andy wrote:

> Just wanted to know if anyone has used LucidWorks Solr.
>
> - How do you compare it to the standard Apache Solr?

We take a release of Solr. We wrap it w/ an installer, Tomcat/Jetty, our reference guide, Luke, etc. We also add in an optimized version of KStem. Finally, we apply certain patches that came out after the release and so didn't make it in (we usually delay our release by a few weeks). Many of the things we package simply cannot be in an ASF release b/c of ASF policies; others are there for convenience, so that people don't have to go all over the web to get them.

> - The non-blocking IO of LucidWorks Solr -- is that for networking IO or disk IO? What are its effects?

I think this is a legacy from the 1.3 CD on our website. I believe what this refers to is in Solr 1.4, as it was a patch that was applied to trunk after 1.3 was released. I'll let our web team know to update that.

> - The LucidWorks website also talks about "significantly improved faceting performance" -- what are the improvements, and how large are they?

Same as the previous issue. I'll let our web team know to update that.

> Would you recommend using it?

Sure, but I'm biased. ;-) Hopefully, you will find it useful, but choose the one that best fits your needs (and let me know if you need help assessing that.)

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
Re: LucidWorks Solr
Thanks for asking. I am also interested in reading the responses to your questions.

Paolo

Andy wrote:

> Just wanted to know if anyone has used LucidWorks Solr.
>
> - How do you compare it to the standard Apache Solr?
> - The non-blocking IO of LucidWorks Solr -- is that for networking IO or disk IO? What are its effects?
> - The LucidWorks website also talks about "significantly improved faceting performance" -- what are the improvements, and how large are they?
>
> Would you recommend using it?
>
> Thanks.
LucidWorks Solr
Just wanted to know if anyone has used LucidWorks Solr.

- How do you compare it to the standard Apache Solr?
- The non-blocking IO of LucidWorks Solr -- is that for networking IO or disk IO? What are its effects?
- The LucidWorks website also talks about "significantly improved faceting performance" -- what are the improvements, and how large are they?

Would you recommend using it?

Thanks.
Re: LucidWorks Solr
For my purposes, the Porter analyzer was overly aggressive with stemming, so we moved to KStem. It looks like KStem is no longer being maintained, and Lucid claimed much better performance with theirs, so I gave that a try and it seems to be working fine. I didn't do any benchmarks, though.

And I just took the war in LucidWorks\dist. I think the install instructions also included a script to apply to the included source code. I did that as well, since I look at the source regularly.

I didn't look at LucidGaze or any of the other Lucid features.

-Kevin

From: blargy
To: solr-user@lucene.apache.org
Sent: Tue, March 16, 2010 12:31:09 PM
Subject: Re: LucidWorks Solr

Kevin,

When you say you just included the war, you mean the /packs/solr.war, correct? I see that the KStemmer is nicely packed in there, but I don't see LucidGaze anywhere. Have you had any experience using this?

So I'm guessing you would suggest using the LucidWorks solr.war over the apache-solr-war just because of the various bug-fixes/tests.

As a side question: is there a reason you chose the LucidKStemmer over any other stemmer (KStemmer, Porter, etc.)? I'm unsure which stemmer would work best.

Thanks again!

Kevin Osborn-2 wrote:
>
> I used it mostly for KStemmer, but I also liked the fact that it included about a dozen or so stable patches since Solr 1.4 was released. We just use the included WAR in our project, however. We don't use the installer or anything like that.
>
> From: blargy
> To: solr-user@lucene.apache.org
> Sent: Tue, March 16, 2010 11:52:17 AM
> Subject: LucidWorks Solr
>
> Has anyone used this?:
> http://www.lucidimagination.com/Downloads/LucidWorks-for-Solr
>
> Other than the KStemmer and installer, what are the other "enhancements" that this download offers? Is it worth using over the default Solr installation?
>
> Thanks
Re: LucidWorks Solr
Kevin,

When you say you just included the war, you mean the /packs/solr.war, correct? I see that the KStemmer is nicely packed in there, but I don't see LucidGaze anywhere. Have you had any experience using this?

So I'm guessing you would suggest using the LucidWorks solr.war over the apache-solr-war just because of the various bug-fixes/tests.

As a side question: is there a reason you chose the LucidKStemmer over any other stemmer (KStemmer, Porter, etc.)? I'm unsure which stemmer would work best.

Thanks again!

Kevin Osborn-2 wrote:
>
> I used it mostly for KStemmer, but I also liked the fact that it included about a dozen or so stable patches since Solr 1.4 was released. We just use the included WAR in our project, however. We don't use the installer or anything like that.
>
> From: blargy
> To: solr-user@lucene.apache.org
> Sent: Tue, March 16, 2010 11:52:17 AM
> Subject: LucidWorks Solr
>
> Has anyone used this?:
> http://www.lucidimagination.com/Downloads/LucidWorks-for-Solr
>
> Other than the KStemmer and installer, what are the other "enhancements" that this download offers? Is it worth using over the default Solr installation?
>
> Thanks

--
View this message in context: http://old.nabble.com/LucidWorks-Solr-tp27922870p27923359.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: LucidWorks Solr
I'm trying it out right now. I hope it will work well out of the box for indexing/searching a set of documents with frequent updates.

-aj

On Tue, Mar 16, 2010 at 11:52 AM, blargy wrote:

> Has anyone used this?:
> http://www.lucidimagination.com/Downloads/LucidWorks-for-Solr
>
> Other than the KStemmer and installer, what are the other "enhancements" that this download offers? Is it worth using over the default Solr installation?
>
> Thanks

--
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
650-283-4091
*Building social media monitoring pipeline, and connecting social customers to CRM*
Re: LucidWorks Solr
I used it mostly for KStemmer, but I also liked the fact that it included about a dozen or so stable patches since Solr 1.4 was released. We just use the included WAR in our project, however. We don't use the installer or anything like that.

From: blargy
To: solr-user@lucene.apache.org
Sent: Tue, March 16, 2010 11:52:17 AM
Subject: LucidWorks Solr

Has anyone used this?:
http://www.lucidimagination.com/Downloads/LucidWorks-for-Solr

Other than the KStemmer and installer, what are the other "enhancements" that this download offers? Is it worth using over the default Solr installation?

Thanks

--
View this message in context: http://old.nabble.com/LucidWorks-Solr-tp27922870p27922870.html
Sent from the Solr - User mailing list archive at Nabble.com.
What is the process to build Lucidworks Solr?
I am using LucidWorks Solr v1.4 and I would like to compile in a search component; however, it does not seem to be a very straightforward process. The ant script in the solr directory is the one from the stock Solr installation, and it does not compile out of the box. Has anyone been able to successfully compile LucidWorks Solr?