Re: wildcards, stemming and searching
How would you deal with a query like a*z, though?

I suspect, however, that you only care about suffix queries and stemming those. If that's the case, then you could override getWildcardQuery in a QueryParser subclass and do the stemming internally: remove the trailing wildcard, run the remainder through the analyzer directly there, and return a modified WildcardQuery instance. With wildcard queries, though, this is risky. Prefixes won't necessarily stem to what the full word would stem to.

Erik

On Feb 9, 2005, at 6:26 PM, aaz wrote:
> User queries. somefield = united* After the analyzer hits united*, we get back unit. Hence we cannot detect that the user requested a wildcard. [...] How can I solve this problem?

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
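To make the risk concrete, here is a minimal, self-contained sketch of the trailing-wildcard idea. It does not use Lucene at all: `stem` is a toy suffix stripper standing in for the analyzer (it is not the Porter algorithm), and `WildcardStemDemo` and `matchTrailingWildcard` are hypothetical names chosen for illustration. The sketch strips the trailing `*`, stems the remainder, and prefix-matches against terms that were stemmed at indexing time.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Toy illustration only: stem() is NOT the Porter algorithm, just a
// hypothetical suffix stripper that happens to map "united" to "unit".
class WildcardStemDemo {

    // Strip one common inflectional suffix, keeping at least three characters.
    static String stem(String word) {
        String w = word.toLowerCase();
        for (String suffix : new String[] {"ing", "ed", "es", "s"}) {
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    // Handle a trailing-wildcard query: drop the '*', stem what is left, and
    // return every indexed term that starts with the stemmed prefix.
    static List<String> matchTrailingWildcard(String query, List<String> indexedTerms) {
        String prefix = stem(query.substring(0, query.length() - 1));
        return indexedTerms.stream()
                .filter(term -> term.startsWith(prefix))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Terms as they would be stored in a stemmed index.
        List<String> index = Arrays.asList(stem("united"), stem("unity"), stem("universe"));
        System.out.println(matchTrailingWildcard("united*", index));
        System.out.println(matchTrailingWildcard("unite*", index));
    }
}
```

With these toy rules, united* is stemmed to unit and matches the indexed terms unit and unity, while unite* is left as unite and matches nothing, even though a user typing it would expect to find documents containing united. A real stemmer has different rules, but the same kind of mismatch between a stemmed prefix and a stemmed full word can occur.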
Re: wildcards, stemming and searching
> How would you deal with a query like a*z though?

Yeah, I know; a user submitting that is certainly possible. I have no idea. I am starting to think that NOT stemming at indexing time might be the safest solution.

----- Original Message -----
From: Erik Hatcher [EMAIL PROTECTED]
To: Lucene Users List lucene-user@jakarta.apache.org
Sent: Thursday, February 10, 2005 8:55 AM
Subject: Re: wildcards, stemming and searching

> I suspect, however, that you only care about suffix queries and stemming those. [...] With wildcard queries though, this is risky. Prefixes won't necessarily stem to what the full word would stem to.
wildcards, stemming and searching
Hi,

We are not using QueryParser and have some custom Query construction. We have an index that indexes various documents. Each document is analyzed and indexed via

StandardTokenizer() -> StandardFilter() -> LowerCaseFilter() -> StopFilter() -> PorterStemFilter()

We also want to support wildcard queries, so on an inbound query we need to deal with * in the value side of the comparison. We also need to analyze the value side of the query with the same analyzer that the index was built with. This leads to some problems, and we would like your opinion on a solution.

User query:

somefield = united*

After the analyzer hits united*, we get back unit. Hence we cannot detect that the user requested a wildcard.

Let's say we come up with some solution to escape the * char before the analyzer hits it. For example:

somefield = united* -> unitedXXWILDCARDXX

After analysis this becomes unitedxxwildcardxx, which we can then turn into a WildcardQuery for united*.

The problem here is that the term united will never exist in the index: at indexing time it was stemmed to unit, but our escape mechanism prevents the query term from being stemmed the same way.

How can I solve this problem?
RE: Stemming
Do stemming algorithms take abbreviations into consideration too? Some examples:

mg = milligrams
US = United States
CD = compact disc
vcr = video cassette recorder

And, the next logical question: if stemming does not take care of abbreviations, are there any solutions that handle abbreviations, inside or outside of Lucene?

Thanks,
Kevin

-----Original Message-----
From: Chris Lamprecht [mailto:[EMAIL PROTECTED]
Sent: Friday, January 21, 2005 5:51 PM
To: Lucene Users List
Subject: Re: Stemming

> Also if you can't wait, see page 2 of http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html or the LIA e-book ;) [...]
Re: Stemming
On Jan 24, 2005, at 7:24 AM, Kevin L. Cobb wrote:
> Do stemming algorithms take into consideration abbreviations too?

No, they don't. Adding abbreviations, aliases, synonyms, etc. is not stemming.

> And, the next logical question, if stemming does not take care of abbreviations, are there any solutions that include abbreviations inside or outside of Lucene?

Nothing built into Lucene does this, but the infrastructure allows it to be added in the form of a custom analysis step. There are two basic approaches: adding aliases at indexing time, or adding them at query time by expanding the query. I created some example analyzers in Lucene in Action (grab the source code from the site linked below) that demonstrate how this can be done using WordNet (and mock) synonym lookup. You could extrapolate this into looking up abbreviations and adding them into the token stream.

http://www.lucenebook.com/search?query=synonyms

Erik
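The query-time variant of this idea can be sketched without Lucene as a plain token expansion step. This is only an illustration under stated assumptions: `AbbreviationExpander`, `addAlias` and `expand` are hypothetical names, the alias table is hand-built, and a real implementation would live in a custom analysis step as described above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: expands known abbreviations in a list of query tokens.
class AbbreviationExpander {
    // Lowercased abbreviation -> the words it stands for.
    private final Map<String, String[]> aliases = new HashMap<>();

    void addAlias(String abbreviation, String... expansion) {
        aliases.put(abbreviation.toLowerCase(), expansion);
    }

    // Keep every original token and append its expansion, if one is known.
    List<String> expand(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token);
            String[] extra = aliases.get(token.toLowerCase());
            if (extra != null) {
                for (String word : extra) {
                    out.add(word);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        AbbreviationExpander expander = new AbbreviationExpander();
        expander.addAlias("mg", "milligrams");
        expander.addAlias("CD", "compact", "disc");
        List<String> query = new ArrayList<>();
        query.add("cd");
        query.add("player");
        System.out.println(expander.expand(query));
    }
}
```

Expanding at query time keeps the index small and lets the alias table change without reindexing; adding the aliases at indexing time makes queries cheaper but requires a rebuild whenever the table changes.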
Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!
Morus Walter wrote:
> Owen Densmore writes:
>> 1 - I'm a bit concerned that reasonable stemming (Porter/Snowball) apparently produces non-word stems .. i.e. not really human readable. (Example: generate, generates, generated, generating -> generat) [...] Is there a way to have the stemmer produce english base forms of the words being stemmed?
>
> rule based stemmers such as porter/snowball cannot do that. But there are (commercial) dictionary based tools that can. E.g. the canoo lemmatizer. You might also have a look at egothors stemmer, that are word list based.

Egothor stemmers are algorithmic; they only use word lists for training. The stems they produce are usually closer to lemmas than those of, e.g., Porter's stemmer, but there is still a significant number of stems like the one in the example above.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!
> 1 - I'm a bit concerned that reasonable stemming (Porter/Snowball) apparently produces non-word stems .. i.e. not really human readable.

It is possible to derive the human-readable form of a stemmed term using either re-analysis of the indexed content or a TermPositionVector. Either of these techniques should give you the position data required to discover the original form. The highlighter package is one example of where this technique is used.

Cheers
Mark
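The re-analysis idea can be sketched in plain Java: run the original text through the same stemming step again and remember which surface forms produced each stem, then pick one of them as the display label. This is a simplified illustration, not the highlighter's implementation. `StemLabels` and `labelsFor` are hypothetical names, `stem` is a toy suffix stripper rather than Porter, and choosing the most frequent form is just one possible policy.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: map each stem back to a readable word seen in the text.
class StemLabels {

    // Toy stemmer (NOT Porter): strips one suffix, keeping at least 3 characters.
    static String stem(String word) {
        for (String suffix : new String[] {"ing", "ed", "es", "s", "e"}) {
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 3) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Re-analyse the text and, for each stem, keep the most frequent surface
    // form. Ties go to the shorter form, then to the alphabetically first one.
    static Map<String, String> labelsFor(String text) {
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty()) {
                continue;
            }
            counts.computeIfAbsent(stem(token), k -> new HashMap<>())
                  .merge(token, 1, Integer::sum);
        }
        Map<String, String> labels = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> entry : counts.entrySet()) {
            String best = null;
            int bestCount = 0;
            for (Map.Entry<String, Integer> form : entry.getValue().entrySet()) {
                String word = form.getKey();
                int count = form.getValue();
                boolean better = best == null
                        || count > bestCount
                        || (count == bestCount && (word.length() < best.length()
                            || (word.length() == best.length() && word.compareTo(best) < 0)));
                if (better) {
                    best = word;
                    bestCount = count;
                }
            }
            labels.put(entry.getKey(), best);
        }
        return labels;
    }

    public static void main(String[] args) {
        String text = "The plant generates power, generated heat, and generates steam.";
        System.out.println(labelsFor(text).get("generat"));
    }
}
```

A navigation interface could then show the chosen label in place of the raw stem while still searching on the stem itself.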
Stemming
I want to understand how Lucene uses stemming but can't find any documentation on the Lucene site. I'll continue to google but hope that this list can help narrow my search. I have several questions on the subject currently but hesitate to list them here since finding a good document on the subject may answer most of them. Thanks in advance for any pointers, Kevin
Re: Stemming
Hi Kevin,

Stemming is an optional operation and is done in the analysis step. Lucene comes with a Porter stemmer and a Filter that you can use in an Analyzer:

./src/java/org/apache/lucene/analysis/PorterStemFilter.java
./src/java/org/apache/lucene/analysis/PorterStemmer.java

You can find more about it here: http://www.lucenebook.com/search?query=stemming

You can also see mentions of SnowballAnalyzer in those search results, and you can find an adapter for SnowballAnalyzers in Lucene Sandbox.

Otis

--- Kevin L. Cobb [EMAIL PROTECTED] wrote:
> I want to understand how Lucene uses stemming but can't find any documentation on the Lucene site. [...]
RE: Stemming
OK, OK ... I'll buy the book. I guess it's about time, since I am deeply and forever in love with Lucene. Might as well take the final plunge.

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Friday, January 21, 2005 9:12 AM
To: Lucene Users List
Subject: Re: Stemming

> Stemming is an optional operation and is done in the analysis step. [...]
Re: Stemming
Also, if you can't wait, see page 2 of http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html or the LIA e-book ;)

On Fri, 21 Jan 2005 09:27:42 -0500, Kevin L. Cobb [EMAIL PROTECTED] wrote:
> OK, OK ... I'll buy the book. [...]
Newbie: Human Readable Stemming, Lucene Architecture, etc!
Hi .. I'm new to the list so forgive a dumb question or two as I get started.

We're in the midst of converting a small collection (1200-1500 currently) of scientific literature to be easily searchable/navigable. We'll likely provide both a text query interface as well as a graphical way to search and discover.

Our initial approach will be vector based, looking at Latent Semantic Indexing (LSI) as a potential tool, although if that's not needed, we'll stop at reasonably simple stemming with a weighted document term matrix (DTM). (Bear in mind I couldn't even pronounce most of these concepts last week, so go easy if I'm incoherent!)

It looks to me that Lucene has a quite well factored architecture. I should at the very least be able to use the analyzer and stemmer to create a good starting point in the project. I'd also like to leave a nice architecture behind in case we or others end up experimenting with, or extending, the system.

So a couple of questions:

1 - I'm a bit concerned that reasonable stemming (Porter/Snowball) apparently produces non-word stems .. i.e. not really human readable. (Example: generate, generates, generated, generating -> generat) Although in typical queries this is not important because the result of the search is a document list, it *would* be important if we use the stems within a graphical navigation interface. So the question is: Is there a way to have the stemmer produce English base forms of the words being stemmed?

2 - We're probably using Lucene in ways it was not designed for, such as DTM/LSI and graphical clustering and navigation. Naturally we'll provide code for these parts that are not in Lucene. But the question arises: is this kinda dumb?! Has anyone stretched Lucene's design center with positive results? Are we barking up the wrong tree?

3 - A nit on hyphenation: Our collection is scientific so has many hyphenated words. I'm wondering about your experiences with hyphenation. In our collection, things like self-organization, power-law, space-time, small-world, agent-based, etc. occur often, for example. So the question is: Do folks break up hyphenated words? If not, do you stem the parts and glue them back together? Do you apply stoplists to the parts?

Thanks for any help and pointers you can fling along,

Owen
http://backspaces.net/
http://redfish.com/
Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!
Hi,

One thing to point out: I think Lucene is not using LSI as the underlying retrieval model. It uses the vector space model and also proximity-based retrieval.

Personally, I don't know much about LSI, and I don't think fancy techniques like LSI are workable in industry. I believe we are far away from the era of artificial intelligence and from using any elusive way to do information retrieval.

Cheers,
Jian

On Thu, 20 Jan 2005 14:50:10 -0700, Owen Densmore [EMAIL PROTECTED] wrote:
> Our initial approach will be vector based, looking at Latent Semantic Indexing (LSI) as a potential tool, although if that's not needed, we'll stop at reasonably simple stemming with a weighted document term matrix (DTM). [...]
RE: Newbie: Human Readable Stemming, Lucene Architecture, etc!
Like any other field, A.I. is only elusive until you master it. There are plenty of companies using A.I. techniques in various IR applications successfully. LSI in particular has been around a long time and is well understood.

Chuck

-----Original Message-----
From: jian chen [mailto:[EMAIL PROTECTED]
Sent: Thursday, January 20, 2005 2:10 PM
To: Lucene Users List
Subject: Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!

> Personally, I don't know much about LSI and I don't think the fancy stuff like LSI is workable in industry. I believe we are far away from the era of artificial intelligence and using any elusive way to do information retrieval. [...]
Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!
Owen Densmore writes:
> 1 - I'm a bit concerned that reasonable stemming (Porter/Snowball) apparently produces non-word stems .. i.e. not really human readable. (Example: generate, generates, generated, generating -> generat) Although in typical queries this is not important because the result of the search is a document list, it *would* be important if we use the stems within a graphical navigation interface. So the question is: Is there a way to have the stemmer produce english base forms of the words being stemmed?

Rule-based stemmers such as Porter/Snowball cannot do that. But there are (commercial) dictionary-based tools that can, e.g. the Canoo lemmatizer. You might also have a look at the Egothor stemmers, which are word-list based.

HTH
Morus
Query based stemming
Hi,

I'm new to Lucene, so I apologize if this issue has been discussed before (I'm sure it has), but I had a hard time finding an answer using Google. (Maybe this would be a good candidate for the FAQ!) :)

Is it possible to enable stemming on a per-query basis? It doesn't seem to be possible, since the stemming is done during the indexing process. Are people basically stuck with having all their queries stemmed or none at all?

Thanks!
Peter
Re: Query based stemming
From what I've read, if you want to have a choice, the easiest way is to index the documents twice: once with stemming on and once with it off, placing the results in two different indexes. Then, at query time, select which index you want to use based on whether you want stemming on or off.

Jim.

Peter Kim wrote:
> Is it possible to enable stem queries on a per-query basis? [...]
Re: Query based stemming
: Is it possible to enable stem queries on a per-query basis? It doesn't
: seem to be possible since the stem tokenizing is done during the
: indexing process. Are people basically stuck with having all their
: queries stemmed or none at all?

: From what I've read, if you want to have a choice, the easiest way is
: to index the documents twice. Once with stemming on and once with it off
: placing the results in two different indexes. Then at query time,
: select which index you want to use based on whether you want stemming on
: or off.

As I understand it, the intended place to implement stemming is in an Analyzer Filter (not to be confused with a search Filter). Since you can specify an Analyzer when you call addDocument, you don't actually have to have two separate indexes; you could just have all the docs in one index and use a search Filter to indicate which docs to look at.

Alternately: the Analyzer's tokenStream method is given the fieldName being analyzed, so you could write an Analyzer with a set of rules telling it to apply your stemming filter only to certain fields. Then, instead of having twice as many documents, you can just index your text in two separate fields (which should be a little easier than separate docs, because you are only duplicating the fields where stemming is relevant). At search time you don't have to filter anything; just search the field that's applicable to your current desire (stemmed or unstemmed).

Lastly: although it's tricky to get correct, there's no law saying you have to use the same Analyzer when you query as when you index. You could index your documents using an Analyzer that does no stemming, and then at search time (if you want stemming) use an Analyzer that does reverse stemming to expand your query terms out to all the possible variants. (NOTE: I've never actually tried this, but I think the theory is sound.)

-Hoss
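The two-field approach can be sketched with a toy in-memory index. Nothing here is Lucene API: `DualFieldIndex`, the field names body and body_stemmed, and the suffix-stripping `stem` method are all assumptions made for illustration. The point is only that analysis is chosen by field name, so one index serves both stemmed and unstemmed searches.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index with one unstemmed and one stemmed copy of each text.
class DualFieldIndex {
    // field name -> term -> ids of the documents containing that term
    private final Map<String, Map<String, Set<Integer>>> postings = new HashMap<>();

    // Toy stemmer (NOT Porter): strips one suffix, keeping at least 3 characters.
    static String stem(String word) {
        for (String suffix : new String[] {"ing", "ed", "es", "s"}) {
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 3) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Per-field analysis: only the "_stemmed" field is stemmed.
    static String analyze(String field, String token) {
        String lower = token.toLowerCase();
        return field.endsWith("_stemmed") ? stem(lower) : lower;
    }

    // Index the same text into both fields.
    void add(int docId, String text) {
        for (String field : new String[] {"body", "body_stemmed"}) {
            for (String token : text.split("\\W+")) {
                if (token.isEmpty()) {
                    continue;
                }
                postings.computeIfAbsent(field, f -> new HashMap<>())
                        .computeIfAbsent(analyze(field, token), t -> new TreeSet<>())
                        .add(docId);
            }
        }
    }

    // Pick the field per query and analyze the query word the same way.
    Set<Integer> search(String word, boolean stemmed) {
        String field = stemmed ? "body_stemmed" : "body";
        Map<String, Set<Integer>> terms = postings.get(field);
        if (terms == null) {
            return Collections.emptySet();
        }
        Set<Integer> hits = terms.get(analyze(field, word));
        return hits == null ? Collections.<Integer>emptySet() : hits;
    }

    public static void main(String[] args) {
        DualFieldIndex index = new DualFieldIndex();
        index.add(1, "United Nations");
        index.add(2, "a unit of measure");
        System.out.println(index.search("unit", true));
        System.out.println(index.search("unit", false));
    }
}
```

The cost of this design is a larger index, since every stemmed field is stored twice, in exchange for choosing stemming per query without maintaining two indexes.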
about Stemming
Hi,

I have used the Lucene demos, and I want to know whether it is possible to add stemming to my applications.

--
Miguel Angel Angeles R.
Asesoria en Conectividad y Servidores
Telf. 97451277
Re: about Stemming
Miguel Angel wrote:
> Hi, I have used the DEMOS of lucene and I want to know as it is possible to be added Stemming for my applications.

Have a look at the lucene-sandbox. Under contributions there are stemmers for many different languages.
Re: Stemming Oddness
Hi Yousef

You are not doing anything wrong; it's just how the Porter stemmer works! The problem with Porter is that it tries to do everything in a purely algorithmic way, which doesn't cater for irregular conjugations etc. Don't worry too much, though: as long as you apply the same stemming to the query string as you did while indexing, you should be able to find what you are looking for, although you can have some issues with trailing wildcards.

If you want a better stemmer, look for something that has a dictionary as well as algorithmic rules. A quick one that is readily available is Kstem which, while not perfect, I think is quite a bit better than Porter. You can get the source code (Kstem.jar) from the following website: http://ciir.cs.umass.edu/downloads/

For more info on Kstem, see the paper by its designer Bob Krovetz at: http://ciir.cs.umass.edu/pubfiles/ir-35.pdf

Cheers
Pete

----- Original Message -----
From: Yousef Ourabi [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Saturday, November 06, 2004 1:13 AM
Subject: Stemming Oddness

> Hey,
> Thanks for everyone's reply to my last post. I have some questions. I imported the PorterStemmer, and when I did the following:
>
>     PorterStemmer ps = new PorterStemmer();
>     String r1 = ps.stem("elephant");
>
> r1 is 'eleph'. Also, buying stems to bui. Is this normal? Am I doing something wrong? I am calling reset in between function calls.
>
> Thanks,
> Yousef
Stemming options
Has anyone on the list implemented a dictionary-based English stemmer with Lucene? Perhaps based on the freely available ispell dictionaries or something like that? The Porter and Snowball stemmers have not worked that well for our application, but it is a bit daunting to start from scratch in developing an alternate stemmer.

Alternatively, is there an algorithmic stemmer that anyone has used which is a little less aggressive than the Porter algorithm? We've been having problems with searches for "conversion" returning "converse" and "conversational", and "animal" returning "animate". Yes, these are morphologically related, but in our particular application it would be better to stick with removing simple inflections.

Thanks for any pointers
-- Boris
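As a rough illustration of what "removing simple inflections" could look like, here is a sketch of a plural-only stemmer. It is not part of Lucene and is not taken from this thread; the class name `PluralStemmer` and the exact rule set are assumptions, loosely in the spirit of the simple "S-stemmers" described in the IR literature. It only touches plural endings, so derivationally related words are left apart.

```java
// Hypothetical plural-only stemmer; the rules are an illustrative assumption.
class PluralStemmer {

    // Apply the first matching rule; words of three letters or fewer are kept.
    static String stem(String word) {
        String w = word.toLowerCase();
        int n = w.length();
        if (n <= 3) {
            return w;
        }
        // "studies" -> "study" (but leave "...eies" / "...aies" to later rules)
        if (w.endsWith("ies") && !w.endsWith("eies") && !w.endsWith("aies")) {
            return w.substring(0, n - 3) + "y";
        }
        // "...es" -> "...e", except after a, e or o
        if (w.endsWith("es") && !w.endsWith("aes") && !w.endsWith("ees") && !w.endsWith("oes")) {
            return w.substring(0, n - 2) + "e";
        }
        // plain plural "s", but not "...us" or "...ss"
        if (w.endsWith("s") && !w.endsWith("us") && !w.endsWith("ss")) {
            return w.substring(0, n - 1);
        }
        return w;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"conversions", "converse", "animals", "animate"}) {
            System.out.println(word + " -> " + stem(word));
        }
    }
}
```

Under these rules conversion and converse stay distinct, as do animal and animate, at the cost of not conflating verb forms such as converting and converted.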
Problem with tokenizing/stemming in GermanAnalyzer
Hi,

my application uses a GermanAnalyzer for tokenizing a search string and constructing Query classes:

    Analyzer an = new org.apache.lucene.analysis.de.GermanAnalyzer();
    TokenStream ts = an.tokenStream(fieldName, new StringReader(fieldText));

I have noticed a strange problem with capitalization. A search for "computer" results in the token "compu". A search for "Computer", however, results in "comput". The search is supposed to be case-insensitive, so this must be a bug, right?

Volker
Re: Problem with tokenizing/stemming in GermanAnalyzer
For now you could check out the current Lucene version from CVS and just comment out the following line (in GermanStemmer.java, of course ;)):

    uppercase = Character.isUpperCase( term.charAt( 0 ) );

Regards
Christoph
stemming feature
Hi all

Does Lucene do stemming of words? If yes, can anyone say how to do it in Java using the Lucene API?

Thanks
rgds
srinivas
stemming feature
Hi all

Can anyone tell me where I can get process flow diagrams, or something of that kind, for Lucene? I want to know how Lucene works.

Thanks
rgds
srinivas
Re: stemming feature
Check out the PorterStemFilter class and the Analyzer class. Then look at some Analyzer implementations and see how to implement your own PorterAnalyzer.

Otis

--- M Srinivas Rao [EMAIL PROTECTED] wrote:
> Hi all Does the lucene will do stemming of a word? If yes can anyone say how to do it in java using lucene api.
Stemming
In our search application the user can turn stemming off and on. With Lucene, will I have to maintain two sets of indexes to create this functionality, one stemming and one non-stemming? Or is there a way to query a stemming index so that it does not return stems?

Thanks,
Joel
Re: Stemming
You could have a single index with both stemmed and non-stemmed terms, using different field names for each and searching a different set of fields depending on the type of search. You'd also have to use 2 types of analyzers/filters, I think. Roughly :)

Otis

--- Joel Bernstein [EMAIL PROTECTED] wrote:
> In our search application the user can turn stemming off and on. With Lucene will I have to maintain two sets of indexes to create this functionality, one stemming and one non-stemming index? Or Is there a way to query a stemming index so that it does not return stems?