Out of curiousity and perhaps for practical purposes, how does one handle mixed language documents? I suppose one could extract the words of a particular language and place it in a lang specific field? Are there libraries to perform this (yet)?
On Thu, Oct 8, 2009 at 6:32 AM, Christian Reuschling <christian.reuschl...@gmail.com> wrote: > Hi, > > looking up the different terms with a common stem can be useful in different > scenarios - so I don't want to judge it whether someone needs it or not. > > E.g., in the case you have multilingual documents in your index, it is > straight > forward to determine the language of the documents in order to choose the > right > stemmer. At least this is right for document with homogenous language. > > Althought this is true at indexing time, the language classification for the > user query is not such trivial - and you have to do this in order to stem the > query terms for searching. One possibility would be to search for the stems > given from all stemmers - but in this case you will receive many wrong > searching terms, thus much noise in the result lists. > > Another possibility can be to offer all 'potential synonyms' of the query > terms > to the user - where he can choose whether these are right or not. In this case > you need exactly the lookup 'queryTerm->stem->terms with same stem'. This can > be much more precise, the lacks are of course the interaction needed by the > user and longer queries. > > To realize this, someone could write a specific Analyzer that stores this > relationship additionally e.g. into a database. I personaly don't know any > possibility to read this directly out of the Lucene index. > > > In the case someone has best practices or an idea how processing multilingual > indices can be done better, I would be appreciated to read / hear about this. > > > > all best > > Chris > > > On Tue, 6 Oct 2009 16:31:36 +0900 > David Leangen <apa...@leangen.net> wrote: > >> >> Hello, >> >> I've been using Lucene in a very basic way for some time now, and I'm >> starting to take advantage of some of the linguistic capabilities only >> now. >> >> I am making use of the snowball analyzer for stemming, and it works >> very well. >> >> >> Question: is there any such thing as a "reverse stemmer"? In other >> words, given the stem of a word, is there any algorithm to find the >> original word? Or is this just fantasy? ;-) >> >> Now, I understand that there is a 1:n mapping of stems:words. I can >> deal with that. >> >> >> Thanks! >> =David >> >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> > > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org