Re: Reverse stemmer?

Karl Wettin Thu, 08 Oct 2009 20:07:56 -0700

For the case where the text contains mixed languages there are solutions that simultaneously use morphological rules of two or more languages. Coveo search does this but I don't know what their solution looks like. I suppose one way to do it would be to stem all tokens with all algorithms and add the results as synonyms, both at index and query time. Sounds a bit expensive though. If you have enough resources to spend at index time you could probably end up with a more optimized index by using dictionaries to back per-word language identification and to power brute-force stemming (way slow) that's compatible with algorithmic query-time stemmers (way speedier). I'm just guessing though.
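
A rough sketch of the "stem every token with every algorithm and keep the results as synonyms" idea, in Python for brevity. The two suffix-stripping functions are toy stand-ins invented for this example (they are not Snowball or Lucene code), and the names STEMMERS and expand_token are hypothetical.

```python
from typing import Callable, Dict, Set


def stem_english(token: str) -> str:
    """Toy English stemmer: strips a few common suffixes (illustration only)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def stem_german(token: str) -> str:
    """Toy German stemmer: strips a few common suffixes (illustration only)."""
    for suffix in ("ern", "en", "er", "e"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


# Assumption: one stemmer per language we expect to see in the text.
STEMMERS: Dict[str, Callable[[str], str]] = {
    "en": stem_english,
    "de": stem_german,
}


def expand_token(token: str) -> Set[str]:
    """Return the token plus the stem produced by every stemmer.

    Applying the same expansion at index time and at query time means two
    word forms match whenever any one stemmer maps them to the same stem.
    """
    token = token.lower()
    return {token} | {stem(token) for stem in STEMMERS.values()}


if __name__ == "__main__":
    for word in ("walking", "walked", "Kinder"):
        print(word, "->", sorted(expand_token(word)))
```

The cost shows up directly: every token is stemmed once per language, and the index holds up to one extra term per stemmer for each token.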


      karl

8 okt 2009 kl. 21.20 skrev Jason Rutherglen:

Out of curiosity and perhaps for practical purposes, how does one
handle mixed-language documents?  I suppose one could extract the
words of a particular language and place them in a language-specific field?
Are there libraries to perform this (yet)?

On Thu, Oct 8, 2009 at 6:32 AM, Christian Reuschling
<christian.reuschl...@gmail.com> wrote:
Hi,
looking up the different terms with a common stem can be useful in different scenarios, so I don't want to judge whether someone needs it or not.
E.g., if you have multilingual documents in your index, it is straightforward to determine the language of each document in order to choose the right stemmer. At least this holds for documents with a homogeneous language.
Although this is true at indexing time, language classification for the user query is not so trivial, and you have to do this in order to stem the query terms for searching. One possibility would be to search for the stems given by all stemmers, but in this case you will get many wrong search terms and thus much noise in the result lists.
Another possibility is to offer all 'potential synonyms' of the query terms to the user, who can choose whether these are right or not. In this case you need exactly the lookup 'queryTerm->stem->terms with same stem'. This can be much more precise; the drawbacks are of course the interaction needed from the user and longer queries.
To realize this, someone could write a specific Analyzer that additionally stores this relationship, e.g. in a database. I personally don't know of any way to read this directly out of the Lucene index.
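
A minimal sketch of the 'queryTerm->stem->terms with same stem' lookup, in Python for brevity. It assumes the stem-to-terms relationship is recorded while the text is analyzed; an in-memory dict stands in for the database, toy_stem is a made-up suffix stripper rather than a real stemmer, and StemIndex is a hypothetical name, not a Lucene class.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List, Set


def toy_stem(token: str) -> str:
    """Toy stemmer: strips a few English suffixes (illustration only)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


class StemIndex:
    """Records, at analysis time, which surface terms share each stem."""

    def __init__(self, stemmer: Callable[[str], str]) -> None:
        self._stemmer = stemmer
        # stem -> set of original terms seen with that stem (the 1:n mapping)
        self._terms_by_stem: DefaultDict[str, Set[str]] = defaultdict(set)

    def add_text(self, text: str) -> None:
        """Tokenize on whitespace and remember each term under its stem."""
        for token in text.lower().split():
            self._terms_by_stem[self._stemmer(token)].add(token)

    def variants(self, query_term: str) -> List[str]:
        """Return all indexed terms that share the query term's stem."""
        stem = self._stemmer(query_term.lower())
        return sorted(self._terms_by_stem.get(stem, set()))


if __name__ == "__main__":
    index = StemIndex(toy_stem)
    index.add_text("walking walked walks talk")
    # Candidate 'potential synonyms' to offer the user for the query "walk".
    print(index.variants("walk"))
```

The list returned by variants() is what would be shown to the user for confirmation before the query is expanded.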
If someone has best practices or an idea of how processing multilingual indices can be done better, I would appreciate reading / hearing about it.
all best

Chris


On Tue, 6 Oct 2009 16:31:36 +0900
David Leangen <apa...@leangen.net> wrote:
Hello,
I've been using Lucene in a very basic way for some time now, and I'm
starting to take advantage of some of the linguistic capabilities only
now.

I am making use of the snowball analyzer for stemming, and it works
very well.


Question: is there any such thing as a "reverse stemmer"? In other
words, given the stem of a word, is there any algorithm to find the
original word? Or is this just fantasy? ;-)

Now, I understand that there is a 1:n mapping of stems:words. I can
deal with that.


Thanks!
=David



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
