Out of curiosity and perhaps for practical purposes, how does one
handle mixed-language documents?  I suppose one could extract the
words of a particular language and place them in a language-specific
field?  Are there libraries to perform this (yet)?
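
As a rough sketch of that idea, assuming the languages in a document use
different scripts, words can be routed into per-script fields using nothing
but the Unicode character names. The field names (body_latin, body_cyrillic)
and both helper functions are made up for illustration. This will not
separate languages that share a script (English and German, say); that
would need a real language identifier.

```python
import re
import unicodedata
from collections import defaultdict


def script_of(word):
    """Return a coarse script label taken from the Unicode name of the first letter."""
    for ch in word:
        if ch.isalpha():
            # Unicode names look like "LATIN SMALL LETTER A" or
            # "CYRILLIC SMALL LETTER PE"; the first word names the script.
            return unicodedata.name(ch, "UNKNOWN").split()[0].lower()
    return "unknown"  # e.g. tokens made only of digits


def split_by_script(text):
    """Group the words of `text` into hypothetical per-script field names."""
    fields = defaultdict(list)
    for word in re.findall(r"\w+", text):
        fields["body_" + script_of(word)].append(word)
    return dict(fields)


if __name__ == "__main__":
    for field, words in split_by_script("Lucene поиск search индекс").items():
        print(field, words)
```

Each resulting field could then be indexed with the analyzer and stemmer
that suit it.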

On Thu, Oct 8, 2009 at 6:32 AM, Christian Reuschling
<christian.reuschl...@gmail.com> wrote:
> Hi,
>
> looking up the different terms with a common stem can be useful in different
> scenarios - so I don't want to judge whether someone needs it or not.
>
> E.g., if you have multilingual documents in your index, it is
> straightforward to determine the language of each document in order to
> choose the right stemmer. At least this holds for documents written in a
> single language.
>
> Although this works at indexing time, classifying the language of the user
> query is not so trivial - and you have to do this in order to stem the
> query terms for searching. One possibility would be to search for the stems
> produced by all stemmers - but in this case you will get many wrong
> search terms, and thus much noise in the result lists.
>
> Another possibility is to offer all 'potential synonyms' of the query terms
> to the user, who can then choose whether they are right or not. In this case
> you need exactly the lookup 'queryTerm->stem->terms with same stem'. This can
> be much more precise; the drawbacks are of course the interaction required
> from the user and longer queries.
>
> To realize this, someone could write a specific Analyzer that additionally
> stores this relationship, e.g. in a database. I personally don't know of any
> way to read this directly out of the Lucene index.
>
>
> If someone has best practices or an idea of how multilingual indices can
> be processed better, I would appreciate hearing about it.
>
>
>
> all best
>
> Chris
>
>
> On Tue, 6 Oct 2009 16:31:36 +0900
> David Leangen <apa...@leangen.net> wrote:
>
>>
>> Hello,
>>
>> I've been using Lucene in a very basic way for some time now, and I'm
>> starting to take advantage of some of the linguistic capabilities only
>> now.
>>
>> I am making use of the snowball analyzer for stemming, and it works
>> very well.
>>
>>
>> Question: is there any such thing as a "reverse stemmer"? In other
>> words, given the stem of a word, is there any algorithm to find the
>> original word? Or is this just fantasy? ;-)
>>
>> Now, I understand that there is a 1:n mapping of stems:words. I can
>> deal with that.
>>
>>
>> Thanks!
>> =David
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>
>
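
For what it's worth, the 'queryTerm->stem->terms with same stem' lookup
from the quoted message could be sketched roughly as follows. This is only
an illustration under stated assumptions: toy_stem is a hypothetical
stand-in for a real stemmer such as Snowball, StemIndex is a made-up
in-memory class rather than anything Lucene provides, and a real setup
would fill the mapping from inside the analysis chain at indexing time.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few English suffixes."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


class StemIndex:
    """In-memory 1:n mapping from a stem to the surface terms that produced it."""

    def __init__(self, stem=toy_stem):
        self._stem = stem
        self._terms_by_stem = defaultdict(set)

    def add(self, term):
        """Record a term seen at indexing time under its stem."""
        self._terms_by_stem[self._stem(term)].add(term.lower())

    def terms_for_stem(self, stem):
        """The 'reverse stemmer': all recorded terms sharing this stem."""
        return sorted(self._terms_by_stem.get(stem, ()))

    def variants(self, query_term):
        """queryTerm -> stem -> terms with the same stem."""
        return self.terms_for_stem(self._stem(query_term))


if __name__ == "__main__":
    index = StemIndex()
    for term in ["search", "searching", "searched", "searches", "index"]:
        index.add(term)
    print(index.variants("searches"))
```

The variants returned for a query term are the 'potential synonyms' that
could be offered to the user for confirmation.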
