You also need to decide how the query language gets determined: auto-detect it at query time, offer a language selection in the UI, or run the same query for each available language and then "determine" which has the best "relevancy". The latter two options are very sensitive to short queries. Keep in mind that auto-detection for indexing full documents is a different problem than auto-detection for very short queries.

-- Jack Krupansky

-----Original Message-----
From: Ilia Sretenskii
Sent: Sunday, September 7, 2014 10:33 PM
To: solr-user@lucene.apache.org
Subject: Re: How to implement multilingual word components fields schema?

Thank you for the replies, guys!

Using a field-per-language approach for multilingual content is the last
thing I would try, since my actual task is to implement search
functionality that provides roughly the same capabilities for every known
world language.
The closest references are the popular web search engines, which seem to
serve worldwide users in their different languages and even handle
cross-language queries as well.
Thus, a field-per-language approach would be a sure waste of storage
resources due to the high number of duplicates, since there are over 200
known languages.
I really would like to keep a single field for cross-language searchable
text content, without splitting it into specific language fields or
specific language cores.

So my current choice will be to stay with just the ICUTokenizer and
ICUFoldingFilter as they are, without any language-specific
stemmers/lemmatizers for now.
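
For reference, a minimal sketch of such a language-agnostic field type in
schema.xml might look like the following. The names text_icu and content
are made up for illustration, and the ICU factories assume the
analysis-extras contrib jars are on the classpath.

```xml
<!-- Hypothetical field type: ICU tokenization and folding only,
     with no language-specific stop words or stemming. -->
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Splits text on Unicode word boundaries, script-aware -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Case folding, accent removal, and related Unicode normalization -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical single searchable field shared by all languages -->
<field name="content" type="text_icu" indexed="true" stored="true"/>
```

Since no separate index and query analyzers are declared, the same chain
would apply at both index time and query time.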

I will probably put stop word filters and stemmers for the most popular
languages into the same searchable text field to give it a try and see
if they work correctly as a stack.
Does stacking language-specific filters work correctly in one field?
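
To make the question concrete, a stacked chain could be sketched roughly as
below. The field type name and the two-language selection are only
illustrative, and the stop word file paths are assumptions that follow the
layout of the Solr example configuration. Note that every filter in an
analyzer chain sees every token, so each stemmer here would also run on
words from the other language.

```xml
<!-- Hypothetical stacked field type, for experimentation only. -->
<fieldType name="text_icu_stacked" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Snowball stemmers expect lowercase input -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Stop word lists are applied to all tokens, whatever their language;
         the file paths below are assumed, not taken from the thread -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fr.txt" format="snowball"/>
    <!-- Each stemmer also processes tokens from the other language -->
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French"/>
    <!-- Folding last, so the stemmers still see accented forms -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

The Analysis screen in the Solr admin UI shows the token stream after each
stage, which is one way to check whether such a stack behaves acceptably on
sample text.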

Further development will most likely involve some advanced custom analyzers
like the "SimplePolyGlotStemmingTokenFilter" to utilize the ICU generated
ScriptAttribute.
http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/100236
https://github.com/whateverdood/cross-lingual-search/blob/master/src/main/java/org/apache/lucene/sandbox/analysis/polyglot/SimplePolyGlotStemmingTokenFilter.java

So I would like to know more about those "academic papers on this issue of
how best to deal with mixed language/mixed script queries and documents".
Tom, could you please share them?
