Ilia,

one aspect you surely lose with a single-field approach is the differentiation 
of semantic fields across languages for words that are written the same.
The words "sitting" and "directions" are easy examples that have fully 
different semantics in French and English, at least.
"directions" would co-occur with, say, teacher advice in English but not 
in French.
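
If you want to see the effect, here is a minimal Lucene sketch (class and 
field names are mine; on older 4.x versions the analyzers take a Version 
argument in their constructors):

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StemDemo {
    // Prints the terms an analyzer produces for the given text.
    static void printTerms(Analyzer analyzer, String text) throws IOException {
        try (TokenStream ts = analyzer.tokenStream("f", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.print(term + " ");
            }
            ts.end();
            System.out.println();
        }
    }

    public static void main(String[] args) throws IOException {
        // The same surface form is reduced differently by each stemmer,
        // so per-language fields keep the two senses apart.
        printTerms(new EnglishAnalyzer(), "directions");
        printTerms(new FrenchAnalyzer(), "directions");
    }
}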

I disagree that storage should be an issue in your case… most Solr 
installations do not suffer from that, as far as I can tell from the list. 
Generally, you do not need all these stemmed fields to be stored; they are 
just indexed, and that is a pretty tiny amount of storage.
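
In Solr's schema.xml that is just indexed="true" stored="false" on the 
per-language fields. At the Lucene level, a minimal sketch (the field names 
are made up):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;

public class IndexOnlyFields {
    // Only the original text is stored for retrieval; the per-language
    // copies are index-only, so they add postings but no stored data.
    static Document makeDoc(String content) {
        Document doc = new Document();
        doc.add(new TextField("text", content, Field.Store.YES));   // stored once
        doc.add(new TextField("text_en", content, Field.Store.NO)); // indexed only
        doc.add(new TextField("text_fr", content, Field.Store.NO)); // indexed only
        return doc;
    }
}

In practice you would attach a different analyzer to each of these fields, 
e.g. with PerFieldAnalyzerWrapper, which is what Solr's field types do for 
you.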

Using separate fields also has advantages in terms of IDF, I think: Lucene 
keeps document frequencies per (field, term) pair, so each language gets its 
own statistics instead of a blend.
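
A quick sketch of how you could check that (field names are again made up, 
and you would look up the analyzed form of the term you care about):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public class PerFieldStats {
    // docFreq is counted per field, so the same token indexed into
    // "text_en" and "text_fr" has two independent document frequencies,
    // and therefore two independent IDF values at query time.
    static void showDocFreqs(IndexReader reader) throws IOException {
        System.out.println("en: " + reader.docFreq(new Term("text_en", "direction")));
        System.out.println("fr: " + reader.docFreq(new Term("text_fr", "direction")));
    }
}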

I do not understand the last question to Tom; he provides URLs to at least 
one of the papers.

Also, if you can get your hands on it, the book by Peters, Braschler, and 
Clough is probably relevant: http://link.springer.com/book/10.1007%2F978-3-642-23008-0 
but, as the first article referenced by Tom says, the CLIR approach there 
relies on parallel corpora, e.g. ones created by automatic translation.


Paul




On 8 sept. 2014, at 07:33, Ilia Sretenskii <sreten...@multivi.ru> wrote:

> Thank you for the replies, guys!
> 
> Using a field-per-language approach for multilingual content is the last
> thing I would try, since my actual task is to implement search
> functionality offering roughly the same capabilities for every known
> world language.
> The closest references are the popular web search engines: they seem to
> serve worldwide users with their different languages and even
> cross-language queries as well.
> Thus, a field-per-language approach would be a sure waste of storage
> resources due to the high number of duplicates, since there are over 200
> known languages.
> I would really like to keep a single field for cross-language searchable
> text content, without splitting it into language-specific fields or
> language-specific cores.
> 
> So my current choice will be to stay with just the ICUTokenizer and
> ICUFoldingFilter as they are, without any language-specific
> stemmers/lemmatizers at all for now.
> 
> Probably I will put the most popular languages' stop word filters and
> stemmers into the same searchable text field to give it a try and see
> if it works correctly as a stack.
> Does stacking language-specific filters in one field work correctly?
> 
> Further development will most likely involve some advanced custom analyzers
> like the "SimplePolyGlotStemmingTokenFilter" to utilize the ICU-generated
> ScriptAttribute.
> http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/100236
> https://github.com/whateverdood/cross-lingual-search/blob/master/src/main/java/org/apache/lucene/sandbox/analysis/polyglot/SimplePolyGlotStemmingTokenFilter.java
> 
> So I would like to know more about those "academic papers on this issue of
> how best to deal with mixed language/mixed script queries and documents".
> Tom, could you please share them?
