Hi,

Yes, I agree it is not an easy issue. Indexing all languages with the appropriate char filter, tokenizer and filters for each language is not possible without developing a new field type and a new analyzer.

If you plan to index up to 10 different languages, I suggest using one text field per language or one index per language.

One field for all languages can be interesting if you plan to index a lot of different languages in the same index. In that case, having one field per language (text_en, text_fr, ...) can be complicated if you want the user to be able to retrieve documents in any language with a single query. The query becomes complex if you have 50 different languages (text_en:... OR text_fr:... OR ...).

To achieve this you will need to develop a specific analyzer. This analyzer is in charge of using the correct char filter, tokenizer and filters for the language of the document. You will also need the analyzer to be configurable, in order to change language-specific settings (enable stemming or not, choose a specific stopwords file, ...).
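To give an idea, here is a minimal sketch of a per-language registry that such an analyzer could delegate to (the class name is hypothetical and the pre-built Lucene 4.x chains are only examples; in a real implementation the chains would come from the external configuration file):

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

// Hypothetical sketch: one analysis chain per language.
// A real implementation would build each chain (char filter,
// tokenizer, filters, stopwords file, stemming on/off) from an
// external configuration file instead of hard coding it here.
public class LanguageAnalyzerRegistry {

  private final Map<String, Analyzer> byLang = new HashMap<String, Analyzer>();
  private final Analyzer fallback;

  public LanguageAnalyzerRegistry(Version version) {
    byLang.put("en", new EnglishAnalyzer(version));
    byLang.put("fr", new FrenchAnalyzer(version));
    // fallback chain for unknown or untagged languages
    fallback = new StandardAnalyzer(version);
  }

  public Analyzer forLanguage(String lang) {
    Analyzer analyzer = byLang.get(lang);
    return analyzer != null ? analyzer : fallback;
  }
}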

I did this several years ago for Solr 1.4.1, and it still works with Solr 3.x. The drawback of this analyzer is that all language settings are hard coded (tokenizer, filters, stopwords, ...). With Solr 4.0, the analyzer does not work anymore, so I decided to redevelop it in order to be able to configure all language settings in an external configuration file, with nothing hard coded.

I had to develop not only the analyzer but also a field type.

The main issue is in fact that the analyzer is not aware of the values of the other fields, so it is not possible to use another field to specify the content language. The only way I found is to start the content with a specific char sequence: [en]... or [fr]... The analyzer needs to know the language of the query too, so query criteria for the multilingual field have to include the same char sequence: [en]...
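To illustrate the idea, here is a simplified sketch of the tag detection (a hypothetical helper, not my actual code):

import java.io.IOException;
import java.io.PushbackReader;

// Hypothetical sketch of the prefix trick: look at the first
// characters of the content for a tag such as "[en]" and strip it;
// if no tag is present, push the characters back so the fallback
// chain sees the original text.
public class LanguageTag {

  // The caller must create the PushbackReader with a pushback
  // buffer of at least 4 chars: new PushbackReader(reader, 4).
  // Note: a single read() call may return fewer chars than asked;
  // a real implementation should loop until 4 chars are read or
  // the stream ends.
  public static String read(PushbackReader in) throws IOException {
    char[] buf = new char[4]; // '[' + two-letter code + ']'
    int n = in.read(buf);
    if (n == 4 && buf[0] == '[' && buf[3] == ']') {
      return new String(buf, 1, 2); // e.g. "en" or "fr"
    }
    if (n > 0) {
      in.unread(buf, 0, n); // no tag found: restore the stream
    }
    return null;
  }
}

The same helper can be used on the query side, since the query criteria for the multilingual field carry the same [en] prefix.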

If you are interested in this work, let me know.

If someone knows another way than mine to provide the analyzer with the content language at index time or the query language at query time, I am interested :).

Regards.

Dominique

On 05/04/12 23:36, Erick Erickson wrote:
This is really difficult to imagine working well. Even if you
do choose the appropriate analysis chain (and it must
be a chain here), and manage to appropriately tokenize
for each language, what happens at query time?

How do you expect to get matches on, say, Ukrainian when
the tokens of the query are in Erse?

This feels like an XY problem. Can you explain at a
higher level what your requirements are?

Best
Erick

On Wed, Apr 4, 2012 at 8:29 AM, Prakashganesh, Prabhu
<prabhu.prakashgan...@dowjones.com>  wrote:
Hi,
      I have documents in different languages and I want to choose the 
tokenizer to use for a document based on the language of the document. The 
language of the document is already known and is indexed in a field. What I 
want to do is when I index the text in the document, I want to choose the 
tokenizer to use based on the value of the language field. I want to use one 
field for the text in the document (defining multiple fields for each language 
is not an option). It seems like I can define a tokenizer for a field, so I 
guess what I need to do is to write a custom tokenizer that looks at the 
language field value of the document and calls the appropriate tokenizer for 
that language (e.g. StandardTokenizer for English, CJKTokenizer for CJK 
languages etc.). From whatever I have read, it seems quite straightforward to 
write a custom tokenizer, but how would this custom tokenizer know the language 
of the document? Is there some way I can pass in this value to the tokenizer? 
Or is there some way the tokenizer will have access to other fields in the 
document? It would be really helpful if someone could provide an answer.

Thanks
Prabhu
