Hi,
Yes, I agree it is not an easy issue. Indexing all languages with the
appropriate char filter, tokenizer and filters for each language is not
possible without developing a new field type and a new analyzer.
If you plan to index up to 10 different languages, I suggest one text
field per language or one index per language.
One field for all languages can be interesting if you plan to index a
lot of different languages in the same index. In that case, having one
field per language (text_en, text_fr, ...) can be complicated if you
want the user to be able to retrieve documents in any language with a
single query. The query will be complex if you have 50 different
languages (text_en:... OR text_fr:... OR ...).
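For example, with only three languages a per-field query already looks like this (the field names and terms are just illustrative):

```
text_en:(solar energy) OR text_fr:(énergie solaire) OR text_de:(Solarenergie)
```

With 50 languages this becomes 50 OR clauses, one per language field.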
In order to achieve this you will need to develop a specific analyzer.
This analyzer will be in charge of using the correct char filter,
tokenizer and filters for the language of the document. You will need a
configurable analyzer in order to change language-specific settings
(enable stemming or not, choose a specific stopwords file, ...).
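To make concrete what such an analyzer has to reproduce internally, here is a sketch of two ordinary per-language analyzer chains as they would appear in schema.xml (the filter choices and stopword file names are only examples, not a recommendation):

```xml
<!-- English: standard tokenizer, English stopwords, Porter stemming -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords_en.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<!-- French: same pattern, but with elision and a French stemmer -->
<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ElisionFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords_fr.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French"/>
  </analyzer>
</fieldType>
```

A multilingual analyzer has to select the equivalent of one of these chains at analysis time, depending on the language of the content.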
I did this several years ago for Solr 1.4.1, and it still works with
Solr 3.x. The drawback of that analyzer is that all language settings
are hard coded (tokenizer, filters, stopwords, ...). With Solr 4.0 the
analyzer does not work anymore, so I decided to redevelop it in order
to be able to configure all language settings in an external
configuration file, with nothing hardcoded.
I had to develop not only the analyzer but also a field type.
The main issue is that the analyzer is not aware of the values of the
other fields, so it is not possible to use another field to specify the
content language. The only way I found is to start the content with a
specific char sequence: [en]... or [fr]...
The analyzer needs to know the language of the query too, so the query
criteria for the multilingual field have to include the same char
sequence: [en]...
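To make the convention concrete, here is a sketch of how documents would be fed to such a multilingual field (the field name text_ml is just illustrative):

```xml
<add>
  <doc>
    <field name="id">1</field>
    <!-- the leading marker tells the analyzer which chain to use -->
    <field name="text_ml">[fr]Bonjour tout le monde</field>
  </doc>
  <doc>
    <field name="id">2</field>
    <field name="text_ml">[en]Hello everybody</field>
  </doc>
</add>
```

At query time the criteria carry the same marker, e.g. text_ml:"[en]hello" (how the brackets interact with the query parser depends on the field type implementation).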
If you are interested in this work, let me know.
If someone knows another way to provide the analyzer with the content
language at index time or the query language at query time, I am
interested :).
Regards.
Dominique
On 05/04/12 23:36, Erick Erickson wrote:
This is really difficult to imagine working well. Even if you
do choose the appropriate analysis chain (and it must
be a chain here), and manage to appropriately tokenize
for each language, what happens at query time?
How do you expect to get matches on, say, Ukrainian when
the tokens of the query are in Erse?
This feels like an XY problem; can you explain at a
higher level what your requirements are?
Best
Erick
On Wed, Apr 4, 2012 at 8:29 AM, Prakashganesh, Prabhu
<prabhu.prakashgan...@dowjones.com> wrote:
Hi,
I have documents in different languages and I want to choose the
tokenizer to use for a document based on the language of the document. The
language of the document is already known and is indexed in a field. What I
want to do is when I index the text in the document, I want to choose the
tokenizer to use based on the value of the language field. I want to use one
field for the text in the document (defining multiple fields for each language
is not an option). It seems like I can define a tokenizer for a field, so I
guess what I need to do is to write a custom tokenizer that looks at the
language field value of the document and calls the appropriate tokenizer for
that language (e.g. StandardTokenizer for English, CJKTokenizer for CJK
languages etc.). From whatever I have read, it seems quite straightforward to
write a custom tokenizer, but how would this custom tokenizer know the language
of the document? Is there some way I can pass in this value to the tokenizer?
Or is there some way the tokenizer will have access to other fields in the
document? It would be really helpful if someone could provide an answer.
Thanks
Prabhu