Hi,
Yes, I agree it is not an easy issue. Indexing all languages with the
appropriate char filter, tokenizer and filters for each language is not
possible without developing a new field type and a new analyzer.
If you plan to index up to 10 different languages, I suggest one text
field per language or one index per language.
One field for all languages can be interesting if you plan to index a
lot of different languages in the same index. In that case, having one
field per language (text_en, text_fr, ...) can be complicated if you
want the user to be able to retrieve documents in any language with a
single query. The query will be complex if you have 50 different
languages (text_en:... OR text_fr:... OR ...).
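For example, with only three languages a per-field query already looks like this (the field names and terms are just illustrative):

```
text_en:(solar energy) OR text_fr:(énergie solaire) OR text_de:(Solarenergie)
```

With 50 languages this becomes 50 OR clauses, one per language field.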
In order to achieve this you will need to develop a specific analyzer.
This analyzer will be in charge of using the correct char filter,
tokenizer and filters for the language of the document. You will need a
configurable analyzer in order to change language-specific settings
(enable stemming or not, choose a specific stopwords file, ...).
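To make concrete what such an analyzer has to reproduce internally, here is a sketch of two ordinary per-language analyzer chains as they would appear in schema.xml (the filter choices and stopword file names are only examples, not a recommendation):

```xml
<!-- English: standard tokenizer, English stopwords, Porter stemming -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords_en.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<!-- French: same pattern, but with elision and a French stemmer -->
<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ElisionFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords_fr.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French"/>
  </analyzer>
</fieldType>
```

A multilingual analyzer has to select the equivalent of one of these chains at analysis time, depending on the language of the content.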
I did this several years ago for Solr 1.4.1, and it still works with
Solr 3.x. The drawback of that analyzer is that all language settings
are hard coded (tokenizer, filters, stopwords, ...). With Solr 4.0 the
analyzer does not work anymore, so I decided to redevelop it in order
to be able to configure all language settings in an external
configuration file, with nothing hardcoded.
I had to develop not only the analyzer but also a field type.
The main issue is that the analyzer is not aware of the values of the
other fields, so it is not possible to use another field to specify the
content language. The only way I found is to start the content with a
specific char sequence: [en]... or [fr]...
The analyzer needs to know the language of the query too, so the query
criteria for the multilingual field have to include the same char
sequence: [en]...
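To make the convention concrete, here is a sketch of how documents would be fed to such a multilingual field (the field name text_ml is just illustrative):

```xml
<add>
  <doc>
    <field name="id">1</field>
    <!-- the leading marker tells the analyzer which chain to use -->
    <field name="text_ml">[fr]Bonjour tout le monde</field>
  </doc>
  <doc>
    <field name="id">2</field>
    <field name="text_ml">[en]Hello everybody</field>
  </doc>
</add>
```

At query time the criteria carry the same marker, e.g. text_ml:"[en]hello" (how the brackets interact with the query parser depends on the field type implementation).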
If you are interested in this work, let me know.
If someone knows another way to provide the analyzer with the content
language at index time or the query language at query time, I am
interested :).
Regards.
Dominique
On 05/04/12 23:36, Erick Erickson wrote:
This is really difficult to imagine working well. Even if you
do choose the appropriate analysis chain (and it must
be a chain here), and manage to appropriately tokenize
for each language, what happens at query time?
How do you expect to get matches on, say, Ukrainian when
the tokens of the query are in Erse?
This feels like an XY problem; can you explain at a
higher level what your requirements are?
Best
Erick
On Wed, Apr 4, 2012 at 8:29 AM, Prakashganesh, Prabhu
<prabhu.prakashgan...@dowjones.com> wrote:
Hi,
I have documents in different languages and I want to choose the
tokenizer to use for a document based on the language of the document. The
language of the document is already known and is indexed in a field. What I
want to do is when I index the text in the document, I want to choose the
tokenizer to use based on the value of the language field. I want to use one
field for the text in the document (defining multiple fields for each language
is not an option). It seems like I can define a tokenizer for a field, so I
guess what I need to do is to write a custom tokenizer that looks at the
language field value of the document and calls the appropriate tokenizer for
that language (e.g. StandardTokenizer for English, CJKTokenizer for CJK
languages etc.). From whatever I have read, it seems quite straightforward to
write a custom tokenizer, but how would this custom tokenizer know the language
of the document? Is there some way I can pass in this value to the tokenizer?
Or is there some way the tokenizer will have access to other fields in the
document? It would be really helpful if someone could provide an answer.
Thanks
Prabhu