RE: Multi-language indexing and searching

Ken Krugler Tue, 12 Jun 2007 09:54:34 -0700

Daniel,
I was reading your email and responses to it with great
interest.


I was aware that Solr has an implicit assumption that
a field is mono-lingual per system. But your mail and
its correspondence made me wonder if this limitation
is practical for multi-lingual search applications.  For bi-lingual
or tri-lingual search, we can have parallel fields (title_en,
title_fr, title_de, for example) but this wouldn't scale well.
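
To make the scaling concern concrete, here is a minimal Python sketch of
the parallel-field approach. Only the title_<lang> naming comes from the
example above; the helper names (field_for, build_doc, build_query) are
hypothetical and are not part of Solr.

```python
def field_for(base, lang):
    """Return the per-language field name, e.g. title_en (naming is an assumption)."""
    return f"{base}_{lang}"


def build_doc(doc_id, titles):
    """Build a flat document with one title field per language present."""
    doc = {"id": doc_id}
    for lang, text in titles.items():
        doc[field_for("title", lang)] = text
    return doc


def build_query(term, langs):
    """Build a query string that checks the term in every per-language field."""
    return " OR ".join(f"{field_for('title', lang)}:{term}" for lang in langs)


doc = build_doc("book-1", {"en": "The Tale of Genji", "fr": "Le Dit du Genji"})
query = build_query("genji", ["en", "fr", "de"])
print(doc)
print(query)
```

Each additional language adds one more field to the schema and one more
clause to every query, which is why this gets unwieldy well before 50
languages.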

Assume we are building a search application for a multi-lingual
library at a university in Japan, for example. The application
would have a book title field in Japanese, perhaps another title
field in English for visiting scholars, and a title field in the
original language.  The language of that last field would vary
among more than 50 modern languages (and not-so-modern languages
like Latin).  Solr may need some rearchitecting in this area.


[snip]

One idea I thought about here, if any given document/field set would
only contain text for a single language, was to write out a special
token with the language name. E.g. have your analyzer add a
"my-special-token-prefix-esperanto" token to the field, and then at
query time (assuming you know the language) make this a required term.
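
A rough sketch of that idea, using a toy in-memory inverted index in
Python rather than Lucene itself. The TinyIndex class and analyze
function are hypothetical stand-ins for illustration; only the marker
token prefix comes from the text above.

```python
from collections import defaultdict

MARKER_PREFIX = "my-special-token-prefix-"


def analyze(text, lang):
    """Toy analyzer: lowercase, split on whitespace, append the language marker."""
    tokens = text.lower().split()
    tokens.append(MARKER_PREFIX + lang)
    return tokens


class TinyIndex:
    """Minimal inverted index: token -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text, lang):
        for token in analyze(text, lang):
            self.postings[token].add(doc_id)

    def search(self, term, lang):
        """Return ids containing the term AND the marker for the given language."""
        required = self.postings.get(MARKER_PREFIX + lang, set())
        matches = self.postings.get(term.lower(), set())
        return sorted(matches & required)


index = TinyIndex()
index.add(1, "la vida", "spanish")
index.add(2, "la vie", "french")
index.add(3, "la vivo", "esperanto")

print(index.search("la", "esperanto"))
```

The term "la" occurs in all three documents, but making the marker a
required term restricts the hits to the one indexed as Esperanto.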


-- Ken


I work for a company called Basis Technology,
(www.basistech.com) which develops a suite of language
processing software and I've written a module to integrate
this with Solr (and Lucene in general).  The module is
made of a universal Tokenizer and Analyzers for English and
Japanese, but they can be modified easily to handle any of
the 16 languages we can handle. (Source code is provided.)

When I was developing this module, I thought of writing
a super Analyzer that automatically detects the language
and does the right thing.  But I've found this won't fit
well with the design of Lucene and Solr.  For one thing,
there is no way to save the detected language in the field
if the language is detected within the Analyzer.  Lucene and Solr
require that the language be known before an Analyzer can be
instantiated, and it's the Analyzer that detects the language in my
design....  A second obstacle is that the kinds of Filters
the Analyzer uses depend on the language, so they must be
changed dynamically. This could be done programmatically but
it's not easy.  My big hope is that we can work together to
come up with some way for the language detected within
the Analyzer to be retrieved and stored in the field.
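
The ordering constraint described above can be illustrated with a small
Python sketch in which detection runs before any analyzer is chosen, so
the detected language can be stored alongside the tokens. This is not
the module described in this mail and does not use Lucene or Solr APIs;
the stopword-based detector, the ANALYZERS registry, and all function
names are assumptions made for illustration.

```python
# Tiny stopword lists used both for toy detection and toy filtering (assumed data).
STOPWORDS = {
    "en": {"the", "and", "of"},
    "fr": {"le", "la", "et"},
}


def detect_language(text):
    """Toy detector: pick the language whose stopwords overlap the text most."""
    words = set(text.lower().split())
    return max(STOPWORDS, key=lambda lang: len(words & STOPWORDS[lang]))


def make_analyzer(lang):
    """Build a language-specific analyzer; the filter chain depends on lang."""
    stop = STOPWORDS[lang]

    def analyzer(text):
        return [w for w in text.lower().split() if w not in stop]

    return analyzer


# The language must be known before an analyzer can be selected.
ANALYZERS = {lang: make_analyzer(lang) for lang in STOPWORDS}


def index_document(text):
    """Detect first, then analyze, keeping the language as its own field."""
    lang = detect_language(text)
    return {"language": lang, "tokens": ANALYZERS[lang](text)}


doc = index_document("le chat et la souris")
print(doc)
```

Because detection happens outside the analyzer here, the result is
available to store in a separate field, which is exactly what is hard
to do when detection happens inside the Analyzer.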

Anyway, if you are interested in trying my multi-lingual
Analyzers, please contact me in private email.

Regards,
-kuro



--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"
