Re: AW: Best practices for multiple languages?

Bill Janssen Wed, 19 Jan 2011 10:22:38 -0800

Clemens Wyss <[email protected]> wrote:

> > 1) Docs in different languages -- every document is one language
> > 2) Each document has fields in different languages
> We mainly have 1)-models


I've recently done this for UpLib.  I run a language-guesser over the
document to identify the primary language when the document comes into
my repository, and save that language as part of the metadata for my
document.  When UpLib indexes the document into Lucene, it uses that
language as a key into a table of available Analyzers, and uses the
selected Analyzer for the document's text.  (I'm actually doing this on
a per-paragraph level now, but the principle is the same.)

The tricky part is the query parser.  My extended query parser allows
a pseudo-field "_query_language" to specify that the query itself is in
a particular language, in which case the appropriate Analyzer is used
for the query.

You can read the code for all this at
<http://uplib.parc.com/hg/uplib/file/e29e36f751f7/python/uplib/indexing.py>.

Bill

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: AW: Best practices for multiple languages?

Reply via email to