Another option would be to use a multi-core configuration, one core per 
language.  If you're using the Java client from 1.3, you could then just have a 
base URL that you append the language string to in order to pick which core 
you're searching over (http://searchserver:1234/solr/en, 
http://searchserver:1234/solr/sp, etc.).  It wouldn't work if you were searching 
across languages, but I'm not sure how likely a Spanglish query is anyway.
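A minimal sketch of the core-per-language idea, using the host, port, and language codes from the URLs above (all illustrative). In SolrJ 1.3 you would hand the resulting URL to a CommonsHttpSolrServer; the selection logic itself is just string assembly:

```java
import java.util.Set;

// Sketch (not tested against a live server): pick the per-language core URL
// before constructing the SolrJ client. Host, port, and the "sp" code are
// taken from the example URLs in the message; falling back to the English
// core for unknown languages is an assumption, not Solr behavior.
public class CoreSelector {
    private static final String BASE = "http://searchserver:1234/solr/";
    private static final Set<String> KNOWN_CORES = Set.of("en", "sp");

    // Returns the core URL for the given language code, defaulting to the
    // English core when no dedicated core exists.
    public static String coreUrlFor(String lang) {
        return BASE + (KNOWN_CORES.contains(lang) ? lang : "en");
    }
}
```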

Andy Warner
Appraisal/Assessment Product Lead
 
Tyler Technologies, Inc.
14142 Denver W Pkwy, Suite 155, Lakewood, CO 80401
[Phone] 303.271.9100
[Fax] 303.271.1930
[E-Mail] [EMAIL PROTECTED]
 
NYSE: TYL
www.tyler-eagle.com
-----Original Message-----
From: Eli K [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, May 07, 2008 2:35 PM
To: solr-user@lucene.apache.org
Subject: Re: multi-language searching with Solr

Gereon,

I think that you must have the same schema on each shard, but I am not
sure whether it must also have the same analyzers.
These are shards of one index, not multiple indexes.  There is
probably a way to get each shard to contain one language, but then you
end up with x servers for x languages, and some will be underutilized
while others will be overutilized.

Add to that failover and fault tolerance and you end up with a
maintenance nightmare.  Also, how would you scale this?

Of course I am still pretty new to search and Solr/Lucene, so I might be wrong :)
The solutions suggested by Peter and Mike (different fields per
language, or prefixing the language string onto every term) are
starting to look better and better.
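For the field-per-language approach, a hypothetical schema.xml fragment might look like the following; the field and type names are made up for illustration, but the tokenizer and filter factories are the standard ones shipped with Solr:

```xml
<!-- Hypothetical fragment: one field type per language, each with its own
     Snowball stemmer, and one body field per language. -->
<types>
  <fieldType name="text_en" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_fr" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="French"/>
    </analyzer>
  </fieldType>
</types>
<fields>
  <field name="body_en" type="text_en" indexed="true" stored="true"/>
  <field name="body_fr" type="text_fr" indexed="true" stored="true"/>
</fields>
```

Each document would then populate only the body_* field matching its language, and queries would target the field for the user's language.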

Is it possible to write an analyzer wrapper that will also be aware of
the locale field in the document and delegate processing to the
appropriate analyzer?
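One caveat on the wrapper idea: a stock Lucene Analyzer only sees the field name and a Reader, not the rest of the document, so the locale would have to be passed in explicitly (e.g. chosen per document by whatever code feeds the indexer). The delegation pattern itself is straightforward; here is a sketch with the analyzers reduced to plain tokenizing functions so it stands alone (a real version would hold org.apache.lucene.analysis.Analyzer instances instead, and all names here are hypothetical):

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Sketch of a locale-delegating "analyzer": pick a per-locale tokenizing
// function based on the document's locale value, with a fallback for
// locales that have no registered delegate.
public class LocaleDelegatingAnalyzer {
    private final Map<String, Function<String, List<String>>> delegates;
    private final Function<String, List<String>> fallback;

    public LocaleDelegatingAnalyzer(Map<String, Function<String, List<String>>> delegates,
                                    Function<String, List<String>> fallback) {
        this.delegates = delegates;
        this.fallback = fallback;
    }

    // Tokenize 'text' with the delegate registered for 'locale'.
    public List<String> analyze(String locale, String text) {
        return delegates.getOrDefault(locale, fallback).apply(text);
    }
}
```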

Thanks,

Eli



On Wed, May 7, 2008 at 3:46 PM, Gereon Steffens <[EMAIL PROTECTED]> wrote:
> I have the same requirement, and from what I understand the distributed
> search feature will help implement this, by having one shard per
> language. Am I right?
>
>  Gereon
>
>
>
>
>  Mike Klaas wrote:
>
> > On 5-May-08, at 1:28 PM, Eli K wrote:
> >
> >
> > > Wouldn't this impact both indexing and search performance and the size
> > > of the index?
> > > It is also probable that I will have more than one free-text field
> > > later on, and with at least 20 languages this approach does not seem
> > > very manageable.  Are there other options for making this work with
> > > stemming?
> > >
> >
> > If you want stemming, then you have to execute one query per language
> anyway, since the stemming will be different in every language.
> >
> > This is a fundamental requirement: you somehow need to track the language
> of every token if you want correct multi-language stemming.  The easiest way
> to do this would be to split each language into its own field.  But there
> are other options: you could prefix every indexed token with the language:
> >
> > en:The en:quick en:brown en:fox en:jumped ...
> > fr:Le fr:brun fr:renard fr:vite fr:a fr:sauté ...
> >
> > Separate fields seems easier to me, though.
> >
> > -Mike
> >
> >
>
>
>
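Mike's prefix scheme above can be sketched as a simple token transform. In Solr this would live in a custom TokenFilter applied at both index and query time; the class and method names below are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the language-prefix scheme: tag every token with its language
// code so that, e.g., an English token and an identical-looking French
// token can never match each other in one shared field.
public class LanguagePrefixer {
    public static List<String> prefixTokens(String lang, String text) {
        List<String> out = new ArrayList<>();
        for (String tok : text.split("\\s+")) {
            if (!tok.isEmpty()) {
                out.add(lang + ":" + tok);
            }
        }
        return out;
    }
}
```

The same transform must be applied to query terms, with the query's language detected or supplied by the caller, or nothing will match.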
