I think you would have to declare a separate field for each language
(freetext_en, freetext_fr, etc.), each with its own field type and
language-appropriate stemming. Your ingestion process would have to
assign each document's free text to the field for its language, so
only one of the freetext fields would be populated per document. At
search time, you would either search against the matching field if you
know the search language, or search across all of them with
"freetext_fr:query OR freetext_en:query OR ...". That way the query is
analyzed separately for each field, using that language's stemming
rules.
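As a rough client-side sketch of that routing (not tested against Solr
1.2): the freetext_<lang> naming follows the examples above, while the
language list, the "id" field and the helper names are assumptions made
up for illustration. Query terms are not escaped here.

```python
# Hypothetical convention: one field per language, named freetext_<code>.
LANGUAGES = ["en", "fr", "de"]  # illustrative subset of the 20 languages


def field_for(lang):
    """Return the per-language free-text field name, e.g. 'freetext_fr'."""
    if lang not in LANGUAGES:
        raise ValueError("unsupported language: %r" % lang)
    return "freetext_" + lang


def build_document(doc_id, lang, text):
    """Populate only the free-text field matching the document's language."""
    return {"id": doc_id, field_for(lang): text}


def build_query(terms, lang=None):
    """Search one field if the language is known, else OR across all fields."""
    langs = [lang] if lang else LANGUAGES
    clauses = ["%s:(%s)" % (field_for(code), terms) for code in langs]
    return " OR ".join(clauses)
```

With these helpers, build_query("chevaux", "fr") targets only
freetext_fr, while build_query("horses") expands to one clause per
configured language, so each field applies its own stemming.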

Other options for combining indexes, such as copyField or dynamic
fields (see http://wiki.apache.org/solr/SchemaXml), would lead to a
single field type and therefore a single type of stemming. You could
still use copyField to create a common unstemmed index for searches
across languages, if you don't care about stemming there. Stemming a
shared field would not help much anyway, since you're likely to get odd
results when a query in one language is stemmed according to the rules
of another language.

Peter

-----Original Message-----
From: Eli K [mailto:[EMAIL PROTECTED] 
Sent: Monday, May 05, 2008 8:27 AM
To: solr-user@lucene.apache.org
Subject: multi-language searching with Solr

Hello folks,

Let me start by saying that I am new to Lucene and Solr.

I am in the process of designing a search back-end for a system that
receives 20k documents a day and needs to keep them available for 30
days.  The documents should be searchable on a free text field and on
about 8 other fields.

One of my requirements is to index and search documents in multiple
languages.  I would like to have the ability to stem and provide the
advanced search features that are based on it.  This will only affect
the free text field because the rest of the fields are in English.

I can find out the language of the document before indexing and I might
be able to provide the language to search on.  I also need to have the
ability to search across all indexed languages (there will be 20 in
total).

Given these requirements do you think this is doable with Solr?  A major
limiting factor is that I need to stick to the 1.2 GA version and I
cannot utilize the multi-core features in the 1.3 trunk.

I considered writing my own analyzer that will call the appropriate
Lucene analyzer for the given language but I did not see any way for it
to access the field that specifies the language of the document.

Thanks,

Eli

p.s. I am looking for an experienced Lucene/Solr consultant to help with
the design of this system.
