Hi,

I have chosen the same approach as you, indexing content into text_<language> 
fields with custom analysis, and it works great. Solr does not have any 
overhead with this even if there are hundreds of languages, due to the 
schema-less nature of Lucene.

And if you know which language is being searched, you can search only the 
field for that language, and you'd still be as fast as in the monolingual case. 
But you'd only get documents in that language returned.
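
As a rough sketch of restricting a dismax query to one language, the Python 
snippet below builds the request parameters. The function name 
single_language_params is made up for illustration, and it assumes the fields 
follow the text_<language> naming used in this thread; defType, q and qf are 
the usual Solr request parameters.

```python
from urllib.parse import urlencode


def single_language_params(user_query, lang):
    """Build dismax parameters that search only the text_<lang> field.

    Hypothetical helper: assumes one analyzed field per language,
    named text_<lang> (text_en, text_fr, ...), as in this thread.
    """
    return {
        "defType": "dismax",   # use the dismax query parser
        "q": user_query,       # the raw user query
        "qf": f"text_{lang}",  # search only the field for this language
    }


params = single_language_params("obama", "en")
query_string = urlencode(params)  # ready to append to a /select URL
```

Only documents with content in text_en can match such a query, which is why 
the result set is limited to that language.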

Say you want to match across languages, for example a search for "obama", 
which is written the same way in many languages. How do you achieve this? I see 
two approaches:
a) Search across all languages with proper analysis, as you suggest: qf=text_fr 
text_en^10 (you can even boost the preferred languages).
b) Index all content in a "text_all" field with no stemming involved and search 
qf=text_all (you will match "obama" in all languages but lose stemming).

My feeling is that a) would work if you have a limited set of languages, but b) 
might be necessary if you have dozens of languages to search across, due to 
reduced query performance with such a large disMax query.

Of course, with a) there may be ambiguities where an English word gets stemmed 
to the same stem as a totally different French word. I don't have any 
hands-on examples, but I'm sure the issue exists. Then it is probably better to 
search the other languages unstemmed, as a hybrid approach:

c) Search the query language stemmed and all others unstemmed (qf=text_en^10 
text_all, giving increased recall)
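
To make the three variants concrete, here is a small Python sketch that builds 
the qf string for a), b) and c). The function names and the default boost of 10 
are only illustrative assumptions; the field names text_<language> and text_all 
are the ones used above.

```python
def qf_all_languages(languages, preferred, boost=10):
    """Approach a): one text_<lang> field per language, preferred one boosted."""
    parts = []
    for lang in languages:
        field = f"text_{lang}"
        # Only the preferred language gets the ^boost suffix.
        parts.append(f"{field}^{boost}" if lang == preferred else field)
    return " ".join(parts)


def qf_unstemmed():
    """Approach b): a single unstemmed catch-all field."""
    return "text_all"


def qf_hybrid(preferred, boost=10):
    """Approach c): stemmed field for the query language plus text_all."""
    return f"text_{preferred}^{boost} text_all"
```

The length of the qf string in a) grows with the number of languages, while b) 
and c) stay at one or two fields regardless of how many languages are indexed.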

The downside of a text_all field is that, in the worst case, you almost double 
the size of your index.

Then you have the issue of displaying the results in the frontend.
Which title do you pick? title_en or title_fr? Here, I also see two solutions 
and I have tried both:
1) Store a title_display which is stored, while the title_<language> fields are 
only indexed, not stored. Use the title_display in the frontend.
2) Make a wrapper around the QueryResult class so that when the frontend asks 
for "title", you intelligently try to pull out title_XY, where XY is taken from 
the document's "language" metadata.
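
A minimal sketch of solution 2) in Python, assuming each result document is 
available as a plain dict of stored fields with a "language" entry. The class 
name LocalizedDoc and its get method are hypothetical and not part of any Solr 
client library.

```python
class LocalizedDoc:
    """Wraps one result document and resolves name -> name_<language>."""

    def __init__(self, fields):
        # fields: dict of stored field values for one document (assumption)
        self._fields = fields

    def get(self, name, default=None):
        lang = self._fields.get("language")
        if lang is not None:
            localized = f"{name}_{lang}"
            if localized in self._fields:
                # e.g. "title" on a French document resolves to "title_fr"
                return self._fields[localized]
        # Fall back to the plain field name, then to the default.
        return self._fields.get(name, default)


doc = LocalizedDoc({"language": "fr", "title_fr": "Bonjour", "id": "42"})
title = doc.get("title")
```

Note that this only works if the title_<language> fields are stored, unlike in 
solution 1).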

Which one you choose is a matter of taste; each has its pros and cons.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 1. juli 2010, at 12.26, Saïd Radhouani wrote:

> Hi,
> 
> I know this topic has been treated many times in the (distant) past, but I 
> wonder whether there are new better practices/tendencies.
> 
> In my application, I'm dealing with documents in different languages. Each 
> document is monolingual; it has some fields containing free text and a set of 
> fields that do not require any text analysis. For the free text, we need to 
> make a specific analysis based on the language of the document.
> 
> I'm for the use of a single index for all the documents instead of one index 
> per language (any objection?). Thus, in schema.xml, I need to declare a 
> separate field for each language (text_fr, text_en, etc.), each with its own 
> appropriate analysis. Then, during the indexing, I need to assign the free 
> text content of each document to the appropriate field. Thus, for each 
> document, only one of the freetext fields would be populated.
> 
> My question is, at search time, what is the best solution to search against 
> the appropriate field?
> 
> I know that using dismax, we can define in "qf" the set of fields we want to 
> search against. e.g., <str name="qf"> text_fr text_en</str>
> 
> With this solution, does Solr choose the appropriate analysis for the query? 
> I.e., if a query is compared to a document having English free text (text_en 
> is populated), does Solr analyze the query as if it were in English?
> 
> One problem with this approach is that, each query will be compared to all 
> the available documents. i.e., a query in English would be compared to a 
> document in French. As far as I know, if we know the query language, this problem 
> can be avoided, either by searching against the appropriate field (e.g., 
> text_fr:query), or by using a filter to select only those documents having 
> English text. Am I correct? Or is there a better solution?
> 
> Thanks,
> -Saïd
> 

