Hi devs,

This issue has already been discussed [1] during the GSoC project, but I am not particularly happy with the chosen approach.

When handling multiple languages, there are generally [2][3] three different approaches:

1) Indexing the content in a single field (like title, doccontent, etc.)
   - This has the advantage that queries are clear and fast.
   - The disadvantage is that you cannot run well-tuned analyzers on the fields, having to resort to (at best) basic tokenization and lowercasing.

2) Indexing the content in multiple fields, one field for each language (like title_en, title_fr, doccontent_en, doccontent_fr, etc.)
   - This has the advantage that you can easily specify (as dynamic fields) that *_en fields are of type text_en (and analyzed by an English-centered chain of analyzers), *_fr fields are of type text_fr (focused on French), etc., thus making the results much better.
   - The disadvantage is that querying such a schema is a pain. If you want all the results in all languages, you end up with a big and expensive query. If you want just one language, you have to read the right fields (e.g. title_en) instead of just using a clear field name (title).
   -- Also, the schema.xml definition is static in this respect, requiring you to know beforehand which languages you want to support (for example when defining the default fields to search in). Adding a new language requires you to start editing the XML files by hand.

3) Indexing the content in different Solr cores (indexes), one for each language. Each core requires its own directory and configuration files.
   - The advantage is that queries are clean to write (like option 1) and that you have a nice separation.
   - The disadvantage is that it is difficult to get right (administrative issues), and then you also have the (considerable) problem of having to fix the relevancy score of a query result that has entries from different cores; each core computes its own relevancy and does not consider the others.
   - To make it even worse, it seems [5] that you cannot push the configuration files to a remote Solr instance when creating a new core programmatically. However, if we are running an embedded Solr instance, we could provide a way to generate the config files and write them to the data directory.

Currently I have implemented option 1) in our existing Solr integration, which is also more or less compatible with our existing Lucene queries, but I would like to find a better solution that actually analyses the content.

During GSoC, option 2) was preferred, but the implementation did not consider practical issues like the ones described above (query complexity, user configuration, etc.).

On a related note, I have also watched an interesting presentation [3] about how Drupal handles its Solr integration and, in particular, a plugin [4] that handles the multilingual aspect.

The idea seen there is that you have a UI that helps you generate configuration files, depending on your needs. For instance, you (the admin) check that you need search for English, French and German, and the UI/extension gives you a zip with the configuration you need to use in your (remote or embedded) Solr instance. The configuration for each language comes preset with the analyzers you should use for it and the additional resources (stopwords.txt, synonyms.txt, etc.).

This approach spares admins from having to edit XML files and could also be useful for other cases, not only option 2).

All these problems basically come from the fact that there is no way to specify in schema.xml that, based on the value of a field (like the field "lang" that stores the document language), you want to run this or that group of analyzers.

Perhaps a solution would be a custom kind of "AggregatorAnalyzer" that would call other analyzers at runtime, based on the value of the lang field.
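
To make the idea concrete, here is a minimal sketch in Python (not Lucene or Solr code) of what such an analyzer would do at index time. Everything in it is an assumption made for illustration: the names aggregator_analyzer, make_language_analyzer and ANALYZERS are hypothetical, and the tiny stopword lists and suffix stripping only stand in for real per-language analysis chains.

```python
# Toy illustration only: plain Python functions, not Lucene/Solr analyzers.
# The stopword lists and suffix rules below are made up for the example.

STOPWORDS = {
    "en": {"the", "a", "of", "and"},
    "fr": {"le", "la", "les", "de", "et"},
}


def basic_analyzer(text):
    """Language-neutral fallback: whitespace tokenization plus lowercasing."""
    return text.lower().split()


def make_language_analyzer(lang, suffixes):
    """Build a crude per-language chain: lowercase, drop stopwords, strip one suffix."""

    def strip_suffix(token):
        # Strip the first matching suffix, but leave very short tokens alone.
        for suffix in suffixes:
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                return token[: -len(suffix)]
        return token

    def analyze(text):
        tokens = basic_analyzer(text)
        return [strip_suffix(t) for t in tokens if t not in STOPWORDS[lang]]

    return analyze


# Hypothetical registry of per-language chains, keyed by the "lang" field value.
ANALYZERS = {
    "en": make_language_analyzer("en", ("ing", "s")),
    "fr": make_language_analyzer("fr", ("s",)),
}


def aggregator_analyzer(document, field):
    """Pick the chain from the document's "lang" value, else use the basic one."""
    analyze = ANALYZERS.get(document.get("lang"), basic_analyzer)
    return analyze(document[field])
```

For example, aggregator_analyzer({"lang": "en", "title": "The indexing of documents"}, "title") runs the English chain, while a document whose lang has no registered chain falls back to plain tokenization and lowercasing, i.e. the behaviour of option 1). Note that the dispatch depends entirely on the document carrying its lang value.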

However, this solution could only be applied at index time, when you have the lang information (in the solrDocument to be indexed). When you perform the query, you cannot analyze the query text, since you do not know the language of the field you are querying (it was determined at runtime, at index time) and thus do not know what operations to apply to the query (to reduce it to the same form as the indexed values).

I have also read another interesting analysis [6] on this problem that elaborates on the complexities and limitations of each option. (Ignore the Rosette stuff mentioned there.)

I have been thinking about this for some time now, but the solution is probably somewhere in between: finding an option that is acceptable while not restrictive.

I will probably also send a mail to the Solr list to get some more input from there, but I get the feeling that whatever solution we choose, it will most likely require the users to at least copy (or even edit) some files into some directories (configurations and/or jars), since it does not seem to be easy/possible to do everything on-the-fly, programmatically.

Any input on this would be highly appreciated, especially if others have more experience with Solr setups.

Thanks,
Eduard

----------
[1] http://markmail.org/message/kaxaka7lsbgo57ms
[2] http://lucidworks.lucidimagination.com/display/lweug/Multilingual+Indexing+and+Search
[3] http://drupalcity.de/session/language-specific-and-multilingual-full-text-searching
[4] http://drupal.org/project/apachesolr_multilingual
[5] http://stackoverflow.com/questions/4064880/create-new-core-directories-in-solr-on-the-fly
[6] http://info.basistech.com/blog/bid/171842/Indexing-Strategies-for-Multilingual-Search-with-Solr-and-Rosette

_______________________________________________
devs mailing list
[email protected]
http://lists.xwiki.org/mailman/listinfo/devs

