Hello Eduard,

it's nice to see you take this further.

> This issue has already been previously [1] discussed during the GSoC
> project, but I am not particularly happy with the chosen approach.
> When handling multiple languages, there are generally[2][3] 3 different
> approaches:
> 
> 1) Indexing the content in a single field (like title, doccontent, etc.)
> - This has the advantage that queries are clear and fast
> - The disadvantage is that you can not run very well tuned analyzers on the
> fields, having to resort to (at best) basic tokenization and lowercasing.
> 
> 2) Indexing the content in multiple fields, one field for each language
> (like title_en, title_fr, doccontent_en, doccontent_fr, etc.)
> - This has the advantage that you can easily specify (as dynamic fields)
> that *_en fields are of type text_en (and analyzed by an english-centered
> chain of analyzers); *_fr of type text_fr (focused on french, etc.), thus
> making the results much better.

I would add one more pair of fields here: title_ws and text_ws, where the full 
text is analyzed just as words (using the whitespace tokenizer?).
A match there would even be preferred over a match in the language-specific 
text fields below.

(maybe those would simply be called title and text?)
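As a sketch, option 2 plus these whitespace fields could look like this in 
schema.xml (the field names are illustrative; text_en and text_fr are field 
types shipped with Solr's example schema):

```xml
<!-- language-neutral fields, tokenized on whitespace only (illustrative names) -->
<field name="title_ws" type="text_ws" indexed="true" stored="true"/>
<field name="text_ws"  type="text_ws" indexed="true" stored="true"/>

<!-- one dynamic field per language, analyzed by a language-specific chain -->
<dynamicField name="*_en" type="text_en" indexed="true" stored="true"/>
<dynamicField name="*_fr" type="text_fr" indexed="true" stored="true"/>

<fieldType name="text_ws" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```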

> - The disadvantage is that querying such a schema is a pain. If you want
> all the results in all languages, you end up with a big and expensive
> query.

Why is this an issue?
Dismax does it for you for free (thanks to the "qf" parameter, which assigns a 
weight to each of the fields).
This only becomes an issue once you have more than 100 languages or so...
Lucene, the underlying engine of Solr, handles thousands of clauses in a query 
without a problem (this is how prefix queries are handled: they are expanded 
into a query over every term that matches the prefix; a setting buried in the 
configuration, maxBooleanClauses, 1024 by default, keeps this from exploding).
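For instance, a single dismax request can cover all languages at once just by 
listing the fields in qf (field names assume the per-language layout of option 
2; the weights are illustrative):

```
q=wiki syntax
&defType=dismax
&qf=title_en^2 doccontent_en title_fr^2 doccontent_fr title_de^2 doccontent_de
```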

> If you want just some language, you have to read the right fields
> (ex title_en) instead of just getting a clear field name (title).

You have to be careful: this really only matters if you want to be 
language-specific, and in that case you probably do not want much stemming 
either.
My experience on curriki.org (which predates dismax) is that any query that is 
at all specific is likely better served without stemming.

> -- Also, the schema.xml definition is a static one in this concern,
> requiring you to know beforehand which languages you want to support (for
> example when defining the default fields to search for). Adding a new
> language requires you to start editing the xml files by hand.

True, but the list of available languages is almost always hand-coded anyway.
Couldn't you generate the schema.xml from the available languages when it is 
not hand-written?
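Generating the per-language part of schema.xml is only a few lines; a minimal 
sketch (assuming a stock text_&lt;lang&gt; field type exists for each selected 
language, and illustrative field naming):

```python
# Sketch: emit the per-language <dynamicField/> declarations for schema.xml
# from the list of languages the admin selected.
def dynamic_fields(languages):
    template = ('<dynamicField name="*_{0}" type="text_{0}" '
                'indexed="true" stored="true"/>')
    return "\n".join(template.format(lang) for lang in languages)

print(dynamic_fields(["en", "fr", "de"]))
```

The same generator could be fed from the wiki's configured languages whenever 
an admin changes them.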


There is one catch with this approach which is new to me but seems quite 
important: the idf should be modified (that is, the Similarity class should 
be), so that the total number of documents used in the computation is the 
number of documents in that language.
See:

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201211.mbox/%[email protected]%3E
The solution sketched there sounds easy, but I have not tried it.
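To see why this matters, a back-of-the-envelope computation with the classic 
idf formula log(N/df) (the corpus numbers are made up): a term occurring in 
1,000 of the 10,000 French documents is common *in French*, but if N is the 
whole 1,000,000-document corpus, the term looks much rarer than it is:

```python
from math import log

def idf(num_docs, doc_freq):
    # classic idf, ignoring the smoothing terms Lucene adds
    return log(num_docs / doc_freq)

# made-up corpus: 1,000,000 docs total, of which 10,000 are French
print(idf(1000000, 1000))  # idf against the whole corpus: inflated
print(idf(10000, 1000))    # idf against French docs only: much lower
```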

> 3) Indexing the content in different Solr cores (indexes), one for each
> language. Each core requires its own directory and configuration files.
> - The advantage is that queries are clean to write (like option 1) and that
> you have a nice separation
> - The disadvantage is that it's difficult to get right (administrative
> issues) and then you also have the (considerable) problem of having to fix
> the relevancy score of a query result that has entries from different
> cores; each core has its own relevancy computed and does not consider the
> others.
> - To make it even worse, it seems that you can not [5] also push the
> configuration files to a remote Solr instance when creating a new core
> programmatically. However, if we are running an embedded Solr instance, we
> could provide a way to generate the config files and write them to the data
> directory.

Post-processing results is very, very dangerous performance-wise (e.g. if one 
core does not answer)... I would avoid it as much as possible.

> Currently I have implemented option 1) in our existing Solr integration,
> which is also more or less compatible with our existing Lucene queries, but
> I would like to find a better solution that actually analyses the content.
> 
> During GSoC, option 2) was preferred but the implementation did not
> consider practical reasons like the ones described above (query complexity,
> user configuration, etc.)

True, Savitha explored the possibility of having different Solr documents per 
language.
I still could not be sure that this would not show a document as matching in 
only a single language.

However, indicating which language a document matched in is probably useful...
Funnily, cross-language retrieval is a mature research field, but retrieval 
for multilingual users is not!

> On a related note, I have also watched an interesting presentation [3]
> about how Drupal handles its Solr integration and, particularly, a plugin
> [4] that handles the multilingual aspect.
> The idea seen there is that you have a UI that helps you generate
> configuration files, depending on your needs. For instance, you (the admin)
> check that you need search for English, French and German, and the
> UI/extension gives you a zip with the configuration you need to use in your
> (remote or embedded) Solr instance. The configuration for each language
> comes preset with the analyzers you should use for it and the additional
> resources (stopwords.txt, synonyms.txt, etc.).
> This approach avoids forcing admins to edit xml files by hand and could
> also still be useful for other cases, not only option 2).

Generating sounds like an easy approach to me.

> All these problems basically come from the fact that there is no way to
> specify in the schema.xml that, based on the value of a field (like the
> field "lang" that stores the document language), you want to run this or
> that group of analyzers.

Well, this is possible with a ThreadLocal, but it is not necessarily a good 
idea.
Also, it is very common that users formulate queries without specifying their 
language, so you need to "or" the user's query across multiple languages 
(e.g. those given by the browser).
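That "or"-ing is cheap with option 2's field layout: expand the browser's 
language list into a dismax qf value. A sketch (field names like title_en / 
doccontent_en are illustrative, as are the weights):

```python
def qf_for_languages(languages):
    # one pair of analyzed fields per language the user understands,
    # with titles weighted higher than body text
    parts = []
    for lang in languages:
        parts.append("title_{0}^2".format(lang))
        parts.append("doccontent_{0}".format(lang))
    return " ".join(parts)

# e.g. languages taken from the browser's Accept-Language header:
print(qf_for_languages(["en", "fr"]))
```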

> Perhaps a solution would be a custom kind of "AggregatorAnalyzer" that
> would call other analyzers at runtime, based on the value of the lang
> field. However, this solution could only be applied at index time, when you
> have the lang information (in the solrDocument to be indexed), but when you
> perform the query, you can not analyze the query text since you do not know
> the language of the field you're querying (it was determined at runtime -
> at index time) and thus do not know what operations to apply to the query
> (to reduce it to the same form as the indexed values).

How would that look at query time?

> I have also read another interesting analysis [6] on this problem that
> elaborates on the complexities and limitations of each option. (Ignore the
> Rosette stuff mentioned there)
> 
> I have been thinking about this for some time now, but the solution is
> probably somewhere in between, finding an option that is acceptable while
> not restrictive. I will probably also send a mail on the Solr list to get
> some more input from there, but I get the feeling that whatever solution we
> choose, it will most likely require the users to at least copy (or even
> edit) some files into some directories (configurations and/or jars), since
> it does not seem to be easy/possible to do everything on-the-fly,
> programmatically.

The only hard step is changing the set of supported languages, I think.
In that case, when automatically regenerating the index, you need to warn the 
user.
The admin UI should have a checkbox "use generated schema" or a textarea for 
the schema.

Those who want particular fields and tunings need to write their own schema.

The same UI could also include whether to add a phonetic track or not (which 
then requires reindexing).

hope it helps.
paul
_______________________________________________
devs mailing list
[email protected]
http://lists.xwiki.org/mailman/listinfo/devs
