Hi Jerome,

Thanks for the reply. Please read below...
On Tue, Nov 27, 2012 at 5:27 PM, Jerome Velociter <jer...@velociter.fr> wrote:

> Hi Edy,
>
> On 11/26/2012 07:25 PM, Eduard Moraru wrote:
>> Hi devs,
>>
>> Any other input on this matter?
>>
>> To summarize a bit, if we go with the multiple fields for each
>> language, we end up with an index like:
>>
>> English version:
>> id: xwiki:Main.SomeDocument_en
>> language: en
>> space: Main
>> title_en: XWiki document
>> doccontent_en: This is some content
>>
>> French version:
>> id: xwiki:Main.SomeDocument_fr
>> language: fr
>> space: Main
>> title_fr: XWiki document
>> doccontent_fr: This is some content
>>
>> The Solr configuration is generated by some XWiki UI that returns a
>> zip that the admin has to unpack in his (remote) Solr instance. This
>> could be automated for the embedded instance.
>
> IMHO this is a "must", not a "could", for embedded instances.
>
>> This operation is to be performed each time an admin changes the
>> indexed languages (rarely or even only once).
>
> There could be a reminder in the admin language UI.
>
>> Querying such a schema is a bit tricky when you are interested in
>> more than one language, because you have to add all the clauses
>> (title_en, title_fr, etc.) specific to the languages you are
>> interested in.
>
> But that's an exotic use case already, I think. The common case is to
> query for the context language only.

Even so, the developer would still have to do something like
"title_${xcontext.language}:something AND
doccontent_${xcontext.language}:somethingElse ...", which is not really
nice.

>> Some extra fields might also be added, like title_ws (for whitespace
>> tokenization only), that take different approaches to the indexing
>> operation, with the aim of improving the relevancy.
>>
>> One solution to simplify the query for API clients would be to use
>> fields like "title" and "doccontent" and to put as values very
>> lightly (or not at all) analyzed content, as Paul suggested. This
>> would allow applications to write simple (and maybe backwards
>> compatible) queries that will still work, but will not catch some of
>> the nuances of specific languages. As far as I've seen until now,
>> applications are not very interested in nuances, but rather in
>> filtering the results, a task for which this solution might be well
>> suited. Of course, nothing stops applications from using the *new*
>> and more expressive fields that are properly analyzed.
>>
>> Thus, the search application will be the major beneficiary of these
>> analyzed fields (title_en, title_fr, etc.), while still allowing
>> applications to get their job done (through generic, but less/not
>> analyzed fields like "title", "doccontent", etc.).
>
> I think for applications these complexity/implementation details would
> benefit from being hidden behind a "query builder" interface of some
> sort, WDYT?

I also thought about something in that direction. The current query
logic uses XWiki's query API, so the query processing part would have to
be done inside the query.execute() method. Something like expanding
multilingual fields:

"title:something AND doccontent:somethingElse AND lang:en"

becomes

"(title_en:something OR title_fr:something OR ...) AND
(doccontent_en:somethingElse OR doccontent_fr:somethingElse OR ...) AND
lang:en"

(based on the currently supported -- or just indexed -- languages)

Maybe we could also allow a parameter that disables this default action
for queries that want to be explicit and don't want this field
expansion.
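To make that expansion step a bit more concrete, here is a minimal
sketch (the class and method names are hypothetical; a real
implementation would have to parse the statement properly instead of
doing string manipulation, so that quoted phrases and escapes survive):

    import java.util.List;

    public final class MultilingualFieldExpander
    {
        // Expands a generic field clause into an OR over the indexed
        // languages, e.g. expand("title", "something", [en, fr]) gives
        // "(title_en:something OR title_fr:something)".
        public static String expand(String field, String value,
            List<String> indexedLanguages)
        {
            StringBuilder expanded = new StringBuilder("(");
            for (int i = 0; i < indexedLanguages.size(); i++) {
                if (i > 0) {
                    expanded.append(" OR ");
                }
                expanded.append(field).append('_')
                    .append(indexedLanguages.get(i))
                    .append(':').append(value);
            }
            return expanded.append(')').toString();
        }
    }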
I wonder if such a simple solution would fix our query issues or if it
will create new ones that we don't yet know about :)

Thanks,
Eduard

> Jerome
>
>> WDYT?
>>
>> Thanks,
>> Eduard
>>
>> On Wed, Nov 21, 2012 at 10:49 PM, Eduard Moraru
>> <enygma2...@gmail.com> wrote:
>>
>>> Hi Paul,
>>>
>>> I was counting on your feedback :)
>>>
>>> On Wed, Nov 21, 2012 at 3:04 PM, Paul Libbrecht <p...@hoplahup.net>
>>> wrote:
>>>
>>>> Hello Eduard,
>>>>
>>>> it's nice to see you take this further.
>>>>
>>>>> This issue has already been previously [1] discussed during the
>>>>> GSoC project, but I am not particularly happy with the chosen
>>>>> approach.
>>>>> When handling multiple languages, there are generally [2][3] 3
>>>>> different approaches:
>>>>>
>>>>> 1) Indexing the content in a single field (like title, doccontent,
>>>>> etc.)
>>>>> - This has the advantage that queries are clear and fast
>>>>> - The disadvantage is that you can not run very well tuned
>>>>> analyzers on the fields, having to resort to (at best) basic
>>>>> tokenization and lowercasing.
>>>>>
>>>>> 2) Indexing the content in multiple fields, one field for each
>>>>> language (like title_en, title_fr, doccontent_en, doccontent_fr,
>>>>> etc.)
>>>>> - This has the advantage that you can easily specify (as dynamic
>>>>> fields) that *_en fields are of type text_en (and analyzed by an
>>>>> english-centered chain of analyzers), *_fr of type text_fr
>>>>> (focused on french), etc., thus making the results much better.
>>>>
>>>> I would add one more field here: title_ws and text_ws, where the
>>>> full text is analyzed just as words (using the
>>>> whitespace-tokenizer?).
>>>> A match there would even be preferred to a match in the below
>>>> text-fields.
>>>>
>>>> (maybe that would be called title and text?)
>>>>
>>>>> - The disadvantage is that querying such a schema is a pain. If
>>>>> you want all the results in all languages, you end up with a big
>>>>> and expensive query.
>>>>
>>>> Why is this an issue?
>>>> Dismax does it for you for free (thanks to the "form" parameter
>>>> that gives weight to each of the fields).
>>>> This is an issue only if you start to have more than 100 languages
>>>> or so...
>>>> Lucene, the underlying engine of Solr, handles thousands of clauses
>>>> in a query without an issue (this is how prefix-queries are
>>>> handled... they are expanded to a query for any of the terms that
>>>> match the prefix; a setting deep somewhere, about 2000, keeps this
>>>> from exploding).
>>>
>>> Sure, Solr is great when you want to do simple queries like "XWiki
>>> Open Source". However, since in XWiki we also expose the Solr/Lucene
>>> query APIs to the platform, there will be (as it is currently with
>>> Lucene) a lot of extensions wanting to do search using this API.
>>> These extensions (like the search suggest for example, rest search,
>>> etc.) want to do something like "title:'Open Source' AND
>>> type:document AND doccontent:XWiki". Because option 2) is so messy
>>> in its fields, it would mean that the extension would have to come
>>> up with a query like "title_en:'Open Source' AND type:document AND
>>> doccontent_en:XWiki" (assuming that it is only limited to the
>>> current -- english or whatever -- language; what happens if it wants
>>> to do that no matter what language? It will have to specify each
>>> possible combination, because we can't use generic field names).
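For reference, the option 2) setup under discussion would look roughly
like this on the Solr side (a sketch only; the field types, handler name
and boosts are illustrative, not an actual proposal):

    <!-- schema.xml: dynamic fields mapping *_en / *_fr to
         language-specific analyzer chains (text_en and text_fr as in
         the stock Solr example schema) -->
    <dynamicField name="*_en" type="text_en" indexed="true" stored="true"/>
    <dynamicField name="*_fr" type="text_fr" indexed="true" stored="true"/>

    <!-- solrconfig.xml: a dismax-style handler weighting the
         per-language fields for simple end-user queries -->
    <requestHandler name="/search" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="defType">edismax</str>
        <str name="qf">title_en^4 title_fr^4 doccontent_en^2 doccontent_fr^2</str>
      </lst>
    </requestHandler>

Such a handler makes the simple search box work well, but an extension
issuing raw queries through the API still has to spell out the
per-language fields itself, which is exactly the problem described
above.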
>>> Solr's approach works for using it in your web application's search
>>> input, in a specific use case, where you have precisely specified
>>> the default search fields and their boosts inside your schema.xml.
>>> However, as a search API, using option 2) you are making the life of
>>> anyone else wanting to use the Solr search API really hard. Also,
>>> your search application will work nicely when the user enters a
>>> simple query in the input field, but an advanced user will suffer
>>> the same fate when trying to write an advanced query, thus not
>>> relying on the default query (computed by Solr based on schema.xml).
>>>
>>> Also, based on your note above regarding improvements like title_ws
>>> and such, again, all of these are very well suited for the search
>>> application use case, together with the default query that you
>>> configure in schema.xml, making the search results perform really
>>> well. However, what do all these fields mean to another extension
>>> wanting to do search? Will it have to handle all these
>>> implementation details to query for title, content and such? I'm not
>>> sure how well this would work in practice.
>>>
>>> Unrealistic idea(?): perhaps we should come up with an abstract
>>> search language (Solr/Lucene clone) that parses the searched fields
>>> and hides the complexities of all the indexed fields, allowing to
>>> write simple queries like "title:XWiki", while this gets translated
>>> to "title_en:XWiki OR title_fr:XWiki OR title_de:XWiki..." :)
>>>
>>> Am I approaching this wrong by trying to have both a
>>> tweakable/tweaked search application AND a search API? Are the two
>>> not compatible? Do we have to sacrifice search result performance
>>> (no language-specific stuff) to be able to have a usable API?
>>>
>>>>> If you want just some language, you have to read the right fields
>>>>> (ex title_en) instead of just getting a clear field name (title).
>>>>
>>>> You have to be careful, this is really only if you want to be
>>>> specific. In this case, it is likely that you also do not want so
>>>> much stemming.
>>>> My experience, which was before dismax on curriki.org, has made it
>>>> so that any query that is a bit specific is likely to not desire
>>>> stemming.
>>>
>>> Can you please elaborate on this? I'm not sure I understand the
>>> problem.
>>>
>>>>> -- Also, the schema.xml definition is a static one in this
>>>>> concern, requiring you to know beforehand which languages you want
>>>>> to support (for example when defining the default fields to search
>>>>> for). Adding a new language requires you to start editing the xml
>>>>> files by hand.
>>>>
>>>> True, but the available languages are almost all hand-coded.
>>>> You could generate the schema.xml based on the available languages,
>>>> if not hand-generated?
>>>
>>> Basically I would have to output a zip with schema.xml,
>>> solrconfig.xml and then all the resources (stopwords, synonyms,
>>> etc.) specific to all the selected languages, for the languages that
>>> we can provide out of the box. For other languages, the admin would
>>> have to get dirty with the xmls.
>>>
>>>> There's one catch with this approach, which is new to me but seems
>>>> quite important for implementing it: the idf should be modified
>>>> (i.e. the Similarity class), so that the total number of documents
>>>> is the total number of documents having that language.
>>>> See:
>>>> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201211.mbox/%3Czarafa.509ccb61.698a.1d02345614818807@mail.openindex.io%3E
>>>> The solution sketched there sounds easy but I have not tried it.
>>>>
>>>>> 3) Indexing the content in different Solr cores (indexes), one for
>>>>> each language. Each core requires its own directory and
>>>>> configuration files.
>>>>> - The advantage is that queries are clean to write (like option 1)
>>>>> and that you have a nice separation
>>>>> - The disadvantage is that it's difficult to get it right
>>>>> (administrative issues) and then you also have the (considerable)
>>>>> problem of having to fix the relevancy score of a query result
>>>>> that has entries from different cores; each core has its own
>>>>> relevancy computed and does not consider the others.
>>>>> - To make it even worse, it seems that you can not [5] also push
>>>>> to a remote Solr instance the configuration files when creating a
>>>>> new core programmatically. However, if we are running an embedded
>>>>> Solr instance, we could provide a way to generate the config files
>>>>> and write them to the data directory.
>>>>
>>>> Post-processing results is very very very dangerous as performance
>>>> is at risk (e.g. if a core does not answer)... I would tend to
>>>> avoid that as much as possible.
>>>
>>> Not really related, but this reminds me about the post-processing
>>> that I do for checking view rights over the returned results, but
>>> that's another discussion that we will probably need to have :)
>>>
>>>>> Currently I have implemented option 1) in our existing Solr
>>>>> integration, which is also more or less compatible with our
>>>>> existing Lucene queries, but I would like to find a better
>>>>> solution that actually analyzes the content.
>>>>> During GSoC, option 2) was preferred, but the implementation did
>>>>> not consider practical reasons like the ones described above
>>>>> (query complexity, user configuration, etc.)
>>>>
>>>> True, Savitha explored the possibility of having different Solr
>>>> documents per language.
>>>> I still could not be sure that this was not showing the document
>>>> match in a single language only.
>>>>
>>>> However, indicating which language it is matched into is probably
>>>> useful...
>>>
>>> Already doing that.
>>>
>>>> Funnily, cross-language retrieval is a mature research field, but
>>>> retrieval for multilanguage users is not so!
>>>>
>>>>> On a related note, I have also watched an interesting presentation
>>>>> [3] about how Drupal handles its Solr integration and,
>>>>> particularly, a plugin [4] that handles the multilingual aspect.
>>>>> The idea seen there is that you have this UI that helps you
>>>>> generate configuration files, depending on your needs. For
>>>>> instance, you (the admin) check that you need search for English,
>>>>> French and German, and the ui/extension gives you a zip with the
>>>>> configuration you need to use in your (remote or embedded) Solr
>>>>> instance. The configuration for each language comes preset with
>>>>> the analyzers you should use for it and the additional resources
>>>>> (stopwords.txt, synonyms.txt, etc.).
>>>>> This approach helps avoid forcing admins to edit xml files by
>>>>> hand and could also still be useful for other cases, not only
>>>>> option 2).
>>>>
>>>> Generating sounds like an easy approach to me.
>>>
>>> Yes, however I don't like the fact that we can not do everything
>>> from the webapp and the admin needs to access the filesystem to
>>> install the given configuration on the embedded/remote Solr
>>> directory. Lucene does not have this problem now. It just works with
>>> XWiki and everything is done from the XWiki UI. I feel that losing
>>> this convenience will not be very well received by users that now
>>> have some new install steps to get XWiki running.
>>>
>>> Well, of course, for the embedded Solr version, we could handle it
>>> like we do now and push the files directly from the webapp to the
>>> filesystem. Since embedded will be the default, it should be OK and
>>> avoid the extra install step. Users with a remote Solr machine
>>> should have the option to get the zip instead.
>>>
>>> Not sure if we can apply the new configuration without a restart,
>>> but I'll have to look more into it. I know the multi-core
>>> architecture supports something like this, but I will have to see
>>> the details.
>>>
>>>>> All these problems basically come from the fact that there is no
>>>>> way to specify in the schema.xml that, based on the value of a
>>>>> field (like the field "lang" that stores the document language),
>>>>> you want to run this or that group of analyzers.
>>>>
>>>> Well, this is possible with ThreadLocal but is not necessarily a
>>>> good idea.
>>>> Also, it is very common that users formulate queries without
>>>> indicating their language, and thus you need to "or" the user's
>>>> queries through multiple languages (e.g. given by the browser).
>>>>
>>>>> Perhaps a solution would be a custom kind of "AggregatorAnalyzer"
>>>>> that would call other analyzers at runtime, based on the value of
>>>>> the lang field. However, this solution could only be applied at
>>>>> index time, when you have the lang information (in the
>>>>> solrDocument to be indexed), but when you perform the query, you
>>>>> can not analyze the query text, since you do not know the language
>>>>> of the field you're querying (it was determined at runtime -- at
>>>>> index time) and thus do not know what operations to apply to the
>>>>> query (to reduce it to the same form as the indexed values).
>>>>
>>>> How would that look at query time?
>>>
>>> That's what I was saying: at query time, the searched term will not
>>> get analyzed by the right chain. When you search for a single
>>> language, you could add that language as a query filter and then you
>>> could apply the right chain, but when searching in 2 or more (or
>>> none, meaning all) languages you are stuck.
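For the index-time half, the "AggregatorAnalyzer" idea could be sketched
on top of Lucene 4's AnalyzerWrapper, using the ThreadLocal trick Paul
mentioned (only the Lucene class names are real; everything else is
hypothetical and untested):

    import java.util.Map;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.AnalyzerWrapper;

    // Delegates to a language-specific analyzer chain, chosen from a
    // value that the indexing code publishes in a ThreadLocal just
    // before feeding each document.
    public class AggregatorAnalyzer extends AnalyzerWrapper
    {
        // Set by the indexer, e.g. from the "lang" field of the
        // document about to be indexed.
        public static final ThreadLocal<String> CURRENT_LANGUAGE =
            new ThreadLocal<String>();

        private final Map<String, Analyzer> analyzersByLanguage;

        private final Analyzer fallback;

        public AggregatorAnalyzer(Map<String, Analyzer> analyzersByLanguage,
            Analyzer fallback)
        {
            this.analyzersByLanguage = analyzersByLanguage;
            this.fallback = fallback;
        }

        @Override
        protected Analyzer getWrappedAnalyzer(String fieldName)
        {
            Analyzer chosen = analyzersByLanguage.get(CURRENT_LANGUAGE.get());
            return chosen != null ? chosen : fallback;
        }

        @Override
        protected TokenStreamComponents wrapComponents(String fieldName,
            TokenStreamComponents components)
        {
            // Use the delegate's whole chain as-is.
            return components;
        }
    }

This also makes the limitation visible: the ThreadLocal hint exists only
while indexing; nothing equivalent is available when parsing a query
that may span several languages.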
>>>>> I have also read another interesting analysis [6] on this problem
>>>>> that elaborates on the complexities and limitations of each
>>>>> option. (Ignore the Rosette stuff mentioned there)
>>>>>
>>>>> I have been thinking about this for some time now, but the
>>>>> solution is probably somewhere in between, finding an option that
>>>>> is acceptable while not restrictive. I will probably also send a
>>>>> mail to the Solr list to get some more input from there, but I get
>>>>> the feeling that whatever solution we choose, it will most likely
>>>>> require the users to at least copy (or even edit) some files into
>>>>> some directories (configurations and/or jars), since it does not
>>>>> seem to be easy/possible to do everything on-the-fly,
>>>>> programmatically.
>>>>
>>>> The only hard step is when changing the supported languages, I
>>>> think. In this case, when automatically generating the index, you
>>>> need to warn the user.
>>>> The admin UI should have a checkbox "use generated schema" or a
>>>> textarea for the schema.
>>>
>>> Please see above regarding configuration generation. Basically,
>>> since we are going to support both embedded and remote Solr
>>> instances, we could support things like editing the schema from
>>> XWiki only for the embedded instance, but not for the remote one. We
>>> might end up having separate UIs for each case, since we might want
>>> to exploit the flexibility of the embedded one as much as possible.
>>>
>>>> Those that want particular fields and tunings need to write their
>>>> own schema.
>>>>
>>>> The same UI could also include whether to include a phonetic track
>>>> or not (then require reindexing).
>>>>
>>>> hope it helps.
>>>
>>> Yes, very helpful so far. I'm counting on your expertise with
>>> Lucene/Solr on the details. My current approach is a practical one,
>>> without previous experience on the topic, so I'm still doing mostly
>>> guesswork in some areas.
>>>
>>> Thanks,
>>> Eduard
>>>
>>>> paul

_______________________________________________
devs mailing list
devs@xwiki.org
http://lists.xwiki.org/mailman/listinfo/devs