Hi Jerome,

Thanks for the reply. Please read below...
On Tue, Nov 27, 2012 at 5:27 PM, Jerome Velociter <jer...@velociter.fr> wrote:

> Hi Edy,
>
> On 11/26/2012 07:25 PM, Eduard Moraru wrote:
>> Hi devs,
>>
>> Any other input on this matter?
>>
>> To summarize a bit, if we go with the multiple fields for each
>> language, we end up with an index like:
>>
>> English version:
>> id: xwiki:Main.SomeDocument_en
>> language: en
>> space: Main
>> title_en: XWiki document
>> doccontent_en: This is some content
>>
>> French version:
>> id: xwiki:Main.SomeDocument_fr
>> language: fr
>> space: Main
>> title_fr: XWiki document
>> doccontent_fr: This is some content
>>
>> The Solr configuration is generated by some XWiki UI that returns a
>> zip that the admin has to unpack in his (remote) Solr instance. This
>> could be automated for the embedded instance.
>
> IMHO this is a "must", not a "could", for embedded instances.
>
>> This operation is to be performed each time an admin changes the
>> indexed languages (rarely or even only once).
>
> There could be a reminder in the admin language UI.
>
>> Querying such a schema is a bit tricky when you are interested in
>> more than one language, because you have to add all the clauses
>> (title_en, title_fr, etc.) specific to the languages you are
>> interested in.
>
> But that's an exotic use case already, I think. The common case is to
> query for the context language only.

Even so, the developer would still have to do something like
"title_${xcontext.language}:something AND
doccontent_${xcontext.language}:somethingElse ...", which is not really
nice.

>> Some extra fields might also be added, like title_ws (for whitespace
>> tokenization only), that take different approaches to the indexing
>> operation, with the aim of improving the relevancy.
>>
>> One solution to simplify the query for API clients would be to use
>> fields like "title" and "doccontent" and to put as values very
>> lightly (or not at all) analyzed content, as Paul suggested. This
>> would allow applications to write simple (and maybe backwards
>> compatible) queries that will still work, but will not catch some of
>> the nuances of specific languages. As far as I've seen until now,
>> applications are not very interested in nuances, but rather in
>> filtering the results, a task for which this solution might be well
>> suited. Of course, nothing stops applications from using the *new*
>> and more expressive fields that are properly analyzed.
>>
>> Thus, the search application will be the major beneficiary of these
>> analyzed fields (title_en, title_fr, etc.), while still allowing
>> applications to get their job done (through generic, but less/not
>> analyzed fields like "title", "doccontent", etc.).
>
> I think for applications these complexity/implementation details would
> benefit from being hidden behind a "query builder" interface of some
> sort, WDYT?

I also thought about something in that direction. The current query
logic uses XWiki's query API, so the query processing part would have to
be done inside the query.execute() method. Something like expanding
multilingual fields:

"title:something AND doccontent:somethingElse AND lang:en"

becomes

"(title_en:something OR title_fr:something OR ...) AND
(doccontent_en:somethingElse OR doccontent_fr:somethingElse OR ...) AND
lang:en"

(based on the currently supported -- or just indexed -- languages)

Maybe we could also allow a parameter that disables this default action
for queries that want to be explicit and don't want this field
expansion.
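To make that expansion step a bit more concrete, here is a minimal
sketch (the class and method names are hypothetical; a real
implementation would have to parse the statement properly instead of
doing string manipulation, so that quoted phrases and escapes survive):

    import java.util.List;

    public final class MultilingualFieldExpander
    {
        // Expands a generic field clause into an OR over the indexed
        // languages, e.g. expand("title", "something", [en, fr]) gives
        // "(title_en:something OR title_fr:something)".
        public static String expand(String field, String value,
            List<String> indexedLanguages)
        {
            StringBuilder expanded = new StringBuilder("(");
            for (int i = 0; i < indexedLanguages.size(); i++) {
                if (i > 0) {
                    expanded.append(" OR ");
                }
                expanded.append(field).append('_')
                    .append(indexedLanguages.get(i))
                    .append(':').append(value);
            }
            return expanded.append(')').toString();
        }
    }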
I wonder if such a simple solution would fix our query issues or if it
will create new ones that we don't yet know about :)

Thanks,
Eduard

> Jerome
>
>> WDYT?
>>
>> Thanks,
>> Eduard
>>
>> On Wed, Nov 21, 2012 at 10:49 PM, Eduard Moraru
>> <enygma2...@gmail.com> wrote:
>>
>>> Hi Paul,
>>>
>>> I was counting on your feedback :)
>>>
>>> On Wed, Nov 21, 2012 at 3:04 PM, Paul Libbrecht <p...@hoplahup.net>
>>> wrote:
>>>
>>>> Hello Eduard,
>>>>
>>>> it's nice to see you take this further.
>>>>
>>>>> This issue has already been previously [1] discussed during the
>>>>> GSoC project, but I am not particularly happy with the chosen
>>>>> approach.
>>>>> When handling multiple languages, there are generally [2][3] 3
>>>>> different approaches:
>>>>>
>>>>> 1) Indexing the content in a single field (like title, doccontent,
>>>>> etc.)
>>>>> - This has the advantage that queries are clear and fast
>>>>> - The disadvantage is that you can not run very well tuned
>>>>> analyzers on the fields, having to resort to (at best) basic
>>>>> tokenization and lowercasing.
>>>>>
>>>>> 2) Indexing the content in multiple fields, one field for each
>>>>> language (like title_en, title_fr, doccontent_en, doccontent_fr,
>>>>> etc.)
>>>>> - This has the advantage that you can easily specify (as dynamic
>>>>> fields) that *_en fields are of type text_en (and analyzed by an
>>>>> english-centered chain of analyzers), *_fr of type text_fr
>>>>> (focused on french), etc., thus making the results much better.
>>>>
>>>> I would add one more field here: title_ws and text_ws, where the
>>>> full text is analyzed just as words (using the
>>>> whitespace-tokenizer?).
>>>> A match there would even be preferred to a match in the below
>>>> text-fields.
>>>>
>>>> (maybe that would be called title and text?)
>>>>
>>>>> - The disadvantage is that querying such a schema is a pain. If
>>>>> you want all the results in all languages, you end up with a big
>>>>> and expensive query.
>>>>
>>>> Why is this an issue?
>>>> Dismax does it for you for free (thanks to the "form" parameter
>>>> that gives weight to each of the fields).
>>>> This is an issue only if you start to have more than 100 languages
>>>> or so...
>>>> Lucene, the underlying engine of Solr, handles thousands of clauses
>>>> in a query without an issue (this is how prefix-queries are
>>>> handled... they are expanded to a query for any of the terms that
>>>> match the prefix; a setting deep somewhere, about 2000, keeps this
>>>> from exploding).
>>>
>>> Sure, Solr is great when you want to do simple queries like "XWiki
>>> Open Source". However, since in XWiki we also expose the Solr/Lucene
>>> query APIs to the platform, there will be (as it is currently with
>>> Lucene) a lot of extensions wanting to do search using this API.
>>> These extensions (like the search suggest for example, rest search,
>>> etc.) want to do something like "title:'Open Source' AND
>>> type:document AND doccontent:XWiki". Because option 2) is so messy
>>> in its fields, it would mean that the extension would have to come
>>> up with a query like "title_en:'Open Source' AND type:document AND
>>> doccontent_en:XWiki" (assuming that it is only limited to the
>>> current -- english or whatever -- language; what happens if it wants
>>> to do that no matter what language? It will have to specify each
>>> possible combination, because we can't use generic field names).
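For reference, the option 2) setup under discussion would look roughly
like this on the Solr side (a sketch only; the field types, handler name
and boosts are illustrative, not an actual proposal):

    <!-- schema.xml: dynamic fields mapping *_en / *_fr to
         language-specific analyzer chains (text_en and text_fr as in
         the stock Solr example schema) -->
    <dynamicField name="*_en" type="text_en" indexed="true" stored="true"/>
    <dynamicField name="*_fr" type="text_fr" indexed="true" stored="true"/>

    <!-- solrconfig.xml: a dismax-style handler weighting the
         per-language fields for simple end-user queries -->
    <requestHandler name="/search" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="defType">edismax</str>
        <str name="qf">title_en^4 title_fr^4 doccontent_en^2 doccontent_fr^2</str>
      </lst>
    </requestHandler>

Such a handler makes the simple search box work well, but an extension
issuing raw queries through the API still has to spell out the
per-language fields itself, which is exactly the problem described
above.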
>>> Solr's approach works for using it in your web application's search
>>> input, in a specific use case, where you have precisely specified
>>> the default search fields and their boosts inside your schema.xml.
>>> However, as a search API, using option 2) you are making the life of
>>> anyone else wanting to use the Solr search API really hard. Also,
>>> your search application will work nicely when the user enters a
>>> simple query in the input field, but an advanced user will suffer
>>> the same fate when trying to write an advanced query, thus not
>>> relying on the default query (computed by Solr based on schema.xml).
>>>
>>> Also, based on your note above regarding improvements like title_ws
>>> and such, again, all of these are very well suited for the search
>>> application use case, together with the default query that you
>>> configure in schema.xml, making the search results perform really
>>> well. However, what do all these fields mean to another extension
>>> wanting to do search? Will it have to handle all these
>>> implementation details to query for title, content and such? I'm not
>>> sure how well this would work in practice.
>>>
>>> Unrealistic idea(?): perhaps we should come up with an abstract
>>> search language (Solr/Lucene clone) that parses the searched fields
>>> and hides the complexities of all the indexed fields, allowing to
>>> write simple queries like "title:XWiki", while this gets translated
>>> to "title_en:XWiki OR title_fr:XWiki OR title_de:XWiki..." :)
>>>
>>> Am I approaching this wrong by trying to have both a
>>> tweakable/tweaked search application AND a search API? Are the two
>>> not compatible? Do we have to sacrifice search result performance
>>> (no language-specific stuff) to be able to have a usable API?
>>>
>>>>> If you want just some language, you have to read the right fields
>>>>> (ex title_en) instead of just getting a clear field name (title).
>>>>
>>>> You have to be careful, this is really only if you want to be
>>>> specific. In this case, it is likely that you also do not want so
>>>> much stemming.
>>>> My experience, which was before dismax on curriki.org, has made it
>>>> so that any query that is a bit specific is likely to not desire
>>>> stemming.
>>>
>>> Can you please elaborate on this? I'm not sure I understand the
>>> problem.
>>>
>>>>> -- Also, the schema.xml definition is a static one in this
>>>>> concern, requiring you to know beforehand which languages you want
>>>>> to support (for example when defining the default fields to search
>>>>> for). Adding a new language requires you to start editing the xml
>>>>> files by hand.
>>>>
>>>> True, but the available languages are almost all hand-coded.
>>>> You could generate the schema.xml based on the available languages,
>>>> if not hand-generated?
>>>
>>> Basically I would have to output a zip with schema.xml,
>>> solrconfig.xml and then all the resources (stopwords, synonyms,
>>> etc.) specific to all the selected languages, for the languages that
>>> we can provide out of the box. For other languages, the admin would
>>> have to get dirty with the xmls.
>>>
>>>> There's one catch with this approach, which is new to me but seems
>>>> quite important for implementing it: the idf should be modified
>>>> (i.e. the Similarity class), so that the total number of documents
>>>> is the total number of documents having that language.
>>>> See:
>>>> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201211.mbox/%3Czarafa.509ccb61.698a.1d02345614818807@mail.openindex.io%3E
>>>> The solution sketched there sounds easy but I have not tried it.
>>>>
>>>>> 3) Indexing the content in different Solr cores (indexes), one for
>>>>> each language. Each core requires its own directory and
>>>>> configuration files.
>>>>> - The advantage is that queries are clean to write (like option 1)
>>>>> and that you have a nice separation
>>>>> - The disadvantage is that it's difficult to get it right
>>>>> (administrative issues) and then you also have the (considerable)
>>>>> problem of having to fix the relevancy score of a query result
>>>>> that has entries from different cores; each core has its own
>>>>> relevancy computed and does not consider the others.
>>>>> - To make it even worse, it seems that you can not [5] also push
>>>>> to a remote Solr instance the configuration files when creating a
>>>>> new core programmatically. However, if we are running an embedded
>>>>> Solr instance, we could provide a way to generate the config files
>>>>> and write them to the data directory.
>>>>
>>>> Post-processing results is very very very dangerous as performance
>>>> is at risk (e.g. if a core does not answer)... I would tend to
>>>> avoid that as much as possible.
>>>
>>> Not really related, but this reminds me about the post-processing
>>> that I do for checking view rights over the returned results, but
>>> that's another discussion that we will probably need to have :)
>>>
>>>>> Currently I have implemented option 1) in our existing Solr
>>>>> integration, which is also more or less compatible with our
>>>>> existing Lucene queries, but I would like to find a better
>>>>> solution that actually analyzes the content.
>>>>> During GSoC, option 2) was preferred, but the implementation did
>>>>> not consider practical reasons like the ones described above
>>>>> (query complexity, user configuration, etc.)
>>>>
>>>> True, Savitha explored the possibility of having different Solr
>>>> documents per language.
>>>> I still could not be sure that this was not showing the document
>>>> match in a single language only.
>>>>
>>>> However, indicating which language it is matched into is probably
>>>> useful...
>>>
>>> Already doing that.
>>>
>>>> Funnily, cross-language retrieval is a mature research field, but
>>>> retrieval for multilanguage users is not so!
>>>>
>>>>> On a related note, I have also watched an interesting presentation
>>>>> [3] about how Drupal handles its Solr integration and,
>>>>> particularly, a plugin [4] that handles the multilingual aspect.
>>>>> The idea seen there is that you have this UI that helps you
>>>>> generate configuration files, depending on your needs. For
>>>>> instance, you (the admin) check that you need search for English,
>>>>> French and German, and the ui/extension gives you a zip with the
>>>>> configuration you need to use in your (remote or embedded) Solr
>>>>> instance. The configuration for each language comes preset with
>>>>> the analyzers you should use for it and the additional resources
>>>>> (stopwords.txt, synonyms.txt, etc.).
>>>>> This approach helps avoid forcing admins to edit xml files by
>>>>> hand and could also still be useful for other cases, not only
>>>>> option 2).
>>>>
>>>> Generating sounds like an easy approach to me.
>>>
>>> Yes, however I don't like the fact that we can not do everything
>>> from the webapp and the admin needs to access the filesystem to
>>> install the given configuration on the embedded/remote Solr
>>> directory. Lucene does not have this problem now. It just works with
>>> XWiki and everything is done from the XWiki UI. I feel that losing
>>> this convenience will not be very well received by users that now
>>> have some new install steps to get XWiki running.
>>>
>>> Well, of course, for the embedded Solr version, we could handle it
>>> like we do now and push the files directly from the webapp to the
>>> filesystem. Since embedded will be the default, it should be OK and
>>> avoid the extra install step. Users with a remote Solr machine
>>> should have the option to get the zip instead.
>>>
>>> Not sure if we can apply the new configuration without a restart,
>>> but I'll have to look more into it. I know the multi-core
>>> architecture supports something like this, but I will have to see
>>> the details.
>>>
>>>>> All these problems basically come from the fact that there is no
>>>>> way to specify in the schema.xml that, based on the value of a
>>>>> field (like the field "lang" that stores the document language),
>>>>> you want to run this or that group of analyzers.
>>>>
>>>> Well, this is possible with ThreadLocal but is not necessarily a
>>>> good idea.
>>>> Also, it is very common that users formulate queries without
>>>> indicating their language, and thus you need to "or" the user's
>>>> queries through multiple languages (e.g. given by the browser).
>>>>
>>>>> Perhaps a solution would be a custom kind of "AggregatorAnalyzer"
>>>>> that would call other analyzers at runtime, based on the value of
>>>>> the lang field. However, this solution could only be applied at
>>>>> index time, when you have the lang information (in the
>>>>> solrDocument to be indexed), but when you perform the query, you
>>>>> can not analyze the query text, since you do not know the language
>>>>> of the field you're querying (it was determined at runtime -- at
>>>>> index time) and thus do not know what operations to apply to the
>>>>> query (to reduce it to the same form as the indexed values).
>>>>
>>>> How would that look at query time?
>>>
>>> That's what I was saying: at query time, the searched term will not
>>> get analyzed by the right chain. When you search for a single
>>> language, you could add that language as a query filter and then you
>>> could apply the right chain, but when searching in 2 or more (or
>>> none, meaning all) languages you are stuck.
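For the index-time half, the "AggregatorAnalyzer" idea could be sketched
on top of Lucene 4's AnalyzerWrapper, using the ThreadLocal trick Paul
mentioned (only the Lucene class names are real; everything else is
hypothetical and untested):

    import java.util.Map;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.AnalyzerWrapper;

    // Delegates to a language-specific analyzer chain, chosen from a
    // value that the indexing code publishes in a ThreadLocal just
    // before feeding each document.
    public class AggregatorAnalyzer extends AnalyzerWrapper
    {
        // Set by the indexer, e.g. from the "lang" field of the
        // document about to be indexed.
        public static final ThreadLocal<String> CURRENT_LANGUAGE =
            new ThreadLocal<String>();

        private final Map<String, Analyzer> analyzersByLanguage;

        private final Analyzer fallback;

        public AggregatorAnalyzer(Map<String, Analyzer> analyzersByLanguage,
            Analyzer fallback)
        {
            this.analyzersByLanguage = analyzersByLanguage;
            this.fallback = fallback;
        }

        @Override
        protected Analyzer getWrappedAnalyzer(String fieldName)
        {
            Analyzer chosen = analyzersByLanguage.get(CURRENT_LANGUAGE.get());
            return chosen != null ? chosen : fallback;
        }

        @Override
        protected TokenStreamComponents wrapComponents(String fieldName,
            TokenStreamComponents components)
        {
            // Use the delegate's whole chain as-is.
            return components;
        }
    }

This also makes the limitation visible: the ThreadLocal hint exists only
while indexing; nothing equivalent is available when parsing a query
that may span several languages.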
>>>>> I have also read another interesting analysis [6] on this problem
>>>>> that elaborates on the complexities and limitations of each
>>>>> option. (Ignore the Rosette stuff mentioned there)
>>>>>
>>>>> I have been thinking about this for some time now, but the
>>>>> solution is probably somewhere in between, finding an option that
>>>>> is acceptable while not restrictive. I will probably also send a
>>>>> mail to the Solr list to get some more input from there, but I get
>>>>> the feeling that whatever solution we choose, it will most likely
>>>>> require the users to at least copy (or even edit) some files into
>>>>> some directories (configurations and/or jars), since it does not
>>>>> seem to be easy/possible to do everything on-the-fly,
>>>>> programmatically.
>>>>
>>>> The only hard step is when changing the supported languages, I
>>>> think. In this case, when automatically generating the index, you
>>>> need to warn the user.
>>>> The admin UI should have a checkbox "use generated schema" or a
>>>> textarea for the schema.
>>>
>>> Please see above regarding configuration generation. Basically,
>>> since we are going to support both embedded and remote Solr
>>> instances, we could support things like editing the schema from
>>> XWiki only for the embedded instance, but not for the remote one. We
>>> might end up having separate UIs for each case, since we might want
>>> to exploit the flexibility of the embedded one as much as possible.
>>>
>>>> Those that want particular fields and tunings need to write their
>>>> own schema.
>>>>
>>>> The same UI could also include whether to include a phonetic track
>>>> or not (then require reindexing).
>>>>
>>>> hope it helps.
>>>
>>> Yes, very helpful so far. I'm counting on your expertise with
>>> Lucene/Solr on the details. My current approach is a practical one,
>>> without previous experience on the topic, so I'm still doing mostly
>>> guesswork in some areas.
>>>
>>> Thanks,
>>> Eduard
>>>
>>>> paul

_______________________________________________
devs mailing list
devs@xwiki.org
http://lists.xwiki.org/mailman/listinfo/devs