On Thu, Nov 29, 2012 at 3:47 PM, Eduard Moraru <[email protected]> wrote:
> Hi,
>
> On Wed, Nov 28, 2012 at 5:35 PM, Ludovic Dubost <[email protected]> wrote:
>
> > 2012/11/28 Eduard Moraru <[email protected]>
> >
> > > Hi Ludovic,
> > >
> > > Thanks for the reply. Please read below...
> > >
> > > On Tue, Nov 27, 2012 at 5:44 PM, Ludovic Dubost <[email protected]> wrote:
> > >
> > > > Hi Edy,
> > > >
> > > > I'm not a huge fan of the title_fr, title_en, title_morelanguages approach, as indeed it seems to be quite complex at the query level. I was more leaning towards multiple indexes, if we can query them globally, but I understand this is complex too.
> > > >
> > > > Now let's see the use cases that are hugely important:
> > > >
> > > > 1/ Make sure that if you decide your wiki is monolingual:
> > > > - the indexing uses the specific language analyzer
> > > > - make sure the query uses the specific language analyzer
> > > > - make sure the search looks in all content even if the language setting of the document is wrongly set (consider all documents as being in the specific language)
> > >
> > > You mean that, if the wiki is monolingual, we should ignore the language filter and hardcode it to "All languages", right?
> > >
> > > However, what would be the advantage of this? Why would we want to pollute the results with irrelevant documents (caused by a probable recent configuration change that went from multi-lingual to mono-lingual)? Wasn't that the whole reason why the admin switched to mono-lingual?
> >
> > If the wiki is monolingual, we can safely ignore the language setting of each document and only the "main" document will be shown to the user anyway. So we should make sure we search in ALL documents that are available to the user.
> >
> > The user might have to "reindex" to make sure this is properly taken into account by the search engine.
> >
> > What is important here is that, even if the wiki is set to "fr", documents that have "en" as their main language will still show up in the search. The opposite would be bad.
>
> Why/How can you end up with documents of other languages than the current default one if the wiki is mono-lingual?

When importing a XAR for example.

> > > > 2/ Allow, if a wiki is multi-lingual:
> > > > - search in the language you decide (maybe the UI should display a language choice for the query)
> > >
> > > We already support this, by using the "Filtered Search" option and selecting the language.
> >
> > Sure. I was just repeating what is important.
> >
> > > > - search in content that is analyzed in the proper language when the content is declared in this language
> > > > - allow to specify if you want to restrict your search to documents declared in the language of your query, versus search more widely in all documents across languages. If you search in only the language of the query, only one document can show up, but it should point to the right translation that matches; if you search in multiple languages then you can show individual translations.
> > >
> > > I think this one is the same as the first bullet above.
> >
> > No, I think it's different. The user can decide to make his search in "French" but look in the "French" + "English" dataset. For me, if both documents with the same page name match, both should come out separately.
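> >
> > (A sketch of what such a request could look like, assuming the per-language fields and the "lang" field discussed in this thread; field names and terms are only illustrative:
> >
> > q=title_fr:revolution OR title_en:revolution
> > fq=lang:(fr OR en)
> >
> > The query terms get analyzed with each language's chain through its field, while the filter keeps both datasets in.)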
> It's the same in the sense that the UI allows you to specify which language to search in, but it also allows you to specify that you want to search in "All languages"; that is what I meant.
>
> > > > - allow technical users to search for all documents across all languages (where the language analysis does not really matter)
> > >
> > > Do you mean as an API?
> >
> > Not specifically as an API.
> >
> > > What exactly do you mean by "language analysis does not really matter"? Any example?
> >
> > I mean here that, as a technical user, your objective is to make sure your search spans ALL content in the wiki. In this case you don't care about stemming and such.
> > The standard UI could take this into account by choosing "any language search" with the data set "all languages". We just need to make sure that this won't exclude any content from the search.
> >
> > (Here is an example of a case where such exclusion might occur. Suppose you have done only "French" and "English" indexing and there is "German" content also in your wiki, but since you have not asked for German search in your wiki, you don't have title_de and content_de fields, or don't have a German-specific Solr index (in the other method); then your German content would not be indexed at all?)
>
> I understand now what you mean. Well, to avoid this, we can use only one notion and that is the one of "supported languages". All the supported languages should be indexed in order for the Search feature to work as expected (and not have the case where you have a document that is not indexed because of its language). I am not sure there are many examples where admins would choose to index just some of the supported languages and not all of them.
>
> WDYT?
>
> > > > From an admin point of view it makes good sense for the admin to be able to specify, in a multilingual wiki, which language analysis should be activated, and then have this transmitted to SOLR to properly configure the engine. Reindexing is ok when changing the configuration.
> > > >
> > > > I believe that in the end, whether you use multiple fields with _fr/_en or multiple SOLR cores, as long as you can query across SOLR cores, it is a bit the same. If you cannot run a query merging multiple indexes, then the first solution is kind of absolutely necessary, as it would be the only one allowing to search across all languages.
> > > >
> > > > Maybe a solution would be to create one index per language and index ALL content, regardless of its language, using the language analyzer of that index. This would allow to have better results even when users have badly tagged the language of a document, and it's only the job of the UI to limit the search to only the language of the query, or all documents.
> > > >
> > > > So you could have a configuration in the admin that says:
> > > >
> > > > 1/ Create an English index
> > > > 2/ Create an additional French index
> > > >
> > > > The UI would allow to search in English and French, + would add a language restriction for the documents.
> > > Applying the language-specific analyzers (for Chinese, for example) to all the documents will just create a mess for all the documents that do not match the analyzer's language. I`m not sure the results for the badly-indexed languages will make any sense to users.
> >
> > I understand the issue here, but in most cases the user will do a "french search" on "french content"; he will only expand to non-french content if he was not satisfied by his search. What is just allowed here is to also search in "french" inside all content. That would cover content that has the wrong language setting, as well as any other content. The results might be a bit noisy, but I don't think it's a big issue.
>
> What I think you are actually asking is that all languages are queried, but that results in the requested language get boosted/elevated and are thus displayed before all other language results.
>
> Perhaps this can be achieved by using the lang field as a query field and not as a query filter. This means that it will influence the score. Couple that with a higher boost for the lang field and results in the requested language should be scored better than all the rest (because there will always be a hit for the lang field, resulting in a better score). There might still be some results from other languages that could have a better score than results from the requested language (because they just have more hits on different fields), but it should work in most cases.
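>
> (A rough sketch of such a request, using the dismax parser's boost query; the boost value is an assumption, to be tuned:
>
> defType=dismax
> q=some search terms
> bq=lang:fr^10
>
> The bq clause raises the score of documents in the requested language instead of filtering the other languages out.)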
> However, I`m not sure we want this. Users might see results from other languages as a bug. Problems might also be harder to debug.
>
> Instead, we could do something like Google and, if we see a small number of results, suggest that the user searches in all languages.
>
> In any case, I`m not sure this is such a big issue right now, since it's basically an optimisation that we could choose to do, or let the user do it himself, since he has the tools (UI).
>
> > > Also, this is very similar to the multi-core approach (one core per language), just that you also add documents that are indexed with the wrong analyzers. We have the same problem regarding merging relevance scores across indexes (cores), which is a big turn-off for the original multi-core approach.
> >
> > This is a more serious issue. If it's hard to merge the results spanning multiple cores, this could be a showstopper. However, the solution of having only one Lucene document for all languages is not so cool either, as it would make it difficult to know which of the languages has matched and to present them separately with separate scores.
>
> I agree. The one-document-for-all-languages solution was proposed by Paul, but I don`t think it's the way to go. We are currently only considering the one-document-per-language direction.
>
> > It's really the core issue to decide on. What are the benefits and drawbacks of the different solutions? For each solution, is there something in the UI that you cannot do?
> > So far I've heard:
> >
> > 1/ Presenting different scores for documents in different languages with the same doc name if the title_fr,content_fr method is used
>
> Since we are not considering the one-document-for-all-languages solution, this is not an issue.
>
> > 2/ Merging scores across indexes in the multicore approach
> >
> > Other ? Can we list them in a wiki page ?
>
> > > > In the future, if we are able to "detect" the language of the documents, we could add a lucene field with the "detected" language instead of the "provided" language of the documents, therefore increasing the quality of searches only on documents of a specific language.
> > >
> > > In the previous discussions (on the GSoC thread) we agreed that language in XWiki is known beforehand, so no recognition is required, at least not at document level.
> >
> > Let's forget this
> >
> > > > This latter solution would be the only one that would really work on file attachments, as we have no information about the specific language of file attachments (or even XWiki objects), which are attached to the main document and not to the translated document.
> > >
> > > Yes, this is a problem right now. AFAIU, the plan [1] is to support translated objects and maybe attachments as well. Until then, we could either:
> > > 1) Use the original document's language to index the attachment's content
> >
> > This is not a good solution. If I understand correctly, we could end up not searching in a french attachment because the original document is marked "en".
> > I'm for Paul's solution to index objects and attachments in each translation (if we have separate entities for translated documents).
>
> Yep, we already do this for objects, we just need to do it for attachments as well.
>
> > I understand that in the title_fr,content_fr approach this problem does not happen.
>
> This problem is not related to how we handle multiple translations (separate indexes or not), as long as each translation is a separate entity. Basically, this last part is the problem. If the French translation is a separate entity from the original document (i.e. the one that is in English), any object/attachment *index field* of the English original version will need to be duplicated into the French translation as well, or the French translation risks not getting a hit on the object/attachment.
>
> > > 2) Use a language detection library to try to detect the attachment content's language and index it accordingly.
> >
> > Not sure we can for now
> >
> > > The above could also be applied to objects and their properties.
> > > ----------
> > > [1] http://jira.xwiki.org/browse/XWIKI-69
>
> > > > This latter issue shows that a search on "only french content" should still include the attachments, because we have no idea if the attachments are "french" or "english".
> > >
> > > (The paragraphs below discuss what currently exists and what could be done, ignoring the possible language detection mentioned above)
> > >
> > > Right now a document also indexes the objects' properties in a field called "objcontent". I do this for all translations, thus duplicating the field's value in all translations. I can do the same for attachments. The purpose is, indeed, to be able to find document translations based on hits in their objects/attachments. If a language filter is used and there is a hit in an object, only one document is returned. If there are no language filters, all translations will be returned.
> > It seems we have to do this for now
>
> > > However, if we search for the object/property/attachment itself, it will only be assigned to one language: the language of the original document. This means that if we search in all languages, the object itself will be found too (there is no language filter used). If we add a language filter that is different from the object/property/attachment's original document language, the object/property/attachment will not be found.
> > >
> > > Maybe we can come up with some processing of the query in the search application, that applies the language filter only to documents:
> > >
> > > ((-type:"OBJECT" OR -type:"OBJECT_PROPERTY" OR -type:"ATTACHMENT") OR lang:"<userSelectedLanguage>") -- writing it like this because the default operand is AND in the query filter clause that we use in the Search application.
> > >
> > > The problem with this is that, when a language filter is used, the objects/properties/attachments that are now included in the results might not have the specified language and will pollute the results.
> >
> > I'm not sure I understand. We have an "objcontent" field for each translation that has the full content of objects and properties, but individual object fields we don't have in each translation?
>
> "objcontent" stores the properties (format: "propName:propValue") of each object inside the original document (multi-valued field).
>
> Besides XWikiDocuments, we also index Objects, Properties and Attachments as Lucene/Solr first-class documents. Each of these entries has the wiki, space, name and lang fields set from the document to which they belong. This is what I meant above with "if we search for the object/property/attachment itself".
>
> So, to reiterate, the idea was that if you want to search for an indexed Object (type:"OBJECT"), you will *have* to avoid setting a language filter, or you might not find the object you are looking for, since it is indexed under the language of the original document.
>
> ---
>
> While writing this down, I just thought of an elegant idea to fix this. The lang field could be made multi-valued. This means that, when we index objects, properties and attachments, we could set in the lang field all the values corresponding to all the existing translations of the owning document. Example:
>
> An Object:
> id: xwiki:Main.Page^XWiki.Class[0] <-- (object reference)
> class: XWiki.Class
> wiki: xwiki
> space: Main
> name: Page
> lang: en <-- the proposal to make it multi-valued would look like this,
>       fr     stored like a list of values.
>       de
> type: OBJECT
>
> Note that this solution would affect document languages as well (since they share the same schema), but we will just put one value and it will not affect queries. Queries will still be written like: "lang:en"
>
> If we apply this solution, even if a language filter exists in the user's query, it will still hit the object, because the lang field of the object contains the value.
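>
> (In schema.xml this should only be an attribute change on the lang field, something like -- the field type is an assumption:
>
> <field name="lang" type="string" indexed="true" stored="true" multiValued="true"/>
>
> followed by a reindex.)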
> > The more I see all the issues, the more I lean towards a separate-index-per-language solution.
>
> Again, these issues are not related to that, just to the fact that Document translations are first-class Lucene documents.
>
> > The reason I do is that the main need is for a non-English user to have very relevant results in his own language. Therefore we need to make sure that all content that the users have published has the chance to be analyzed using the non-English language analyzer. So indexing all objects and attachments with the relevant language analyzer is the solution. This is also why I proposed to index all content in this specific index regardless of the language declared, which would only be used in the UI to limit searches to the specific language.
> >
> > In this view:
> >
> > 0/ There would be a language-specific index per language, with the objects and attachments indexed only in the language of the index
> > 1/ The user chooses the language in which he searches
> > 2/ Automatically, that sets the index to be used to be the "french" index
> > 3/ Automatically, that presets the span of the search to be limited to declared "french" documents
> > 4/ The user can decide to go for non-french documents at his own risk, knowing that the results might be weird because of wrong analysis (this is what happens today with english analysis over french documents)
> >
> > The benefit here is that you don't have the issue of merging scores over multiple indexes, since you would never have to do a search across multiple indexes. Searches are still simple to write. By default, results are quite relevant, since you limit the search to the french-declared documents (this would be the same as limiting your search to title_fr, content_fr) and you still cover what needs to be covered (objects and attachments).
> > Another benefit is that this falls back gracefully to monolingual, as you just have to have one index in the language declared for the monolingual wiki.
> > The drawback is that the indexing is more costly and there is duplicated content in the index. However, it is the Admin that says which languages he wants available, and he takes responsibility for the resources this needs.
> >
> > Could this solution work ?
>
> Unfortunately, I am still not convinced by this approach. Besides the complexities of managing multiple cores (each with its own schema and config files), the user is exposed to a lot of unnecessary, badly indexed data that will make him stay away from the search feature, as has happened with Lucene so far.
>
> I believe this discussion thread has come up with some nice solutions to most of our problems and that the multiple-fields direction is one that can give us relevant results, properly indexed content for all languages (even when searching in all languages) and a good solution for objects, attachments and properties (though, again, this last one is not related to this specific choice).
>
> I will try to make a summary of the ideas from this thread and put them into a wiki page that documents the design/progress of the multi-lingual related work.
>
> Thank you very much for now and, of course, this does not mean that the discussion is over in any way :)
>
> -Eduard
>
> > Ludovic
>
> > > Thanks,
> > > Eduard
>
> > > > Ludovic
> > > >
> > > > 2012/11/26 Eduard Moraru <[email protected]>
> > > >
> > > > > Hi devs,
> > > > >
> > > > > Any other input on this matter?
> > > > > To summarize a bit, if we go with the multiple fields for each language, we end up with an index like:
> > > > >
> > > > > English version:
> > > > > id: xwiki:Main.SomeDocument_en
> > > > > language: en
> > > > > space: Main
> > > > > title_en: XWiki document
> > > > > doccontent_en: This is some content
> > > > >
> > > > > French version:
> > > > > id: xwiki:Main.SomeDocument_fr
> > > > > language: fr
> > > > > space: Main
> > > > > title_fr: XWiki document
> > > > > doccontent_fr: This is some content
> > > > >
> > > > > The Solr configuration is generated by some XWiki UI that returns a zip that the admin has to unpack in his (remote) Solr instance. This could be automated for the embedded instance. This operation is to be performed each time an admin changes the indexed languages (rarely, or even only once).
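> > > > >
> > > > > (The per-language fields would be declared once, as dynamic fields; a sketch, assuming the text_en/text_fr field types shipped with the Solr example schema:
> > > > >
> > > > > <dynamicField name="*_en" type="text_en" indexed="true" stored="true"/>
> > > > > <dynamicField name="*_fr" type="text_fr" indexed="true" stored="true"/>
> > > > >
> > > > > so that title_en, doccontent_en, etc. automatically get the English analyzer chain, and the _fr fields the French one.)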
> > > > > Querying such a schema is a bit tricky when you are interested in more than one language, because you have to add all the clauses (title_en, title_fr, etc.) specific to the languages you are interested in.
> > > > >
> > > > > Some extra fields might also be added, like title_ws (for whitespace tokenization only), that take various approaches to the indexing operation, with the aim of improving the relevancy.
> > > > >
> > > > > One solution to simplify the query for API clients would be to use fields like "title" and "doccontent" and to put as values very lightly (or not at all) analyzed content, as Paul suggested. This would allow applications to write simple (and maybe backwards-compatible) queries that will still work, but will not catch some of the nuances of specific languages. As far as I`ve seen until now, applications are not very interested in nuances, but rather in filtering the results, a task for which this solution might be well suited. Of course, nothing stops applications from using the *new* and more expressive fields that are properly analyzed.
> > > > >
> > > > > Thus, the search application will be the major beneficiary of these analyzed fields (title_en, title_fr, etc.), while still allowing applications to get their job done (through generic, but less/not analyzed fields like "title", "doccontent", etc.).
> > > > >
> > > > > WDYT?
> > > > >
> > > > > Thanks,
> > > > > Eduard
> > > > >
> > > > > On Wed, Nov 21, 2012 at 10:49 PM, Eduard Moraru <[email protected]> wrote:
> > > > >
> > > > > > Hi Paul,
> > > > > >
> > > > > > I was counting on your feedback :)
> > > > > >
> > > > > > On Wed, Nov 21, 2012 at 3:04 PM, Paul Libbrecht <[email protected]> wrote:
> > > > > >
> > > > > >> Hello Eduard,
> > > > > >>
> > > > > >> it's nice to see you take this further.
> > > > > >>
> > > > > >> > This issue has already been discussed previously [1], during the GSoC project, but I am not particularly happy with the chosen approach.
> > > > > >> > When handling multiple languages, there are generally [2][3] 3 different approaches:
> > > > > >> >
> > > > > >> > 1) Indexing the content in a single field (like title, doccontent, etc.)
> > > > > >> > - This has the advantage that queries are clear and fast
> > > > > >> > - The disadvantage is that you cannot run very well tuned analyzers on the fields, having to resort to (at best) basic tokenization and lowercasing.
> > > > > >> >
> > > > > >> > 2) Indexing the content in multiple fields, one field for each language (like title_en, title_fr, doccontent_en, doccontent_fr, etc.)
> > > > > >> > - This has the advantage that you can easily specify (as dynamic fields) that *_en fields are of type text_en (and analyzed by an english-centered chain of analyzers), *_fr of type text_fr (focused on french), etc., thus making the results much better.
> > > > > >>
> > > > > >> I would add one more field here: title_ws and text_ws, where the full text is analyzed just as words (using the whitespace tokenizer?). A match there would even be preferred to a match in the text fields below.
> > > > > >>
> > > > > >> (maybe that would be called title and text?)
> > > > > >>
> > > > > >> > - The disadvantage is that querying such a schema is a pain. If you want all the results in all languages, you end up with a big and expensive query.
> > > > > >>
> > > > > >> Why is this an issue?
> > > > > >> Dismax does it for you for free (thanks to the "qf" parameter that gives a weight to each of the fields).
> > > > > >> This is an issue only if you start to have more than 100 languages or so...
> > > > > >> Lucene, the underlying engine of Solr, handles thousands of clauses in a query without an issue (this is how prefix-queries are handled... they are expanded to a query for any of the terms that match the prefix; a setting deep somewhere, which is about 2000, avoids this exploding).
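> > > > > >>
> > > > > >> (For instance -- a sketch, with illustrative field names and boosts:
> > > > > >>
> > > > > >> defType=dismax
> > > > > >> q=open source wiki
> > > > > >> qf=title_en^2 doccontent_en title_fr^2 doccontent_fr
> > > > > >>
> > > > > >> and dismax expands the query over all the listed fields with the given weights.)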
> > > > > >
> > > > > > Sure, Solr is great when you want to do simple queries like "XWiki Open Source". However, since in XWiki we also expose the Solr/Lucene query APIs to the platform, there will be (as it is currently with Lucene) a lot of extensions wanting to do search using this API. These extensions (like the search suggest, for example, rest search, etc.) want to do something like "title:'Open Source' AND type:document AND doccontent:XWiki". Because option 2) is so messy in its fields, it would mean that the extension would have to come up with a query like "title_en:'Open Source' AND type:document AND doccontent_en:XWiki" (assuming that it is only limited to the current -- english or whatever -- language; what happens if it wants to do that no matter what language? It would have to specify each possible combination, because we can't use generic field names).
> > > > > >
> > > > > > Solr's approach works for using it in your web application's search input, in a specific usecase, where you have precisely specified the default search fields and their boosts inside your schema.xml. However, as a search API, using option 2) you are making the life of anyone else wanting to use the Solr search API really hard. Also, your search application will work nicely when the user enters a simple query in the input field, but an advanced user will suffer the same fate when trying to write an advanced query, thus not relying on the default query (computed by Solr based on schema.xml).
> > > > > >
> > > > > > Also, based on your note above regarding improvements like title_ws and such: again, all of these are very well suited for the search application use case, together with the default query that you configure in schema.xml, making the search results perform really well. However, what do all these fields mean to another extension wanting to do search? Will it have to handle all these implementation details to query for title, content and such? I`m not sure how well this would work in practice.
> > > > > >
> > > > > > Unrealistic idea(?): perhaps we should come up with an abstract search language (Solr/Lucene clone) that parses the searched fields and hides the complexities of all the indexed fields, allowing to write simple queries like "title:XWiki", while this gets translated to "title_en:XWiki OR title_fr:XWiki OR title_de:XWiki..." :)
> > > > > >
> > > > > > Am I approaching this wrong by trying to have both a tweakable/tweaked search application AND a search API? Are the two not compatible? Do we have to sacrifice search result performance (no language-specific stuff) to be able to have a usable API?
> > > > > >
> > > > > >> > If you want just some language, you have to read the right fields (e.g. title_en) instead of just getting a clear field name (title).
> > > > > >>
> > > > > >> You have to be careful, this is really only if you want to be specific. In this case, it is likely that you also do not want so much stemming. My experience, which was before dismax on curriki.org, has made it so that any query that is a bit specific is likely to not desire stemming.
> > > > > >
> > > > > > Can you please elaborate on this? I`m not sure I understand the problem.
> > > > > >
> > > > > >> > -- Also, the schema.xml definition is a static one in this concern, requiring you to know beforehand which languages you want to support (for example, when defining the default fields to search for). Adding a new language requires you to start editing the xml files by hand.
> > > > > >>
> > > > > >> True, but the available languages are almost all hand-coded. You could generate the schema.xml based on the available languages, if not hand-generated?
> > > > > >
> > > > > > Basically, I would have to output a zip with schema.xml, solrconfig.xml and then all the resources (stopwords, synonyms, etc.) specific to all the selected languages, for the languages that we can provide out of the box. For other languages, the admin would have to get dirty with the xmls.
> > > > > >
> > > > > >> There's one catch with this approach, which is new to me but seems to be quite important for implementing it: the idf should be modified (the Similarity class should be), so that the total number of documents is the total number of documents having that language.
> > > > > >> See: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201211.mbox/%[email protected]%3E
> > > > > >> The solution sketched there sounds easy but I have not tried it.
> > > > > >>
> > > > > >> > 3) Indexing the content in different Solr cores (indexes), one for each language. Each core requires its own directory and configuration files.
> > > > > >> > - The advantage is that queries are clean to write (like option 1) and that you have a nice separation
> > > > > >> > - The disadvantage is that it's difficult to get it right (administrative issues) and then you also have the (considerable) problem of having to fix the relevancy score of a query result that has entries from different cores; each core has its own relevancy computed and does not consider the others.
> > > > > >> > - To make it even worse, it seems that you can not [5] also push the configuration files to a remote Solr instance when creating a new core programmatically. However, if we are running an embedded Solr instance, we could provide a way to generate the config files and write them to the data directory.
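> > > > > >> >
> > > > > >> > (For reference, the per-language cores would be declared in solr.xml along these lines -- core names are only illustrative:
> > > > > >> >
> > > > > >> > <cores adminPath="/admin/cores">
> > > > > >> >   <core name="xwiki_en" instanceDir="xwiki_en"/>
> > > > > >> >   <core name="xwiki_fr" instanceDir="xwiki_fr"/>
> > > > > >> > </cores>
> > > > > >> >
> > > > > >> > with each instanceDir holding its own conf/schema.xml and conf/solrconfig.xml.)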
> > > > > >>
> > > > > >> Post-processing results is very very very dangerous, as performance is at risk (e.g. if a core does not answer)... I would tend to avoid that as much as possible.
> > > > > >
> > > > > > Not really related, but this reminds me of the post-processing that I do for checking view rights over the returned results, but that's another discussion that we will probably need to have :)
> > > > > >
> > > > > >> > Currently I have implemented option 1) in our existing Solr integration, which is also more or less compatible with our existing Lucene queries, but I would like to find a better solution that actually analyses the content.
> > > > > >> >
> > > > > >> > During GSoC, option 2) was preferred, but the implementation did not consider practical reasons like the ones described above (query complexity, user configuration, etc.)
> > > > > >>
> > > > > >> True, Savitha surfed the possibility of having different solr documents per language. I still could not be sure that this was not showing the document match only in one language.
> > > > > >>
> > > > > >> However, indicating which language it was matched in is probably useful...
> > > > > >
> > > > > > Already doing that.
> > > > > >
> > > > > >> Funnily, cross-language retrieval is a mature research field, but retrieval for multilanguage users is not so!
> > > > > >>
> > > > > >> > On a related note, I have also watched an interesting presentation [3] about how Drupal handles its Solr integration and, particularly, a plugin [4] that handles the multilingual aspect.
> > > > > >> > The idea seen there is that you have this UI that helps you generate configuration files, depending on your needs. For instance, you (the admin) check that you need search for the languages English, French and German, and the ui/extension gives you a zip with the configuration you need to use in your (remote or embedded) solr instance. The configuration for each language comes preset with the analyzers you should use for it and the additional resources (stopwords.txt, synonyms.txt, etc.).
> > > > > >> > This approach helps avoid the need for admins to be forced to edit xml files and could also still be useful for other cases, not only option 2).
> > > > > >>
> > > > > >> Generating sounds like an easy approach to me.
> > > > > >
> > > > > > Yes, however I don`t like the fact that we can not do everything from the webapp and the admin needs to access the filesystem to install the given configuration in the embedded/remote solr directory. Lucene does not have this problem now. It just works with XWiki and everything is done from the XWiki UI. I feel that losing this commodity will not be very well received by users that now have some new install steps to get XWiki running.
> > > > > >
> > > > > > Well, of course, for the embedded solr version, we could handle it like we do now and push the files directly from the webapp to the filesystem. Since embedded will be the default, it should be OK and avoid the extra install step. Users with a remote solr machine should have the option to get the zip instead.
> > > > > >
> > > > > > Not sure if we can apply the new configuration without a restart, but I`ll have to look more into it. I know the multi-core architecture supports something like this, but I will have to see the details.
> > > > > >
> > > > > >> > All these problems basically come from the fact that there is no way to specify in the schema.xml that, based on the value of a field (like the field "lang" that stores the document language), you want to run this or that group of analyzers.
> > > > > >>
> > > > > >> Well, this is possible with ThreadLocal, but it is not necessarily a good idea. Also, it is very common that users formulate queries without indicating their language, and thus you need to "or" the user's queries through multiple languages (e.g. given by the browser).
> > > > > >>
> > > > > >> > Perhaps a solution would be a custom kind of "AggregatorAnalyzer" that would call other analyzers at runtime, based on the value of the lang field. However, this solution could only be applied at index time, when you have the lang information (in the solrDocument to be indexed), but when you perform the query, you can not analyze the query text, since you do not know the language of the field you're querying (it was determined at runtime - at index time) and thus do not know what operations to apply to the query (to reduce it to the same form as the indexed values).
> > > > > >>
> > > > > >> How would that look at query time?
> > > > > >
> > > > > > That's what I was saying: at query time, the searched term will not get analyzed by the right chain. When you search in a single language, you could add that language as a query filter and then you could apply the right chain, but when searching in 2 or more (or no, meaning all) languages, you are stuck.
> > > > > >
> > > > > >> > I have also read another interesting analysis [6] on this problem, which elaborates on the complexities and limitations of each option. (Ignore the Rosette stuff mentioned there)
> > > > > >> >
> > > > > >> > I have been thinking about this for some time now, but the solution is probably somewhere in between, finding an option that is acceptable while not restrictive. I will probably also send a mail on the Solr list to get some more input from there, but I get the feeling that, whatever solution we choose, it will most likely require the users to at least copy (or even edit) some files into some directories (configurations and/or jars), since it does not seem to be easy/possible to do everything on-the-fly, programmatically.
> > > > > >>
> > > > > >> The only hard step is when changing the supported languages, I think. In this case, when automatically generating the index, you need to warn the user. The admin UI should have a checkbox "use generated schema" or a textarea for the schema.
> > > > > >
> > > > > > Please see above regarding configuration generation. Basically, since we are going to support both embedded and remote solr instances, we could support things like editing the schema from XWiki only for the embedded instance, but not for the remote one.
> > > > > > We might end up having separate UIs for each case, since we might want to exploit the flexibility of the embedded one as much as possible.
> > > > > >
> > > > > >> Those that want particular fields and tunings need to write their own schema.
> > > > > >>
> > > > > >> The same UI could also include whether to include a phonetic track or not (then require reindexing).
> > > > > >>
> > > > > >> hope it helps.
> > > > > >
> > > > > > Yes, very helpful so far. I`m counting on your expertise with Lucene/Solr for the details. My current approach is a practical one, without previous experience on the topic, so I`m still doing mostly guesswork in some areas.
> > > > > >
> > > > > > Thanks,
> > > > > > Eduard
> > > > > >
> > > > > >> paul
> >
> > --
> > Ludovic Dubost
> > Founder and CEO
> > Blog: http://blog.ludovic.org/
> > XWiki: http://www.xwiki.com
> > Skype: ldubost GTalk: ldubost

--
Thomas Mortagne
_______________________________________________
devs mailing list
[email protected]
http://lists.xwiki.org/mailman/listinfo/devs

