Hi, thanks for your response:

> On the web it is often hard to trust such (e.g. because of people working
> in multiple languages, internet cafés...) but... it is your choice.

Our web app has a language selector for the user to choose the GUI
language.

> After?
> Would "shallow matches" in the right language come after "precise matches"
> in a wrong language?

Yes, that's the idea. Either that, or present the results per language in
separate result sets (with sorting options per result set, etc.).

> In both solr and simple lucene, using different fields allows different
> analyzers, that's how you want things (e.g. a different stemmer per
> language).

Yes, in the single-index solution we do use different analyzers for
different fields.

> The important bit is to use query-expansion.
> Given a query of the user (with params or not, with text-queries), expand
> it to a query where the "normal text" is expected to be in the right
> language, but maybe also in one of the other languages (that the browser
> says, that your platform supports), with less weight of course.

We do something like that now in the single-index solution: results in the
requested language are boosted enough that they are always on top.

I don't think, though, that this addresses my main point: the frequency of
a term differs from one domain (in this case, one language) to another.
This means that if the domains are lumped together in one index, the IDF
value for a term is less "accurate" than if multiple, separate indexes were
used. A term is more or less frequent in one domain or another for a
reason. Relevance ranking is affected by that, and is more accurate if
separate indexes are used. I think this seems logical. I just don't know
how much impact it really has, and whether it is worth dealing with by
presenting separate result sets from separate index searches.

Thanks for your reply!

Heikki Doeleman

On Tue, Jan 3, 2012 at 2:51 PM, Paul Libbrecht <p...@hoplahup.net> wrote:

>
> On 3 Jan 2012, at 13:56, heikki wrote:
>
> > In our case, it is "known" in which language the user is searching
> > (because he tells us, and if he doesn't, we use the current GUI
> > language).
>
> On the web it is often hard to trust such (e.g. because of people working
> in multiple languages, internet cafés...) but... it is your choice.
>
> > Results are returned so that results in the requested language are
> > ordered on top, and within that, ordered by relevance. Results in other
> > languages are also returned, and presented after the requested-language
> > results, ordered by relevance.
>
> After?
> Would "shallow matches" in the right language come after "precise matches"
> in a wrong language?
>
> > If the results in the requested language contain say one which has term A
> > and one which has term B, their positions in the relevance ranking
> > (within the requested-language results on top) can be influenced by
> > occurrences of terms A and B in the other languages, if a single search
> > is used.
> >
> > I agree to the apples/oranges remark: if a term occurs in more than one
> > language, likely its IDF frequency is different for each language, so to
> > have the best relevance ranking there should be separate indexes for each
> > language. And searches should be really separate searches (no
> > MultiSearcher which would produce combined relevance scores). So the
> > results should also be presented as several, separate result sets.
>
> I believe the right solution for this is simple: use different fields per
> language.
>
> In both solr and simple lucene, using different fields allows different
> analyzers, that's how you want things (e.g. a different stemmer per
> language).
>
> Using different indexes is certainly a hassle, different fields not really.
>
> The important bit is to use query-expansion.
> Given a query of the user (with params or not, with text-queries), expand
> it to a query where the "normal text" is expected to be in the right
> language, but maybe also in one of the other languages (that the browser
> says, that your platform supports), with less weight of course.
>
> Query expansion is done by post-processing the result of the query-parser
> in my case.
>
> Then you can also differentiate fields which are precise matches and less:
> make one field with exact match (using the whitespace-tokenizer), one field
> with stemmed match (e.g. using the porter family), one field with phonetic
> matches.
>
> Hope it helps.
>
> paul
>
> > Does anyone have experience with this? Opinions? Is the improved
> > relevance per language worth the "hassle" of having separate indexes,
> > doing separate searches and presenting results per language? We do
> > already take care of using appropriate stopwords/different analyzers when
> > indexing and searching a particular language, but that's a different
> > issue obviously.
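
To make the IDF concern in this thread a bit more concrete, here is a
minimal sketch in plain Python (no Lucene involved). It assumes the IDF
formula of Lucene's classic TF-IDF similarity, idf = 1 + ln(numDocs /
(docFreq + 1)). The toy documents and the helper names (idf, doc_freq,
compare) are made up for illustration, and whitespace splitting stands in
for real per-language analyzers. It only shows the direction of the effect,
not how large it would be on a real collection.

```python
import math


def idf(doc_freq, num_docs):
    """IDF as in Lucene's classic TF-IDF similarity (assumed here)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def doc_freq(docs, term):
    """Number of documents containing the term (whitespace tokens only)."""
    return sum(1 for d in docs if term in d.lower().split())


# Hypothetical toy corpora, one per language.
english_docs = [
    "the cast die is made of steel",
    "the map shows rivers and lakes",
    "water quality of the lakes",
    "steel bridges over rivers",
]
german_docs = [
    "die karte zeigt die seen",
    "die daten der karte",
    "die seen und die berge",
    "karte der berge und seen",
]


def compare(term):
    """IDF of a term in per-language indexes versus one combined index."""
    combined = english_docs + german_docs
    return {
        "english_only": idf(doc_freq(english_docs, term), len(english_docs)),
        "german_only": idf(doc_freq(german_docs, term), len(german_docs)),
        "combined": idf(doc_freq(combined, term), len(combined)),
    }


if __name__ == "__main__":
    for term in ("die", "steel"):
        print(term, compare(term))
```

In this toy data "die" is rare in English but very common in German, so its
IDF in the combined index falls between the two per-language values: an
English query for "die" gets a lower term weight than it would against an
English-only index. Whether that shift is big enough to change rankings in
practice is exactly the open question.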