Re: Language support
Synonyms are also domain-specific. A synonym set for one area may be completely wrong in another: in cooking, arugula and rocket are the same thing; in military or aerospace, missile and rocket are very similar.

I would start with librarians. They maintain controlled vocabularies (called "thesauri"). Usually, a thesaurus has the official classification terms but also has "entry terms". The entry terms are alternate terms that are used to get to the primary term. For example, the category might be "electric vehicle", but an entry term could be "zero emission vehicle".

Good luck. I had a hard time finding thesauri online a few years ago.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Aug 23, 2016, at 7:38 AM, Bradley Belyeu wrote:
>
> Hi, I'm trying to find a synonym list for any of the following languages:
> Catalan, Farsi, Hindi, Korean, Latvian, Dutch, Romanian, Thai, and Turkish.
> Does anyone know of resources where I can get a synonym list for these
> languages?
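For Solr specifically, whichever thesaurus such a list comes from, it gets wired in through a synonym filter in the analysis chain. A minimal sketch, assuming a hypothetical domain-specific file named synonyms_cooking.txt:

```xml
<!-- Hypothetical field type with a domain-specific synonym list.
     The file name synonyms_cooking.txt is an assumption. -->
<fieldType name="text_cooking" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Query-time only: the list can be swapped without reindexing -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_cooking.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
```

The synonyms file uses one comma-separated group per line, e.g. `arugula, rocket`. Keeping a separate file per domain avoids exactly the cross-domain collisions described above.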
Language support
Hi, I’m trying to find a synonym list for any of the following languages: Catalan, Farsi, Hindi, Korean, Latvian, Dutch, Romanian, Thai, and Turkish. Does anyone know of resources where I can get a synonym list for these languages?
Re: What are the best practices on Multiple Language support in Solr Cloud ?
Thanks Nicole. Leveraging dynamic field definitions is a great idea; it will probably work for me, as I have a bunch of fields which are indexed as string.

Just curious about the sharding: are you using SolrCloud? I thought of taking the dedicated shard/core route, but since I'm using a composite key (for dedup), managing a dedicated core can cause issues at times.

As far as the single-field representation goes, thanks for validating my concern. It's probably best used when you have to address a multi-lingual search.
Re: What are the best practices on Multiple Language support in Solr Cloud ?
Hi Shamik,

I don't have an answer for you, just a couple of comments.

Why not use dynamic field definitions in the schema? As you say most of your fields are not analysed, you can just add a language tag (_en, _fr, _de, ...) to the field name when you index or query. Then you can add languages as you need without having to touch the schema. For fields that you do analyse (stop words or synonyms), you'll have to explicitly define a field type for them.

My experience with docs that are in two or three main languages is that single-core vs. multi-core has not been that critical; sharding and replication made a bigger difference for us. You could put English in one core and everything else in another.

What we tried to do was just index stuff to the same field, that is, French and English both getting indexed to the contents or title field (we have our own tokenizer and filter chain, so we did actually analyse them differently), but we got into lots of problems with tf-idf, so I'd advise against that. The motivation was that we wanted multilingual results. Terry's approach here is much better and, as you thought, addresses the multi-lingual requirement, but I still don't think it totally addresses the tf-idf problem. So if you don't need multilingual results, don't go that route.

I am curious to see what other people think.

Niki
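Niki's dynamic-field suggestion can be sketched in schema.xml roughly like this (the field type names such as text_en are assumptions; Solr's example schema ships with similar ones):

```xml
<!-- Per-language dynamic fields: any field name ending in _en / _fr / _de
     picks up the matching analyzer without a per-field schema entry. -->
<dynamicField name="*_en" type="text_en" indexed="true" stored="true"/>
<dynamicField name="*_fr" type="text_fr" indexed="true" stored="true"/>
<dynamicField name="*_de" type="text_de" indexed="true" stored="true"/>
<!-- Non-analysed fields can share one catch-all string pattern -->
<dynamicField name="*_s"  type="string"  indexed="true" stored="true"/>
```

Adding a new language then means adding one field type and one dynamicField line, not forty field definitions.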
What are the best practices on Multiple Language support in Solr Cloud ?
Hi, I'm trying to implement multiple language support in SolrCloud (4.7). Although we have different languages in the index, we have only been supporting English for indexing and querying.

To provide some context: our current index size is 35 GB, with close to 15 million documents. We have two shards with two replicas per shard. I'm using a composite id to support de-duplication, which puts documents having the same (dedup) field value on a specific shard. The language is known up front for every document being indexed, which saves the need for runtime language detection. Similarly, during query time the language will be known as well. Beyond that, there's no need for multi-lingual support.

Based on my understanding so far, there are three approaches which are widely adopted: multi-field indexing, multi-core indexing, and multiple languages in one field (based on Solr in Action).

The first option seems easy to implement. But I have around 40 fields currently being indexed, though a majority of them are type=string and not analyzed. I'm planning to support around 10 languages, which translates to 400 field definitions in the same schema, and this is poised to grow with the addition of languages and fields. My apprehension is whether this approach becomes a maintenance nightmare. Does it affect overall scalability? Does it affect existing features like Suggester, Spellcheck, etc.? I was thinking of including the language as part of the id key. It would look like Language!Dedup_id!url, so that documents are spread across the two shards.

The second option of a dedicated core per language sounds easy in terms of maintaining config files. Routing requests will also be fairly easy, as the language is always known up front, both at indexing and query time. But as I looked into the documents, 60% of our total index will be in English, while the remaining 40% will be split across the remaining 10-14 languages. Some languages have content only in the few thousands of documents, which perhaps doesn't merit a dedicated core. On top of that, this approach has the potential to grow into a complex infrastructure which might be hard to maintain.

I read about the use of multiple languages in a single field in Trey Grainger's book. It looks like a great approach, but I'm not sure it is meant to address my scenario. My first impression is that it's more geared towards supporting multi-lingual content, but I may be completely wrong. Also, this is not supported by Solr/Lucene out of the box.

I know there are a lot of people in this group who have excelled at supporting multiple languages in Solr. I'm trying to gather their inputs and experience on best practice to help me decide on the right approach. Any pointer on this will be highly appreciated.

Thanks,
Shamik
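The three-part routing key proposed above could look like this in a plain XML update. This is only a sketch: all the values are made up for illustration, and multi-level (three-part) composite ids were only added around Solr 4.7, so the version matters here.

```xml
<!-- Sketch of a language!dedup_id!url composite routing key.
     Field names and values are illustrative assumptions. -->
<add>
  <doc>
    <field name="id">en!3f9c2a!http://example.com/doc1</field>
    <field name="language">en</field>
    <field name="title_en">Getting started</field>
  </doc>
</add>
```

All documents sharing the `en!3f9c2a` prefix co-locate on one shard, which preserves the dedup behaviour while still letting the language participate in routing.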
Re: eDisMax, multiple language support and stopwords
Happy to see someone has a similar solution to ours. We have a similar multi-language search feature, and we index different language content to _fr, _en fields like you've done. But at search time we require a language code as a parameter to specify the language the client wants to search on, which is normally decided by the website visited, such as:

qf=name description&language=en

and in our search components we find the right fields, name_en and description_en, to be searched on.

We used to support searching across all languages and removed that later. As the site tells the customer which languages are supported, we also don't think we have many users on our web sites who know more than two languages and need to search them at the same time.

On 7 November 2013 23:01, Tom Mortimer tom.m.f...@gmail.com wrote:
> Ah, thanks Markus. I think I'll just add the Boolean operators to the
> stopwords list in that case. Tom
> [...]

--
All the best
Liu Bo
eDisMax, multiple language support and stopwords
Hi all, Thanks for the help and advice I've got here so far! Another question: I want to support stopwords at search time, so that e.g. the query "oscar and wilde" is equivalent to "oscar wilde" (this is with lowercaseOperators=false). Fair enough, I have the stopword "and" in the query analyser chain. However, I also need to support French as well as English, so I've got _en and _fr versions of the text fields, with appropriate stemming and stopwords. I index French content into the _fr fields and English into the _en fields. I'm searching with eDisMax over both versions, e.g.:

<str name="qf">headline_en headline_fr</str>

However, this means I get no results for "oscar and wilde". The parsed query is:

(+((DisjunctionMaxQuery((headline_fr:osca | headline_en:oscar))
DisjunctionMaxQuery((headline_fr:and))
DisjunctionMaxQuery((headline_fr:wild | headline_en:wild)))~3))/no_coord

If I add "and" to the French stopwords list, I *do* get results, and the parsed query is:

(+((DisjunctionMaxQuery((headline_fr:osca | headline_en:oscar))
DisjunctionMaxQuery((headline_fr:wild | headline_en:wild)))~2))/no_coord

This implies that the only solution is to have a minimal, shared stopwords list for all languages I want to support. Is this correct, or is there a way of supporting this kind of searching with per-language stopword lists?

Thanks for any ideas!
Tom
RE: eDisMax, multiple language support and stopwords
This is an ancient problem. The issue here is your mm parameter: it gets confused because different numbers of tokens are filtered/emitted for the separate fields, so it is never going to work just like this. The easiest option is to not use the stop filter.

http://lucene.472066.n3.nabble.com/Dismax-Minimum-Match-Stopwords-Bug-td493483.html
https://issues.apache.org/jira/browse/SOLR-3085

-----Original message-----
From: Tom Mortimer tom.m.f...@gmail.com
Sent: Thursday 7th November 2013 12:50
To: solr-user@lucene.apache.org
Subject: eDisMax, multiple language support and stopwords
> [...]
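Short of dropping the stop filter entirely, the "minimal shared stopwords list" workaround amounts to referencing one stopword file from every language's analyzer, so each qf field drops the same tokens and the computed mm count stays consistent. A sketch, with assumed file and type names:

```xml
<!-- Both language types reference the SAME minimal stopword file
     (stopwords_shared.txt is an assumed name), so "and" is removed
     from headline_en and headline_fr alike and mm counts agree. -->
<fieldType name="text_en" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords_shared.txt" ignoreCase="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
<fieldType name="text_fr" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords_shared.txt" ignoreCase="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French"/>
  </analyzer>
</fieldType>
```

The cost is that language-specific stopwords (French "le", English "the") either go into the shared list for everyone or stay in the index.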
Re: eDisMax, multiple language support and stopwords
Ah, thanks Markus. I think I'll just add the Boolean operators to the stopwords list in that case.

Tom

On 7 November 2013 12:01, Markus Jelsma markus.jel...@openindex.io wrote:
> This is an ancient problem. The issue here is your mm parameter: it gets
> confused because different numbers of tokens are filtered/emitted for the
> separate fields, so it is never going to work just like this. The easiest
> option is to not use the stop filter.
> http://lucene.472066.n3.nabble.com/Dismax-Minimum-Match-Stopwords-Bug-td493483.html
> https://issues.apache.org/jira/browse/SOLR-3085
> [...]
Re: copyField at search time / multi-language support
Tom, to solve this kind of problem, if I understand it well, you could extend the query parser to support something like meta-fields. I'm currently developing a QueryParser plugin to support a specific syntax. The support of meta-fields to search on different fields (multiple languages) is one of the functionalities that this parser will contain.

Ludovic.

2011/3/29 Markus Jelsma:
> I haven't tried this as an UpdateProcessor but it relies on Tika, and that
> LanguageIdentifier works well, except for short texts.
> [...]

-
Jouve
France.
Re: copyField at search time / multi-language support
This may not be all that helpful, but have you looked at edismax? https://issues.apache.org/jira/browse/SOLR-1553 It allows the full Solr query syntax while preserving the goodness of dismax. This is standard equipment on 3.1, which is being released even as we speak, and I also know it's being used in production situations. If going to 3.1 is not an option, I know people have applied that patch to 1.4.1, but I haven't done it myself.

Best
Erick

On Mon, Mar 28, 2011 at 4:45 AM, Tom Mortimer t...@flax.co.uk wrote:
> [...]
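With edismax, the search-time "copy" effectively becomes a qf list: one q parameter fans out across all the per-language fields. A sketch of a handler configured that way (field names beyond Tom's mytext_de/mytext_cjk examples are assumptions):

```xml
<!-- solrconfig.xml sketch: edismax queries every per-language field
     from a single q, so no search-time copyField is needed. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">mytext_de mytext_cjk mytext_fr mytext_en</str>
  </lst>
</requestHandler>
```

A query then stays as short as `q=oscar wilde`, regardless of how many language fields exist.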
copyField at search time / multi-language support
Hi, Here's my problem: I'm indexing a corpus with text in a variety of languages. I'm planning to detect these at index time and send the text to one of a set of suitably-configured fields (e.g. mytext_de for German, mytext_cjk for Chinese/Japanese/Korean, etc.). At search time I want to search all of these fields. However, there will be at least 12 of them, which could lead to a very long query string. (Also I need to use the standard query parser rather than dismax, for full query syntax.)

Therefore I was wondering if there was a way to copy fields at search time, so I can have my query on a single mytext field and have it copied to mytext_de, mytext_cjk etc. Something like:

<copyQueryField source="mytext" dest="mytext_de"/>
<copyQueryField source="mytext" dest="mytext_cjk"/>
...

If this is not currently possible, could someone give me some pointers for hacking Solr to support it? Should I subclass solr.SearchHandler? I know nothing about Solr internals at the moment...

thanks,
Tom
Re: copyField at search time / multi-language support
On Mon, Mar 28, 2011 at 2:15 PM, Tom Mortimer t...@flax.co.uk wrote:
> Hi, Here's my problem: I'm indexing a corpus with text in a variety of
> languages. I'm planning to detect these at index time and send the text to
> one of a set of suitably-configured fields (e.g. mytext_de for German,
> mytext_cjk for Chinese/Japanese/Korean etc.) At search time I want to
> search all of these fields. However, there will be at least 12 of them,
> which could lead to a very long query string. (Also I need to use the
> standard query parser rather than dismax, for full query syntax.)

Sorry, unable to understand this. Are you detecting the language and, based on that, indexing to one of mytext_de, mytext_cjk, etc., or does each field have mixed languages? If the former, why could you not also detect the language at query time (or have separate query sources for users of different languages), and query the appropriate field based on the known language to be searched?

> Therefore I was wondering if there was a way to copy fields at search
> time, so I can have my mytext query in a single field and have it copied
> to mytext_de, mytext_cjk etc. Something like:
> <copyQueryField source="mytext" dest="mytext_de"/>
> <copyQueryField source="mytext" dest="mytext_cjk"/>
> [...]

This is not possible as far as I know, and would be quite inefficient.

Regards,
Gora
Re: copyField at search time / multi-language support
Tom, Could you share the method you use to perform language detection? Any open source tools that do that? Thanks.

--- On Mon, 3/28/11, Tom Mortimer t...@flax.co.uk wrote:
> [...]
Re: copyField at search time / multi-language support
https://issues.apache.org/jira/browse/SOLR-1979

> Tom, Could you share the method you use to perform language detection?
> Any open source tools that do that? Thanks.
> [...]
Re: copyField at search time / multi-language support
Thanks Markus. Do you know if this patch is good enough for production use? Thanks.

Andy

--- On Tue, 3/29/11, Markus Jelsma markus.jel...@openindex.io wrote:
> https://issues.apache.org/jira/browse/SOLR-1979
> [...]
Re: copyField at search time / multi-language support
I haven't tried this as an UpdateProcessor, but it relies on Tika, and that LanguageIdentifier works well, except for short texts.

> Thanks Markus. Do you know if this patch is good enough for production
> use? Thanks. Andy
> [...]
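For reference, the SOLR-1979 approach did eventually ship as a Tika-based update processor (in Solr 3.5). A sketch of wiring it into an update chain; the input and output field names here are assumptions:

```xml
<!-- solrconfig.xml sketch: detect the language of title/body at index
     time and write the code into a "language" field (assumed names). -->
<updateRequestProcessorChain name="langid">
  <processor class="solr.TikaLanguageIdentifierUpdateProcessorFactory">
    <str name="langid.fl">title,body</str>
    <str name="langid.langField">language</str>
    <str name="langid.fallback">en</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

As Markus notes above, detection quality drops on short texts, so a fallback language is worth configuring.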
Re: Help on Multi-language support
Go with the one-doc-per-title approach, with different fields for each language (title_name_ch, title_name_es). Your application needs to handle which fields to query and return.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 5. mars 2011, at 00.44, cyang2010 wrote:
> [...]
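Jan's suggestion, sketched as schema fields (the field type names are assumptions; a CJK-style type would back the Chinese field and Latin-script analyzers the others):

```xml
<!-- One document per title; one stored/indexed name field per language. -->
<field name="title_id"      type="string"   indexed="true" stored="true"/>
<field name="title_name_en" type="text_en"  indexed="true" stored="true"/>
<field name="title_name_es" type="text_es"  indexed="true" stored="true"/>
<field name="title_name_ch" type="text_cjk" indexed="true" stored="true"/>
```

The application then queries and returns title_name_es for a Spanish-preferring user, title_name_ch for a Chinese-preferring one, and so on.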
Help on Multi-language support
Hi, I wonder how Solr can satisfy our multi-language requirement. For example, for movie/TV series titles, we require that, based on the user's preferred language, the user is able to get back title names (and actors, directors) in the selected language, for example getTitlesByGenreId. On the other hand, we also support search by title name, actor or director name. Therefore, how shall I design the Solr schema to accommodate the requirement? Here is my current schema, without consideration of the language:

Someone recommended storing a new doc for the same title in each different language. The Solr doc would then have a language field to denote which language the doc is for. I don't think this can work, right? All the title_name values in different languages would share the same index/query analyzer, since they are put into the same field. For different languages (ASCII-based vs. Asian languages), you need different analyzers, right?

The way I see this can work is to add an extra field for each language-related field, such as title_name_ch (Chinese), title_name_es (Spanish), etc. Then my application logic needs to know which language-specific field to query based on the user's language preference. What do you think?

In summary, shall I go with a duplicate doc per title per language, or shall I just change my schema to accommodate those additional language-related fields?

Thanks. Your help is appreciated.
Re: Help on Multi-language support
This is the solr schema: -- View this message in context: http://lucene.472066.n3.nabble.com/Help-on-Multi-language-support-tp2636054p2636065.html Sent from the Solr - User mailing list archive at Nabble.com.
slovene language support
Hi, I want to set up Solr with support for several languages. The language list includes Slovene; unfortunately, I found nothing about it in the wiki. Does anyone have experience with Solr 1.4 and Slovene? Thanks for the help, Markus
Re: slovene language support
Hello, There is some information here (a prototype stemmer) about support in Snowball, but Martin Porter had some unanswered questions/reservations, so nothing ever got added to Snowball: http://snowball.tartarus.org/archives/snowball-discuss/0725.html Of course you could take that stemmer, generate Java code with the Snowball code generator, and use it, but it seems like it would be best for those issues to get resolved and the stemmer fixed/included in Snowball itself... On Mon, Jul 19, 2010 at 10:42 AM, Markus Goldbach markus.goldb...@gmail.com wrote: Hi, I want to set up Solr with support for several languages. The language list includes Slovene; unfortunately, I found nothing about it in the wiki. Does anyone have experience with Solr 1.4 and Slovene? Thanks for the help, Markus -- Robert Muir rcm...@gmail.com
Polish language support?
In IRC trying to help someone find Polish-language support for Solr. Seems lucene has nothing to offer? Found one stemmer that looks to be compatibly licensed in case someone wants to take a shot at incorporating it: http://www.getopt.org/stempel/ -Peter -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: Polish language support?
Hi Peter, this stemmer is integrated into trunk and 3x. http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/stempel/ http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/lucene/contrib/analyzers/stempel/ On Fri, Jul 9, 2010 at 2:38 PM, Peter Wolanin peter.wola...@acquia.comwrote: In IRC trying to help someone find Polish-language support for Solr. Seems lucene has nothing to offer? Found one stemmer that looks to be compatibly licensed in case someone wants to take a shot at incorporating it: http://www.getopt.org/stempel/ -Peter -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com -- Robert Muir rcm...@gmail.com
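With Stempel integrated into trunk/3x as noted above, a Polish field type might be sketched as follows. This is a hedged example: it assumes the Stempel contrib jar is on Solr's classpath, and the factory name solr.StempelPolishStemFilterFactory is from the 3.x analysis contrib, not from this thread.

```xml
<!-- Sketch: Polish analysis using the integrated Stempel stemmer. -->
<!-- Assumes the stempel contrib jar is on the classpath; the factory
     name is an assumption based on the 3.x contrib, not this thread. -->
<fieldType name="text_pl" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StempelPolishStemFilterFactory"/>
  </analyzer>
</fieldType>
```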
Re: Hindi language support in solr
Hi Robert, Thanks for the reply. As you wrote, I used textgen, but I am still not able to search Hindi text. I might be missing some important configuration. Following is my schema.xml configuration:

<fieldType name="textgen" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="cat_name" type="textgen" indexed="true" stored="true"/>
  <field name="title" type="textgen" indexed="true" stored="true"/>
  <field name="summary" type="textgen" indexed="true" stored="true"/>
  <field name="textgen" type="textgen" indexed="true" stored="false" multiValued="true"/>
</fields>

<uniqueKey>id</uniqueKey>
<defaultSearchField>textgen</defaultSearchField>
<solrQueryParser defaultOperator="OR"/>

<copyField source="title" dest="textgen"/>
<copyField source="cat_name" dest="textgen"/>
<copyField source="summary" dest="textgen"/>

In the summary field there are Hindi keywords. Please help..
thanks with regards Ranveer K Kumar On Thu, Jan 21, 2010 at 11:25 PM, Robert Muir rcm...@gmail.com wrote: hello, take a look at field type textgen (a general unstemmed text field) the whitespacetokenizer + worddelimiterfilter used by this type will work correctly for hindi tokenization and punctuation. On Thu, Jan 21, 2010 at 10:55 AM, Ranveer kumar ranveer.k.ku...@gmail.com wrote: Hi all, I am very new in solr. I download latest release 1.4 and install. For Indexing and Searching I am using SolrJ api. My Question is How to enable solr to search hindi language text ?. Please Help me.. thanks with regards Ranveer K Kumar -- Robert Muir rcm...@gmail.com
Hindi language support in solr
Hi all, I am very new to Solr. I downloaded the latest release, 1.4, and installed it. For indexing and searching I am using the SolrJ API. My question is: how do I enable Solr to search Hindi-language text? Please help me. Thanks with regards, Ranveer K Kumar
Re: Hindi language support in solr
hello, take a look at field type textgen (a general unstemmed text field) the whitespacetokenizer + worddelimiterfilter used by this type will work correctly for hindi tokenization and punctuation. On Thu, Jan 21, 2010 at 10:55 AM, Ranveer kumar ranveer.k.ku...@gmail.com wrote: Hi all, I am very new in solr. I download latest release 1.4 and install. For Indexing and Searching I am using SolrJ api. My Question is How to enable solr to search hindi language text ?. Please Help me.. thanks with regards Ranveer K Kumar -- Robert Muir rcm...@gmail.com
Re: Multi language support
right, but we should not encourage users to significantly degrade overall relevance for all movies due to a few movies and a band (very special cases, as I said). In english, by not using stopwords, it doesn't really degrade relevance that much, so its a reasonable decision to make. This is not true in other languages! Instead, systems that worry about all-stopword queries should use CommonGrams. it will work better for these cases, without taking away from overall relevance. On Wed, Jan 13, 2010 at 1:08 AM, Walter Underwood wun...@wunderwood.org wrote: There is a band named The The. And a producer named Don Was. For a list of all-stopword movie titles at Netflix, see this post: http://wunderwood.org/most_casual_observer/2007/05/invisible_titles.html My favorite is To Be and To Have (Être et Avoir), which is all stopwords in two languages. And a very good movie. wunder On Jan 12, 2010, at 6:55 PM, Robert Muir wrote: sorry, i forgot to include this 2009 paper comparing what stopwords do across 3 languages: http://doc.rero.ch/lm.php?url=1000,43,4,20091218142456-GY/Dolamic_Ljiljana_-_When_Stopword_Lists_Make_the_Difference_20091218.pdf in my opinion, if stopwords annoy your users for very special cases like 'the the' then, instead consider using commongrams + defaultsimilarity.discountOverlaps = true so that you still get the benefits. as you can see from the above paper, they can be extremely important depending on the language, they just don't matter so much for English. On Tue, Jan 12, 2010 at 9:20 PM, Lance Norskog goks...@gmail.com wrote: There are a lot of projects that don't use stopwords any more. You might consider dropping them altogether. On Mon, Jan 11, 2010 at 2:25 PM, Don Werve d...@madwombat.com wrote: This is the way I've implemented multilingual search as well. 2010/1/11 Markus Jelsma mar...@buyways.nl Hello, We have implemented language specific search in Solr using language specific fields and field types. 
For instance, an en_text field type can use an English stemmer and lists of stopwords and synonyms. We, however, did not use language-specific stopwords; instead we used one list shared by both languages. So you would have a field type like: <fieldType name="en_text" class="solr.TextField" ...> <analyzer type="..."> <filter class="solr.StopFilterFactory" words="stopwords.en.txt"/> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.en.txt"/> etc. etc. Cheers, - Markus Jelsma Buyways B.V. Technisch Architect Friesestraatweg 215c http://www.buyways.nl 9743 AD Groningen Alg. 050-853 6600 KvK 01074105 Tel. 050-853 6620 Fax. 050-3118124 Mob. 06-5025 8350 In: http://www.linkedin.com/in/markus17 On Mon, 2010-01-11 at 13:45 +0100, Daniel Persson wrote: Hi Solr users. I'm trying to set up a site with Solr search integrated, and I use the Solr Java API to feed the index with search documents. At the moment I have only activated search on the English portion of the site. I'm interested in using as many features of Solr as possible. Synonyms, stopwords, and stems all sound quite interesting and useful, but how do I set this up in a good way for a multilingual site? The site doesn't have a huge text mass, so performance issues don't really bother me, but still I'd like to hear your suggestions before I try to implement a solution. Best regards Daniel -- Lance Norskog goks...@gmail.com -- Robert Muir rcm...@gmail.com -- Robert Muir rcm...@gmail.com
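The CommonGrams approach Robert recommends above can be sketched as a field type. This is a hedged example: the file name and the surrounding tokenizer/filter choices are placeholders, not from the thread. CommonGrams keeps stopwords in the index but fuses them with their neighbors (e.g. "the the" indexes a token like "the_the"), so all-stopword queries still match without inflating ordinary queries.

```xml
<!-- Sketch: CommonGrams instead of stopword removal.
     "stopwords.txt" and the tokenizer choice are placeholders. -->
<fieldType name="text_cg" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- emits both the original tokens and stopword bigrams -->
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- at query time, keeps only the bigrams where possible -->
    <filter class="solr.CommonGramsQueryFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```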
Re: Multi language support
Isn't the conclusion here that some stopword and stemming free matching should be the best match if ever and to then gently degrade to weaker forms of matching? paul Le 13-janv.-10 à 07:08, Walter Underwood a écrit : There is a band named The The. And a producer named Don Was. For a list of all-stopword movie titles at Netflix, see this post: http://wunderwood.org/most_casual_observer/2007/05/invisible_titles.html My favorite is To Be and To Have (Être et Avoir), which is all stopwords in two languages. And a very good movie. wunder On Jan 12, 2010, at 6:55 PM, Robert Muir wrote: sorry, i forgot to include this 2009 paper comparing what stopwords do across 3 languages: http://doc.rero.ch/lm.php?url=1000,43,4,20091218142456-GY/Dolamic_Ljiljana_-_When_Stopword_Lists_Make_the_Difference_20091218.pdf in my opinion, if stopwords annoy your users for very special cases like 'the the' then, instead consider using commongrams + defaultsimilarity.discountOverlaps = true so that you still get the benefits. as you can see from the above paper, they can be extremely important depending on the language, they just don't matter so much for English. On Tue, Jan 12, 2010 at 9:20 PM, Lance Norskog goks...@gmail.com wrote: There are a lot of projects that don't use stopwords any more. You might consider dropping them altogether. On Mon, Jan 11, 2010 at 2:25 PM, Don Werve d...@madwombat.com wrote: This is the way I've implemented multilingual search as well. 2010/1/11 Markus Jelsma mar...@buyways.nl Hello, We have implemented language specific search in Solr using language specific fields and field types. For instance, an en_text field type can use an English stemmer, and list of stopwords and synonyms. We, however did not use specific stopwords, instead we used one list shared by both languages. So you would have a field type like: fieldType name=en_text class=solr.TextField ... 
analyzer type= filter class=solr.StopFilterFactory words=stopwords.en.txt filter class=solr.SynonymFilterFactory synonyms=synoyms.en.txt etc etc. Cheers, - Markus Jelsma Buyways B.V. Technisch ArchitectFriesestraatweg 215c http://www.buyways.nl 9743 AD Groningen Alg. 050-853 6600 KvK 01074105 Tel. 050-853 6620 Fax. 050-3118124 Mob. 06-5025 8350 In: http://www.linkedin.com/in/markus17 On Mon, 2010-01-11 at 13:45 +0100, Daniel Persson wrote: Hi Solr users. I'm trying to set up a site with Solr search integrated. And I use the SolJava API to feed the index with search documents. At the moment I have only activated search on the English portion of the site. I'm interested in using as many features of solr as possible. Synonyms, Stopwords and stems all sounds quite interesting and useful but how do I set up this in a good way for a multilingual site? The site don't have a huge text mass so performance issues don't really bother me but still I'd like to hear your suggestions before I try to implement an solution. Best regards Daniel -- Lance Norskog goks...@gmail.com -- Robert Muir rcm...@gmail.com
Re: Multi language support
Robert Muir: Thank you for the pointer to that paper! On Wed, Jan 13, 2010 at 6:29 AM, Paul Libbrecht p...@activemath.org wrote: Isn't the conclusion here that some stopword and stemming free matching should be the best match if ever and to then gently degrade to weaker forms of matching? paul Le 13-janv.-10 à 07:08, Walter Underwood a écrit : There is a band named The The. And a producer named Don Was. For a list of all-stopword movie titles at Netflix, see this post: http://wunderwood.org/most_casual_observer/2007/05/invisible_titles.html My favorite is To Be and To Have (Être et Avoir), which is all stopwords in two languages. And a very good movie. wunder On Jan 12, 2010, at 6:55 PM, Robert Muir wrote: sorry, i forgot to include this 2009 paper comparing what stopwords do across 3 languages: http://doc.rero.ch/lm.php?url=1000,43,4,20091218142456-GY/Dolamic_Ljiljana_-_When_Stopword_Lists_Make_the_Difference_20091218.pdf in my opinion, if stopwords annoy your users for very special cases like 'the the' then, instead consider using commongrams + defaultsimilarity.discountOverlaps = true so that you still get the benefits. as you can see from the above paper, they can be extremely important depending on the language, they just don't matter so much for English. On Tue, Jan 12, 2010 at 9:20 PM, Lance Norskog goks...@gmail.com wrote: There are a lot of projects that don't use stopwords any more. You might consider dropping them altogether. On Mon, Jan 11, 2010 at 2:25 PM, Don Werve d...@madwombat.com wrote: This is the way I've implemented multilingual search as well. 2010/1/11 Markus Jelsma mar...@buyways.nl Hello, We have implemented language specific search in Solr using language specific fields and field types. For instance, an en_text field type can use an English stemmer, and list of stopwords and synonyms. We, however did not use specific stopwords, instead we used one list shared by both languages. 
So you would have a field type like: fieldType name=en_text class=solr.TextField ... analyzer type= filter class=solr.StopFilterFactory words=stopwords.en.txt filter class=solr.SynonymFilterFactory synonyms=synoyms.en.txt etc etc. Cheers, - Markus Jelsma Buyways B.V. Technisch Architect Friesestraatweg 215c http://www.buyways.nl 9743 AD Groningen Alg. 050-853 6600 KvK 01074105 Tel. 050-853 6620 Fax. 050-3118124 Mob. 06-5025 8350 In: http://www.linkedin.com/in/markus17 On Mon, 2010-01-11 at 13:45 +0100, Daniel Persson wrote: Hi Solr users. I'm trying to set up a site with Solr search integrated. And I use the SolJava API to feed the index with search documents. At the moment I have only activated search on the English portion of the site. I'm interested in using as many features of solr as possible. Synonyms, Stopwords and stems all sounds quite interesting and useful but how do I set up this in a good way for a multilingual site? The site don't have a huge text mass so performance issues don't really bother me but still I'd like to hear your suggestions before I try to implement an solution. Best regards Daniel -- Lance Norskog goks...@gmail.com -- Robert Muir rcm...@gmail.com -- Lance Norskog goks...@gmail.com
Re: Multi language support
There are a lot of projects that don't use stopwords any more. You might consider dropping them altogether. On Mon, Jan 11, 2010 at 2:25 PM, Don Werve d...@madwombat.com wrote: This is the way I've implemented multilingual search as well. 2010/1/11 Markus Jelsma mar...@buyways.nl Hello, We have implemented language-specific search in Solr using language-specific fields and field types. For instance, an en_text field type can use an English stemmer and lists of stopwords and synonyms. We, however, did not use language-specific stopwords; instead we used one list shared by both languages. So you would have a field type like: <fieldType name="en_text" class="solr.TextField" ...> <analyzer type="..."> <filter class="solr.StopFilterFactory" words="stopwords.en.txt"/> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.en.txt"/> etc. etc. Cheers, - Markus Jelsma Buyways B.V. Technisch Architect Friesestraatweg 215c http://www.buyways.nl 9743 AD Groningen Alg. 050-853 6600 KvK 01074105 Tel. 050-853 6620 Fax. 050-3118124 Mob. 06-5025 8350 In: http://www.linkedin.com/in/markus17 On Mon, 2010-01-11 at 13:45 +0100, Daniel Persson wrote: Hi Solr users. I'm trying to set up a site with Solr search integrated, and I use the Solr Java API to feed the index with search documents. At the moment I have only activated search on the English portion of the site. I'm interested in using as many features of Solr as possible. Synonyms, stopwords, and stems all sound quite interesting and useful, but how do I set this up in a good way for a multilingual site? The site doesn't have a huge text mass, so performance issues don't really bother me, but still I'd like to hear your suggestions before I try to implement a solution. Best regards Daniel -- Lance Norskog goks...@gmail.com
Re: Multi language support
I don't think this is something to consider across the board for all languages. The same grammatical units that are part of a word in one language (and removed by stemmers) are independent morphemes in others (and should be stopwords) so please take this advice on a case-by-case basis for each language. On Tue, Jan 12, 2010 at 9:20 PM, Lance Norskog goks...@gmail.com wrote: There are a lot of projects that don't use stopwords any more. You might consider dropping them altogether. On Mon, Jan 11, 2010 at 2:25 PM, Don Werve d...@madwombat.com wrote: This is the way I've implemented multilingual search as well. 2010/1/11 Markus Jelsma mar...@buyways.nl Hello, We have implemented language specific search in Solr using language specific fields and field types. For instance, an en_text field type can use an English stemmer, and list of stopwords and synonyms. We, however did not use specific stopwords, instead we used one list shared by both languages. So you would have a field type like: fieldType name=en_text class=solr.TextField ... analyzer type= filter class=solr.StopFilterFactory words=stopwords.en.txt filter class=solr.SynonymFilterFactory synonyms=synoyms.en.txt etc etc. Cheers, - Markus Jelsma Buyways B.V. Technisch Architect Friesestraatweg 215c http://www.buyways.nl 9743 AD Groningen Alg. 050-853 6600 KvK 01074105 Tel. 050-853 6620 Fax. 050-3118124 Mob. 06-5025 8350 In: http://www.linkedin.com/in/markus17 On Mon, 2010-01-11 at 13:45 +0100, Daniel Persson wrote: Hi Solr users. I'm trying to set up a site with Solr search integrated. And I use the SolJava API to feed the index with search documents. At the moment I have only activated search on the English portion of the site. I'm interested in using as many features of solr as possible. Synonyms, Stopwords and stems all sounds quite interesting and useful but how do I set up this in a good way for a multilingual site? 
The site don't have a huge text mass so performance issues don't really bother me but still I'd like to hear your suggestions before I try to implement an solution. Best regards Daniel -- Lance Norskog goks...@gmail.com -- Robert Muir rcm...@gmail.com
Re: Multi language support
sorry, i forgot to include this 2009 paper comparing what stopwords do across 3 languages: http://doc.rero.ch/lm.php?url=1000,43,4,20091218142456-GY/Dolamic_Ljiljana_-_When_Stopword_Lists_Make_the_Difference_20091218.pdf in my opinion, if stopwords annoy your users for very special cases like 'the the' then, instead consider using commongrams + defaultsimilarity.discountOverlaps = true so that you still get the benefits. as you can see from the above paper, they can be extremely important depending on the language, they just don't matter so much for English. On Tue, Jan 12, 2010 at 9:20 PM, Lance Norskog goks...@gmail.com wrote: There are a lot of projects that don't use stopwords any more. You might consider dropping them altogether. On Mon, Jan 11, 2010 at 2:25 PM, Don Werve d...@madwombat.com wrote: This is the way I've implemented multilingual search as well. 2010/1/11 Markus Jelsma mar...@buyways.nl Hello, We have implemented language specific search in Solr using language specific fields and field types. For instance, an en_text field type can use an English stemmer, and list of stopwords and synonyms. We, however did not use specific stopwords, instead we used one list shared by both languages. So you would have a field type like: fieldType name=en_text class=solr.TextField ... analyzer type= filter class=solr.StopFilterFactory words=stopwords.en.txt filter class=solr.SynonymFilterFactory synonyms=synoyms.en.txt etc etc. Cheers, - Markus Jelsma Buyways B.V. Technisch Architect Friesestraatweg 215c http://www.buyways.nl 9743 AD Groningen Alg. 050-853 6600 KvK 01074105 Tel. 050-853 6620 Fax. 050-3118124 Mob. 06-5025 8350 In: http://www.linkedin.com/in/markus17 On Mon, 2010-01-11 at 13:45 +0100, Daniel Persson wrote: Hi Solr users. I'm trying to set up a site with Solr search integrated. And I use the SolJava API to feed the index with search documents. At the moment I have only activated search on the English portion of the site. 
I'm interested in using as many features of solr as possible. Synonyms, Stopwords and stems all sounds quite interesting and useful but how do I set up this in a good way for a multilingual site? The site don't have a huge text mass so performance issues don't really bother me but still I'd like to hear your suggestions before I try to implement an solution. Best regards Daniel -- Lance Norskog goks...@gmail.com -- Robert Muir rcm...@gmail.com
Re: Multi language support
There is a band named The The. And a producer named Don Was. For a list of all-stopword movie titles at Netflix, see this post: http://wunderwood.org/most_casual_observer/2007/05/invisible_titles.html My favorite is To Be and To Have (Être et Avoir), which is all stopwords in two languages. And a very good movie. wunder On Jan 12, 2010, at 6:55 PM, Robert Muir wrote: sorry, i forgot to include this 2009 paper comparing what stopwords do across 3 languages: http://doc.rero.ch/lm.php?url=1000,43,4,20091218142456-GY/Dolamic_Ljiljana_-_When_Stopword_Lists_Make_the_Difference_20091218.pdf in my opinion, if stopwords annoy your users for very special cases like 'the the' then, instead consider using commongrams + defaultsimilarity.discountOverlaps = true so that you still get the benefits. as you can see from the above paper, they can be extremely important depending on the language, they just don't matter so much for English. On Tue, Jan 12, 2010 at 9:20 PM, Lance Norskog goks...@gmail.com wrote: There are a lot of projects that don't use stopwords any more. You might consider dropping them altogether. On Mon, Jan 11, 2010 at 2:25 PM, Don Werve d...@madwombat.com wrote: This is the way I've implemented multilingual search as well. 2010/1/11 Markus Jelsma mar...@buyways.nl Hello, We have implemented language specific search in Solr using language specific fields and field types. For instance, an en_text field type can use an English stemmer, and list of stopwords and synonyms. We, however did not use specific stopwords, instead we used one list shared by both languages. So you would have a field type like: fieldType name=en_text class=solr.TextField ... analyzer type= filter class=solr.StopFilterFactory words=stopwords.en.txt filter class=solr.SynonymFilterFactory synonyms=synoyms.en.txt etc etc. Cheers, - Markus Jelsma Buyways B.V. Technisch ArchitectFriesestraatweg 215c http://www.buyways.nl 9743 AD Groningen Alg. 050-853 6600 KvK 01074105 Tel. 050-853 6620 Fax. 
050-3118124 Mob. 06-5025 8350 In: http://www.linkedin.com/in/markus17 On Mon, 2010-01-11 at 13:45 +0100, Daniel Persson wrote: Hi Solr users. I'm trying to set up a site with Solr search integrated. And I use the SolJava API to feed the index with search documents. At the moment I have only activated search on the English portion of the site. I'm interested in using as many features of solr as possible. Synonyms, Stopwords and stems all sounds quite interesting and useful but how do I set up this in a good way for a multilingual site? The site don't have a huge text mass so performance issues don't really bother me but still I'd like to hear your suggestions before I try to implement an solution. Best regards Daniel -- Lance Norskog goks...@gmail.com -- Robert Muir rcm...@gmail.com
Multi language support
Hi Solr users. I'm trying to set up a site with Solr search integrated, and I use the Solr Java API to feed the index with search documents. At the moment I have only activated search on the English portion of the site. I'm interested in using as many features of Solr as possible. Synonyms, stopwords, and stems all sound quite interesting and useful, but how do I set this up in a good way for a multilingual site? The site doesn't have a huge text mass, so performance issues don't really bother me, but still I'd like to hear your suggestions before I try to implement a solution. Best regards Daniel
Re: Multi language support
Hello, We have implemented language-specific search in Solr using language-specific fields and field types. For instance, an en_text field type can use an English stemmer and lists of stopwords and synonyms. We, however, did not use language-specific stopwords; instead we used one list shared by both languages. So you would have a field type like: <fieldType name="en_text" class="solr.TextField" ...> <analyzer type="..."> <filter class="solr.StopFilterFactory" words="stopwords.en.txt"/> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.en.txt"/> etc. etc. Cheers, - Markus Jelsma Buyways B.V. Technisch Architect Friesestraatweg 215c http://www.buyways.nl 9743 AD Groningen Alg. 050-853 6600 KvK 01074105 Tel. 050-853 6620 Fax. 050-3118124 Mob. 06-5025 8350 In: http://www.linkedin.com/in/markus17 On Mon, 2010-01-11 at 13:45 +0100, Daniel Persson wrote: Hi Solr users. I'm trying to set up a site with Solr search integrated, and I use the Solr Java API to feed the index with search documents. At the moment I have only activated search on the English portion of the site. I'm interested in using as many features of Solr as possible. Synonyms, stopwords, and stems all sound quite interesting and useful, but how do I set this up in a good way for a multilingual site? The site doesn't have a huge text mass, so performance issues don't really bother me, but still I'd like to hear your suggestions before I try to implement a solution. Best regards Daniel
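Filled out, the en_text type Markus describes might look like the following. The stopword and synonym file names follow his message; the tokenizer, the Snowball stemmer choice, and the filter ordering are assumptions for the sketch.

```xml
<!-- Sketch of the en_text type from the message above.
     Tokenizer, stemmer, and filter order are assumptions. -->
<fieldType name="en_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.en.txt" ignoreCase="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.en.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
```

A sibling type (e.g. nl_text) would swap in the other language's stemmer while, per the message, sharing the same stopword list.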
Re: Multi language support
This is the way I've implemented multilingual search as well. 2010/1/11 Markus Jelsma mar...@buyways.nl Hello, We have implemented language-specific search in Solr using language-specific fields and field types. For instance, an en_text field type can use an English stemmer and lists of stopwords and synonyms. We, however, did not use language-specific stopwords; instead we used one list shared by both languages. So you would have a field type like: <fieldType name="en_text" class="solr.TextField" ...> <analyzer type="..."> <filter class="solr.StopFilterFactory" words="stopwords.en.txt"/> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.en.txt"/> etc. etc. Cheers, - Markus Jelsma Buyways B.V. Technisch Architect Friesestraatweg 215c http://www.buyways.nl 9743 AD Groningen Alg. 050-853 6600 KvK 01074105 Tel. 050-853 6620 Fax. 050-3118124 Mob. 06-5025 8350 In: http://www.linkedin.com/in/markus17 On Mon, 2010-01-11 at 13:45 +0100, Daniel Persson wrote: Hi Solr users. I'm trying to set up a site with Solr search integrated, and I use the Solr Java API to feed the index with search documents. At the moment I have only activated search on the English portion of the site. I'm interested in using as many features of Solr as possible. Synonyms, stopwords, and stems all sound quite interesting and useful, but how do I set this up in a good way for a multilingual site? The site doesn't have a huge text mass, so performance issues don't really bother me, but still I'd like to hear your suggestions before I try to implement a solution. Best regards Daniel
Re: Multi-language support
On Apr 9, 2009, at 7:09 AM, revas wrote: Hi, To reframe my earlier question Some languages have just analyzers only but nostemmer from snowball porter,then does the analyzer take care of stemming as well? Some languages only have the stemmer from snowball but no analyzer? Some have both. Can we say then that solr supports all the above languages .Will search be same across all the above cases? I just responded to the earlier question, but it didn't contain this question. No, I wouldn't say that search would be the same. Stemmed vs. non-stemmed may result in different results, just as one stemmer implementation results will differ from a different stemming approach. -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
Multi-language support
Hi, To reframe my earlier question: Some languages have just analyzers but no stemmer from Snowball/Porter; does the analyzer take care of stemming as well? Some languages only have the stemmer from Snowball but no analyzer. Some have both. Can we say, then, that Solr supports all the above languages? Will search be the same across all the above cases? thanks revas
Re: Multiple language support
Hi, The problem is that a single document (and even a field in your case) is multilingual. Ideally you'd detect different languages within a document and apply a different tokenizer/filter to different parts of the field. So the first part would be handled as EN, and the second part as Chinese. At search time you would have to find the language of the query one way or the other, and again apply the appropriate analyzer. If the right analyzer is applied, you could match even this multilingual field. None of the existing Analyzers/tokenizers/filters are capable of handling a single piece of text in multiple languages, so you will have to create a custom analyzer that is smart enough to do that. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Deshpande, Mukta mudes...@ptc.com To: solr-user@lucene.apache.org Sent: Monday, December 29, 2008 4:52:19 AM Subject: Multiple language support Hi All, I have a multiple language supporting schema in which there is a separate field for every language. I have a field product_name to store product name and its description that can be in any user preferred language. This can be stored in fields product_name_EN if user prefers English language, product_name_SCH if user prefers Simplified Chinese language. The WhitespaceTokenizerFactory and filter EnglishPorterFilterFactory are applied on product_name_EN. The CJKAnalyzer and CJKTokenizer are applied on product_name_SCH. e.g. Value can be : ElectrolyticCapacitor - 被对立的电容器以价值220µF Now my problem is: Which field do I store the above value? product_name_EN OR product_name_SCH OR should it be something else? How do I find out which analyzers should get applied for this field. Did any one face a similar situation before. Please help ASAP. Thanks, ~Mukta
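Mukta's two-field setup from the quoted message could be sketched like this; the field-type names are illustrative, while the EnglishPorterFilterFactory and CJK analysis choices follow the message. As Otis notes, the catch is that neither analyzer alone handles a value that mixes English and Chinese in one field.

```xml
<!-- Sketch of the per-language product_name fields described above.
     Type names are hypothetical; analyzers follow the message. -->
<fieldType name="text_en" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"/>
  </analyzer>
</fieldType>
<fieldType name="text_sch" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>

<field name="product_name_EN" type="text_en" indexed="true" stored="true"/>
<field name="product_name_SCH" type="text_sch" indexed="true" stored="true"/>
```

A mixed value like the "ElectrolyticCapacitor - 被对立的电容器..." example would need either splitting by detected language across both fields, or the custom multilingual analyzer Otis describes.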
RE: Language support
You may be interested in a recent discussion that took place on a similar subject: http://www.mail-archive.com/solr-user@lucene.apache.org/msg09332.html Nicolas -----Original message----- From: David King [mailto:[EMAIL PROTECTED]] Sent: Wednesday, March 19, 2008 20:07 To: solr-user@lucene.apache.org Subject: Language support This has probably been asked before, but I'm having trouble finding it. Basically, we want to be able to search for content across several languages, given that we know what language a datum and a query are in. Is there an obvious way to do this? Here's the longer version: I am trying to index content that occurs in multiple languages, including Asian languages. I'm in the process of moving from PyLucene to Solr. In PyLucene, I would have a list of analysers: analyzers = dict(en = pyluc.SnowballAnalyzer("English"), cs = pyluc.CzechAnalyzer(), pt = pyluc.SnowballAnalyzer("Portuguese"), ... Then when I want to index something, I do writer = pyluc.IndexWriter(store, analyzer, create) writer.addDocument(d.doc) That is, I tell Lucene the language of every datum, and the analyser to use when writing out the field. Then when I want to search against it, I do analyzer = LanguageAnalyzer.getanal(lang) q = pyluc.QueryParser(field, analyzer).parse(value) And use that QueryParser to parse the query in the given language before sending it off to PyLucene. (Off-topic: getanal() is perhaps my favourite function name ever.) So the language of a given datum is attached to the datum itself. In Solr, however, this appears to be attached to the field, not to the individual data in it: <fieldType name="text_greek" class="solr.TextField"> <analyzer class="org.apache.lucene.analysis.el.GreekAnalyzer"/> </fieldType> Does this mean there's no way to have a single contents field that has content in multiple languages, and still have the queries be parsed and stemmed correctly? How are other people handling this?
Does it makes sense to write a tokeniser factory and a query factory that look at, say, the 'lang' field and return the correct tokenisers? Does this already exist? The other alternative is to have a text_zh field, a text_en field, etc, and to modify the query to search on that field depending on the language of the query, but that seems kind of hacky to me, especially if a query may be against more than one language. Is this the accepted way to go about it? Is there a benefit to this method over writing a detecting tokeniser factory?
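The second alternative David describes (a text_zh field, a text_en field, etc.) amounts to rewriting the query against language-specific fields. A hedged sketch, assuming illustrative text_<lang> field names that are not any Solr convention:

```python
# Sketch: build a Solr query string against per-language fields.
# The text_<lang> field names are illustrative only.

def multilang_query(term, langs):
    clauses = ["text_%s:(%s)" % (lang, term) for lang in langs]
    return " OR ".join(clauses)

print(multilang_query("boot", ["en"]))        # text_en:(boot)
print(multilang_query("boot", ["en", "de"]))  # text_en:(boot) OR text_de:(boot)
```

The OR over several fields is what makes the "query against more than one language" case workable with this layout, at the cost of cross-language IDF skew discussed later in the thread.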
Re: Language support
> You may be interested in a recent discussion that took place on a similar subject: http://www.mail-archive.com/solr-user@lucene.apache.org/msg09332.html

Interesting, yes. But since it doesn't actually exist, it's not much help. I guess what I'm asking is: if my approach seems convoluted, I'm probably doing it wrong, so how *are* people solving the problem of searching over multiple languages? What is the canonical way to do this?
Re: Language support
Unless you can come up with language-neutral tokenization and stemming, you need to:

a) know the language of each document,
b) run a different analyzer depending on the language,
c) force the user to tell you the language of the query, and
d) run the query through the same analyzer.

On Thu, Mar 20, 2008 at 12:17 PM, David King [EMAIL PROTECTED] wrote:
> Interesting, yes. But since it doesn't actually exist, it's not much help. I guess what I'm asking is: if my approach seems convoluted, I'm probably doing it wrong, so how *are* people solving the problem of searching over multiple languages? What is the canonical way to do this?
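The a)-d) recipe above boils down to symmetric index-time and query-time analysis keyed by language. A sketch under stated assumptions: the "analyzers" here are toy lowercase-and-split callables standing in for real Lucene analyzers, and the point is only that the same one must run on both sides.

```python
# Toy per-language "analyzers"; real code would use Lucene analyzers.
ANALYZERS = {
    "en": lambda text: text.lower().split(),
    "de": lambda text: text.lower().split(),  # would be a German stemmer
}

def analyze_document(text, lang):  # steps a) and b)
    return ANALYZERS[lang](text)

def analyze_query(text, lang):     # steps c) and d)
    return ANALYZERS[lang](text)

# Using the same analyzer on both sides makes index and query terms agree:
assert analyze_document("Boot sales", "en") == analyze_query("boot SALES", "en")
```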
Re: Language support
> Unless you can come up with language-neutral tokenization and stemming, you need to:
> a) know the language of each document.
> b) run a different analyzer depending on the language.
> c) force the user to tell you the language of the query.
> d) run the query through the same analyzer.

I can do all of those. This implies storing all of the different languages in different fields, right? And then changing the default search field to the language of the query for every query?
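Changing the default search field per query need not be a schema change; it can be done per request. A sketch of building the request parameters, assuming `df` (Solr's default-field request parameter, available in later Solr releases) and the same illustrative text_<lang> field names:

```python
# Sketch: per-request default field selection. 'df' is Solr's
# default-field request parameter (later Solr releases); the
# text_<lang> field names are illustrative.

def search_params(query, lang):
    return {"q": query, "df": "text_%s" % lang}

print(search_params("rocket", "en"))  # {'q': 'rocket', 'df': 'text_en'}
```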
Re: Language support
You can store it in one field if you manage to hide a language code with the text. XML is overkill but effective for this. At one point we'd investigated how to allow a Lucene analyzer to see more than one field (the language code as well as the text), but I don't think we came up with anything.

On Thu, Mar 20, 2008 at 12:39 PM, David King [EMAIL PROTECTED] wrote:
> I can do all of those. This implies storing all of the different languages in different fields, right? And then changing the default search field to the language of the query for every query?
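The "hide a language code with the text" idea above can be sketched with a plain prefix rather than XML; the `{lang}` wrapper format here is ad hoc, invented purely for illustration:

```python
def tag_value(lang, text):
    # Store "{de}Boot" in the single field; a custom analyzer could
    # peel the prefix off and pick its tokenizer from it.
    return "{%s}%s" % (lang, text)

def untag_value(value):
    # Split "{de}Boot" back into ("de", "Boot").
    lang, _, text = value[1:].partition("}")
    return lang, text

assert untag_value(tag_value("de", "Boot")) == ("de", "Boot")
```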
Re: Language support
Token-by-token seems a bit extreme. Are you concerned with macaronic documents?

On Thu, Mar 20, 2008 at 12:42 PM, Walter Underwood [EMAIL PROTECTED] wrote:
> Nice list. You may still need to mark the language of each document. There are plenty of cross-language collisions: "die" and "boot" have different meanings in German and English. Proper nouns (LaserJet) may be the same in all languages, a different problem if you are trying to get answers in one language. At one point, I considered using Unicode language tagging on each token to keep it all straight. Effectively, index de/Boot or en/LaserJet.
>
> wunder
Re: Language support
Extreme, but guaranteed to work, and it avoids bad IDF when there are inter-language collisions. In Ultraseek, we only stored the hash, so the size of the source token didn't matter.

Trademarks are a bad source of collisions and anomalous IDF. If you have LaserJet support docs in 20 languages, the term "LaserJet" will have a document frequency 20x higher than the terms in a single language and will score too low.

Ultraseek handles macaronic documents when the script makes it possible; for example, roman text is sent to the English stemmer in a Japanese document, and Hangul always goes to the Korean segmenter/stemmer.

A simpler approach is to tag each document with a language, like lang:de, then use a filter query to restrict the documents to the query language.

Per-token tagging still strikes me as the right approach. It makes all sorts of things work, like keeping fuzzy matches within the same language. We didn't do it in Ultraseek because it would have been an incompatible index change and the benefit didn't justify that.

wunder
==
Walter Underwood
Former Ultraseek Architect
Current Entire Netflix Search Department

On 3/20/08 9:45 AM, Benson Margulies [EMAIL PROTECTED] wrote:
> Token-by-token seems a bit extreme. Are you concerned with macaronic documents?
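Walter's per-token tagging (index de/Boot rather than Boot) can be sketched as a trivial token filter. The lang/token notation follows his example; a real implementation would live inside a custom Lucene token filter rather than a free function:

```python
def tag_tokens(tokens, lang):
    # Prefix each token with its language so "Boot" from a German
    # document can never collide with "boot" from an English one.
    return ["%s/%s" % (lang, tok) for tok in tokens]

print(tag_tokens(["boot", "laserjet"], "de"))  # ['de/boot', 'de/laserjet']
```

The simpler document-level alternative he mentions maps to a Solr filter query such as fq=lang:de sent alongside the user's query.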
Re: Language support
Oh, Walter! Hello! I thought that name was familiar. Greetings from Basis. All that makes sense.

On Thu, Mar 20, 2008 at 1:00 PM, Walter Underwood [EMAIL PROTECTED] wrote:
> Per-token tagging still strikes me as the right approach. It makes all sorts of things work, like keeping fuzzy matches within the same language. We didn't do it in Ultraseek because it would have been an incompatible index change and the benefit didn't justify that.
Language support
This has probably been asked before, but I'm having trouble finding it. Basically, we want to be able to search for content across several languages, given that we know what language a datum and a query are in. Is there an obvious way to do this?

Here's the longer version: I am trying to index content that occurs in multiple languages, including Asian languages. I'm in the process of moving from PyLucene to Solr. In PyLucene, I would have a list of analysers:

    analyzers = dict(en = pyluc.SnowballAnalyzer("English"),
                     cs = pyluc.CzechAnalyzer(),
                     pt = pyluc.SnowballAnalyzer("Portuguese"),
                     ...)

Then when I want to index something, I do

    writer = pyluc.IndexWriter(store, analyzer, create)
    writer.addDocument(d.doc)

That is, I tell Lucene the language of every datum, and the analyser to use when writing out the field. Then when I want to search against it, I do

    analyzer = LanguageAnalyzer.getanal(lang)
    q = pyluc.QueryParser(field, analyzer).parse(value)

and use that QueryParser to parse the query in the given language before sending it off to PyLucene. (Off-topic: getanal() is perhaps my favourite function-name ever.)

So the language of a given datum is attached to the datum itself. In Solr, however, this appears to be attached to the field, not to the individual data in it:

    <fieldType name="text_greek" class="solr.TextField">
      <analyzer class="org.apache.lucene.analysis.el.GreekAnalyzer"/>
    </fieldType>

Does this mean that there's no way to have a single contents field that has content in multiple languages, and still have the queries be parsed and stemmed correctly? How are other people handling this?

Does it make sense to write a tokeniser factory and a query factory that look at, say, the 'lang' field and return the correct tokenisers? Does this already exist? The other alternative is to have a text_zh field, a text_en field, etc., and to modify the query to search on that field depending on the language of the query, but that seems kind of hacky to me, especially if a query may be against more than one language. Is this the accepted way to go about it? Is there a benefit to this method over writing a detecting tokeniser factory?