Re: Basic Multilingual search capability
Hi Tom,

Thanks for your inputs. I was planning to use a stopword filter, but will definitely make sure the stopwords are unique and do not step over each other. I think for our system even going with a length of 50-75 should be fine; I will definitely up that number after doing some analysis on our input. Just one clarification: when you say ICUFilterFactory, am I correct in thinking it's ICUFoldingFilterFactory?

Thanks, Rishi.

-Original Message-
From: Tom Burton-West tburt...@umich.edu
To: solr-user solr-user@lucene.apache.org
Sent: Wed, Feb 25, 2015 4:33 pm
Subject: Re: Basic Multilingual search capability

Hi Rishi,

As others have indicated, multilingual search is very difficult to do well. At HathiTrust we've been using the ICUTokenizer and ICUFilterFactory to deal with having materials in 400 languages. We also added the CJKBigramFilter to get better precision on CJK queries.

We don't use stop words, because stop words in one language are content words in another. For example, "die" in German is a stopword, but it is a content word in English. Putting multiple languages in one index can also affect word frequency statistics, which makes relevance ranking less accurate. So, for example, for the English query "Die Hard", the word "die" would get a low idf score because it occurs so frequently in German.

We realize that our approach does not produce the best results, but given the 400 languages and limited resources, we do our best to make search not suck for non-English languages. When we have the resources, we are thinking about doing special processing for a small fraction of the top 20 languages. We plan to select those languages that most need special processing and are relatively easy to disambiguate from other languages.

If you plan on identifying languages (rather than scripts), you should be aware that most language detection libraries don't work well on short texts such as queries.
If you know that you have scripts for which you have content in only one language, you can use script detection instead of language detection.

If you have German, a filter length of 25 might be too low (because of compounding). You might want to analyze a sample of your German text to find a good length.

Tom
http://www.hathitrust.org/blogs/Large-scale-Search

On Wed, Feb 25, 2015 at 10:31 AM, Rishi Easwaran rishi.easwa...@aol.com wrote:

Hi Alex,

Thanks for the suggestions. These steps will definitely help out with our use case. Thanks for the idea about the LengthFilter to protect our system.

Thanks, Rishi.

-Original Message-
From: Alexandre Rafalovitch arafa...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Tue, Feb 24, 2015 8:50 am
Subject: Re: Basic Multilingual search capability

Given the limited needs, I would probably do something like this:
1) Put a language identifier in the UpdateRequestProcessor chain during indexing and route out at least the known problematic languages, such as Chinese, Japanese, and Arabic, into individual fields
2) Put everything else together into one field with ICUTokenizer, maybe also ICUFoldingFilter
3) At the very end of that joint filter chain, stick in a LengthFilter with some high number, e.g. 25 characters max. This will ensure that super-long words from non-space languages and edge conditions do not break the rest of your system.

Regards, Alex.
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/

On 23 February 2015 at 23:14, Walter Underwood wun...@wunderwood.org wrote:

I understand relevancy, stemming, etc. becomes extremely complicated with multilingual support, but our first goal is to be able to tokenize and provide basic search capability for any language. Ex: when the document contains "hello" or "здравствуйте", the analyzer creates tokens and provides exact-match search results.
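[Editor's note: the HathiTrust-style chain Tom describes (ICU tokenization and folding, CJK bigrams, no stopwords) can be sketched as a Solr fieldType roughly like the one below. The type name "text_multi" is invented for illustration, and the factory classes live in the analysis-extras contrib, so verify availability against your Solr version.]

```xml
<!-- Sketch of a language-insensitive fieldType in the spirit of the
     approach described above. Name "text_multi" is illustrative. -->
<fieldType name="text_multi" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Unicode-aware, script-sensitive tokenization -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Combine adjacent CJK characters into bigrams for better CJK precision -->
    <filter class="solr.CJKBigramFilterFactory"/>
    <!-- Unicode case folding, accent removal, and normalization -->
    <filter class="solr.ICUFoldingFilterFactory"/>
    <!-- Deliberately no stopword filter: a stopword in one language is a
         content word in another (e.g. German "die" vs. English "die") -->
  </analyzer>
</fieldType>
```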
Re: Basic Multilingual search capability
Hi Trey,

Thanks for the detailed response and the link to the talk; it was very informative. Yes, looking at the current system requirements, ICUTokenizer might be the best bet for our use case. The MultiTextField mentioned in the JIRA SOLR-6492 has some cool features, and I am definitely looking forward to trying it out once it's integrated into main.

Thanks, Rishi.

-Original Message-
From: Trey Grainger solrt...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Tue, Feb 24, 2015 1:40 am
Subject: Re: Basic Multilingual search capability

Hi Rishi,

I don't generally recommend a language-insensitive approach except for really simple multilingual use cases (for most of the reasons Walter mentioned), but the ICUTokenizer is probably the best bet you're going to have if you really want to go that route and only need exact match on the tokens that are parsed. It won't work that well for all languages (CJK languages, for example), but it will work fine for many.

It is also possible to handle multilingual content in a more intelligent (i.e. per-language configuration) way in your search index, of course. There are three primary strategies (i.e. ways that actually work in the real world) to do this:
1) Create a separate field for each language and search across all of them at query time
2) Create a separate core per language combination and search across all of them at query time
3) Invoke multiple language-specific analyzers within a single field's analyzer and index/query using one or more of those languages' analyzers for each document/query.

These are listed in ascending order of complexity, and each can be valid based upon your use case. For at least the first and third cases, you can use index-time language detection to map to the appropriate fields/analyzers if you are otherwise unaware of the languages of the content from your application layer.
The third option requires custom code (included in the large Multilingual Search chapter of Solr in Action http://solrinaction.com and soon to be contributed back to Solr via SOLR-6492 https://issues.apache.org/jira/browse/SOLR-6492), but it enables you to index an arbitrarily large number of languages into the same field if needed, while preserving language-specific analysis for each language.

I presented in detail on the above strategies at Lucene/Solr Revolution last November, so you may consider checking out the presentation and/or slides to assess if one of these strategies will work for your use case: http://www.treygrainger.com/posts/presentations/semantic-multilingual-strategies-in-lucenesolr/

For the record, I'd highly recommend going with the first strategy (a separate field per language) if you can, as it is certainly the simplest of the approaches (albeit the one that scales the least well after you add more than a few languages to your queries). If you want to stay simple and stick with the ICUTokenizer, then it will work to a point, but some of the problems Walter mentioned may eventually bite you if you are supporting certain groups of languages.

All the best,

Trey Grainger
Co-author, Solr in Action
Director of Engineering, Search Recommendations @ CareerBuilder

On Mon, Feb 23, 2015 at 11:14 PM, Walter Underwood wun...@wunderwood.org wrote:

It isn’t just complicated, it can be impossible. Do you have content in Chinese or Japanese? Those languages (and some others) do not separate words with spaces. You cannot even do word search without a language-specific, dictionary-based parser. German is space-separated, except many noun compounds are not space-separated. Do you have Finnish content? Entire prepositional phrases turn into word endings. Do you have Arabic content? That is even harder.

If all your content is in space-separated languages that are not heavily inflected, you can kind of do OK with a language-insensitive approach.
But it hits the wall pretty fast. One thing that does work pretty well is trademarked names (LaserJet, Coke, etc.). Those are spelled the same in all languages and usually not inflected.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

On Feb 23, 2015, at 8:00 PM, Rishi Easwaran rishi.easwa...@aol.com wrote:

Hi Alex,

There is no specific language list. For example: the documents that need to be indexed are emails or any messages for a global customer base. The messages back and forth could be in any language or a mix of languages.

I understand relevancy, stemming, etc. becomes extremely complicated with multilingual support, but our first goal is to be able to tokenize and provide basic search capability for any language. Ex: when the document contains "hello" or "здравствуйте", the analyzer creates tokens and provides exact-match search results. Now it would be great if it had the capability to tokenize email addresses (ex: he...@aol.com - I think StandardTokenizer already does this), filenames (здравствуйте.pdf
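[Editor's note: Trey's first strategy above (a separate field per language, searched together at query time) can be sketched as language-specific fields plus an eDisMax handler whose qf spans all of them. All field, type, and handler names below are invented for illustration.]

```xml
<!-- schema.xml: one analyzed field per language (names are illustrative) -->
<field name="title_en" type="text_en" indexed="true" stored="true"/>
<field name="title_de" type="text_de" indexed="true" stored="true"/>
<field name="title_zh" type="text_zh" indexed="true" stored="true"/>

<!-- solrconfig.xml: query across all the language fields at once -->
<requestHandler name="/select_multi" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">title_en title_de title_zh</str>
  </lst>
</requestHandler>
```

As Trey notes, this is the simplest of the three strategies but the qf list (and query cost) grows with every language added.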
Re: Basic Multilingual search capability
Hi Alex,

Thanks for the suggestions. These steps will definitely help out with our use case. Thanks for the idea about the LengthFilter to protect our system.

Thanks, Rishi.

-Original Message-
From: Alexandre Rafalovitch arafa...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Tue, Feb 24, 2015 8:50 am
Subject: Re: Basic Multilingual search capability

Given the limited needs, I would probably do something like this:
1) Put a language identifier in the UpdateRequestProcessor chain during indexing and route out at least the known problematic languages, such as Chinese, Japanese, and Arabic, into individual fields
2) Put everything else together into one field with ICUTokenizer, maybe also ICUFoldingFilter
3) At the very end of that joint filter chain, stick in a LengthFilter with some high number, e.g. 25 characters max. This will ensure that super-long words from non-space languages and edge conditions do not break the rest of your system.

Regards, Alex.
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/

On 23 February 2015 at 23:14, Walter Underwood wun...@wunderwood.org wrote:

I understand relevancy, stemming, etc. becomes extremely complicated with multilingual support, but our first goal is to be able to tokenize and provide basic search capability for any language. Ex: when the document contains "hello" or "здравствуйте", the analyzer creates tokens and provides exact-match search results.
Re: Basic Multilingual search capability
Hi Rishi,

As others have indicated, multilingual search is very difficult to do well. At HathiTrust we've been using the ICUTokenizer and ICUFilterFactory to deal with having materials in 400 languages. We also added the CJKBigramFilter to get better precision on CJK queries.

We don't use stop words, because stop words in one language are content words in another. For example, "die" in German is a stopword, but it is a content word in English. Putting multiple languages in one index can also affect word frequency statistics, which makes relevance ranking less accurate. So, for example, for the English query "Die Hard", the word "die" would get a low idf score because it occurs so frequently in German.

We realize that our approach does not produce the best results, but given the 400 languages and limited resources, we do our best to make search not suck for non-English languages. When we have the resources, we are thinking about doing special processing for a small fraction of the top 20 languages. We plan to select those languages that most need special processing and are relatively easy to disambiguate from other languages.

If you plan on identifying languages (rather than scripts), you should be aware that most language detection libraries don't work well on short texts such as queries. If you know that you have scripts for which you have content in only one language, you can use script detection instead of language detection.

If you have German, a filter length of 25 might be too low (because of compounding). You might want to analyze a sample of your German text to find a good length.

Tom
http://www.hathitrust.org/blogs/Large-scale-Search

On Wed, Feb 25, 2015 at 10:31 AM, Rishi Easwaran rishi.easwa...@aol.com wrote:

Hi Alex,

Thanks for the suggestions. These steps will definitely help out with our use case. Thanks for the idea about the LengthFilter to protect our system.

Thanks, Rishi.
-Original Message-
From: Alexandre Rafalovitch arafa...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Tue, Feb 24, 2015 8:50 am
Subject: Re: Basic Multilingual search capability

Given the limited needs, I would probably do something like this:
1) Put a language identifier in the UpdateRequestProcessor chain during indexing and route out at least the known problematic languages, such as Chinese, Japanese, and Arabic, into individual fields
2) Put everything else together into one field with ICUTokenizer, maybe also ICUFoldingFilter
3) At the very end of that joint filter chain, stick in a LengthFilter with some high number, e.g. 25 characters max. This will ensure that super-long words from non-space languages and edge conditions do not break the rest of your system.

Regards, Alex.
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/

On 23 February 2015 at 23:14, Walter Underwood wun...@wunderwood.org wrote:

I understand relevancy, stemming, etc. becomes extremely complicated with multilingual support, but our first goal is to be able to tokenize and provide basic search capability for any language. Ex: when the document contains "hello" or "здравствуйте", the analyzer creates tokens and provides exact-match search results.
Re: Basic Multilingual search capability
Given the limited needs, I would probably do something like this:
1) Put a language identifier in the UpdateRequestProcessor chain during indexing and route out at least the known problematic languages, such as Chinese, Japanese, and Arabic, into individual fields
2) Put everything else together into one field with ICUTokenizer, maybe also ICUFoldingFilter
3) At the very end of that joint filter chain, stick in a LengthFilter with some high number, e.g. 25 characters max. This will ensure that super-long words from non-space languages and edge conditions do not break the rest of your system.

Regards, Alex.
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/

On 23 February 2015 at 23:14, Walter Underwood wun...@wunderwood.org wrote:

I understand relevancy, stemming, etc. becomes extremely complicated with multilingual support, but our first goal is to be able to tokenize and provide basic search capability for any language. Ex: when the document contains "hello" or "здравствуйте", the analyzer creates tokens and provides exact-match search results.
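[Editor's note: the three steps above can be sketched in configuration roughly as follows. The field and chain names are invented, and the langid parameters should be checked against the Solr reference guide for your version; the 25-character cap comes from the suggestion itself.]

```xml
<!-- solrconfig.xml: language detection in the update chain (step 1).
     Routes known-problematic languages into their own fields; all
     field names here are illustrative. -->
<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
    <str name="langid.fl">text</str>
    <str name="langid.langField">language_s</str>
    <!-- Only split out the languages that need special handling -->
    <str name="langid.whitelist">zh,ja,ar</str>
    <str name="langid.map">true</str>
    <str name="langid.fallback">general</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<!-- schema.xml: the catch-all field type for everything else (steps 2 and 3) -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <!-- Drop pathological super-long tokens (step 3) -->
    <filter class="solr.LengthFilterFactory" min="1" max="25"/>
  </analyzer>
</fieldType>
```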
Basic Multilingual search capability
Hi All,

For our use case we don't really need to do a lot of manipulation of incoming text during index time. At most, removal of common stop words, and tokenizing emails/filenames etc. if possible. We get text documents from our end users, which can be in any language (sometimes a combination), and we cannot determine the language of the incoming text. Language detection at index time is not necessary.

Which analyzer is recommended to achieve basic multilingual search capability for a use case like this? I have read a bunch of posts about using a combination of StandardTokenizer or ICUTokenizer, LowerCaseFilter, and ReversedWildcardFilterFactory, but am looking for ideas, suggestions, and best practices.

http://lucene.472066.n3.nabble.com/ICUTokenizer-or-StandardTokenizer-or-for-quot-text-all-quot-type-field-that-might-include-non-whitess-td4142727.html#a4144236
http://lucene.472066.n3.nabble.com/How-to-implement-multilingual-word-components-fields-schema-td4157140.html#a4158923
https://issues.apache.org/jira/browse/SOLR-6492

Thanks, Rishi.
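[Editor's note: the combination mentioned above (ICUTokenizer, lowercase filter, reversed-wildcard filter) would look roughly like the sketch below. The type name "text_all" is invented; note that ReversedWildcardFilterFactory belongs only in the index-time analyzer, where it adds reversed copies of tokens so that leading-wildcard queries run fast.]

```xml
<!-- Illustrative index/query analyzer pair for a catch-all text field -->
<fieldType name="text_all" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Index reversed tokens alongside originals for leading wildcards -->
    <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
            maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
  </analyzer>
  <analyzer type="query">
    <!-- No reversal at query time; the query parser handles wildcards -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```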
Re: Basic Multilingual search capability
Which languages are you expecting to deal with? Multilingual support is a complex issue. Even if you think you don't need much, it is usually a lot more complex than expected, especially around relevancy.

Regards, Alex.
Sign up for my Solr resources newsletter at http://www.solr-start.com/

On 23 February 2015 at 16:19, Rishi Easwaran rishi.easwa...@aol.com wrote:

Hi All,

For our use case we don't really need to do a lot of manipulation of incoming text during index time. At most, removal of common stop words, and tokenizing emails/filenames etc. if possible. We get text documents from our end users, which can be in any language (sometimes a combination), and we cannot determine the language of the incoming text. Language detection at index time is not necessary.

Which analyzer is recommended to achieve basic multilingual search capability for a use case like this? I have read a bunch of posts about using a combination of StandardTokenizer or ICUTokenizer, LowerCaseFilter, and ReversedWildcardFilterFactory, but am looking for ideas, suggestions, and best practices.

http://lucene.472066.n3.nabble.com/ICUTokenizer-or-StandardTokenizer-or-for-quot-text-all-quot-type-field-that-might-include-non-whitess-td4142727.html#a4144236
http://lucene.472066.n3.nabble.com/How-to-implement-multilingual-word-components-fields-schema-td4157140.html#a4158923
https://issues.apache.org/jira/browse/SOLR-6492

Thanks, Rishi.
Re: Basic Multilingual search capability
It isn’t just complicated, it can be impossible. Do you have content in Chinese or Japanese? Those languages (and some others) do not separate words with spaces. You cannot even do word search without a language-specific, dictionary-based parser. German is space-separated, except many noun compounds are not space-separated. Do you have Finnish content? Entire prepositional phrases turn into word endings. Do you have Arabic content? That is even harder.

If all your content is in space-separated languages that are not heavily inflected, you can kind of do OK with a language-insensitive approach. But it hits the wall pretty fast. One thing that does work pretty well is trademarked names (LaserJet, Coke, etc.). Those are spelled the same in all languages and usually not inflected.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

On Feb 23, 2015, at 8:00 PM, Rishi Easwaran rishi.easwa...@aol.com wrote:

Hi Alex,

There is no specific language list. For example: the documents that need to be indexed are emails or any messages for a global customer base. The messages back and forth could be in any language or a mix of languages.

I understand relevancy, stemming, etc. becomes extremely complicated with multilingual support, but our first goal is to be able to tokenize and provide basic search capability for any language. Ex: when the document contains "hello" or "здравствуйте", the analyzer creates tokens and provides exact-match search results. Now it would be great if it had the capability to tokenize email addresses (ex: he...@aol.com - I think StandardTokenizer already does this), filenames (здравствуйте.pdf), but maybe we can use filters to accomplish that.

Thanks, Rishi.

-Original Message-
From: Alexandre Rafalovitch arafa...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Mon, Feb 23, 2015 5:49 pm
Subject: Re: Basic Multilingual search capability

Which languages are you expecting to deal with? Multilingual support is a complex issue.
Even if you think you don't need much, it is usually a lot more complex than expected, especially around relevancy.

Regards, Alex.
Sign up for my Solr resources newsletter at http://www.solr-start.com/

On 23 February 2015 at 16:19, Rishi Easwaran rishi.easwa...@aol.com wrote:

Hi All,

For our use case we don't really need to do a lot of manipulation of incoming text during index time. At most, removal of common stop words, and tokenizing emails/filenames etc. if possible. We get text documents from our end users, which can be in any language (sometimes a combination), and we cannot determine the language of the incoming text. Language detection at index time is not necessary.

Which analyzer is recommended to achieve basic multilingual search capability for a use case like this? I have read a bunch of posts about using a combination of StandardTokenizer or ICUTokenizer, LowerCaseFilter, and ReversedWildcardFilterFactory, but am looking for ideas, suggestions, and best practices.

http://lucene.472066.n3.nabble.com/ICUTokenizer-or-StandardTokenizer-or-for-quot-text-all-quot-type-field-that-might-include-non-whitess-td4142727.html#a4144236
http://lucene.472066.n3.nabble.com/How-to-implement-multilingual-word-components-fields-schema-td4157140.html#a4158923
https://issues.apache.org/jira/browse/SOLR-6492

Thanks, Rishi.
Re: Basic Multilingual search capability
Hi Alex,

There is no specific language list. For example: the documents that need to be indexed are emails or any messages for a global customer base. The messages back and forth could be in any language or a mix of languages.

I understand relevancy, stemming, etc. becomes extremely complicated with multilingual support, but our first goal is to be able to tokenize and provide basic search capability for any language. Ex: when the document contains "hello" or "здравствуйте", the analyzer creates tokens and provides exact-match search results. Now it would be great if it had the capability to tokenize email addresses (ex: he...@aol.com - I think StandardTokenizer already does this), filenames (здравствуйте.pdf), but maybe we can use filters to accomplish that.

Thanks, Rishi.

-Original Message-
From: Alexandre Rafalovitch arafa...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Mon, Feb 23, 2015 5:49 pm
Subject: Re: Basic Multilingual search capability

Which languages are you expecting to deal with? Multilingual support is a complex issue. Even if you think you don't need much, it is usually a lot more complex than expected, especially around relevancy.

Regards, Alex.
Sign up for my Solr resources newsletter at http://www.solr-start.com/

On 23 February 2015 at 16:19, Rishi Easwaran rishi.easwa...@aol.com wrote:

Hi All,

For our use case we don't really need to do a lot of manipulation of incoming text during index time. At most, removal of common stop words, and tokenizing emails/filenames etc. if possible. We get text documents from our end users, which can be in any language (sometimes a combination), and we cannot determine the language of the incoming text. Language detection at index time is not necessary. Which analyzer is recommended to achieve basic multilingual search capability for a use case like this?
I have read a bunch of posts about using a combination of StandardTokenizer or ICUTokenizer, LowerCaseFilter, and ReversedWildcardFilterFactory, but am looking for ideas, suggestions, and best practices.

http://lucene.472066.n3.nabble.com/ICUTokenizer-or-StandardTokenizer-or-for-quot-text-all-quot-type-field-that-might-include-non-whitess-td4142727.html#a4144236
http://lucene.472066.n3.nabble.com/How-to-implement-multilingual-word-components-fields-schema-td4157140.html#a4158923
https://issues.apache.org/jira/browse/SOLR-6492

Thanks, Rishi.
Re: Basic Multilingual search capability
Hi Wunder,

Yes, we do expect incoming documents to contain Chinese/Japanese/Arabic languages. From what you have mentioned, it looks like we need to auto-detect the incoming content language and tokenize/filter after that. But I thought the ICU tokenizer had the capability to do that (https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-ICUTokenizer): "This tokenizer processes multilingual text and tokenizes it appropriately based on its script attribute." Or am I missing something?

Thanks, Rishi.

-Original Message-
From: Walter Underwood wun...@wunderwood.org
To: solr-user solr-user@lucene.apache.org
Sent: Mon, Feb 23, 2015 11:17 pm
Subject: Re: Basic Multilingual search capability

It isn’t just complicated, it can be impossible. Do you have content in Chinese or Japanese? Those languages (and some others) do not separate words with spaces. You cannot even do word search without a language-specific, dictionary-based parser. German is space-separated, except many noun compounds are not space-separated. Do you have Finnish content? Entire prepositional phrases turn into word endings. Do you have Arabic content? That is even harder.

If all your content is in space-separated languages that are not heavily inflected, you can kind of do OK with a language-insensitive approach. But it hits the wall pretty fast. One thing that does work pretty well is trademarked names (LaserJet, Coke, etc.). Those are spelled the same in all languages and usually not inflected.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

On Feb 23, 2015, at 8:00 PM, Rishi Easwaran rishi.easwa...@aol.com wrote:

Hi Alex,

There is no specific language list. For example: the documents that need to be indexed are emails or any messages for a global customer base. The messages back and forth could be in any language or a mix of languages.
I understand relevancy, stemming, etc. becomes extremely complicated with multilingual support, but our first goal is to be able to tokenize and provide basic search capability for any language. Ex: when the document contains "hello" or "здравствуйте", the analyzer creates tokens and provides exact-match search results. Now it would be great if it had the capability to tokenize email addresses (ex: he...@aol.com - I think StandardTokenizer already does this), filenames (здравствуйте.pdf), but maybe we can use filters to accomplish that.

Thanks, Rishi.

-Original Message-
From: Alexandre Rafalovitch arafa...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Mon, Feb 23, 2015 5:49 pm
Subject: Re: Basic Multilingual search capability

Which languages are you expecting to deal with? Multilingual support is a complex issue. Even if you think you don't need much, it is usually a lot more complex than expected, especially around relevancy.

Regards, Alex.
Sign up for my Solr resources newsletter at http://www.solr-start.com/

On 23 February 2015 at 16:19, Rishi Easwaran rishi.easwa...@aol.com wrote:

Hi All,

For our use case we don't really need to do a lot of manipulation of incoming text during index time. At most, removal of common stop words, and tokenizing emails/filenames etc. if possible. We get text documents from our end users, which can be in any language (sometimes a combination), and we cannot determine the language of the incoming text. Language detection at index time is not necessary. Which analyzer is recommended to achieve basic multilingual search capability for a use case like this? I have read a bunch of posts about using a combination of StandardTokenizer or ICUTokenizer, LowerCaseFilter, and ReversedWildcardFilterFactory, but am looking for ideas, suggestions, and best practices.
http://lucene.472066.n3.nabble.com/ICUTokenizer-or-StandardTokenizer-or-for-quot-text-all-quot-type-field-that-might-include-non-whitess-td4142727.html#a4144236
http://lucene.472066.n3.nabble.com/How-to-implement-multilingual-word-components-fields-schema-td4157140.html#a4158923
https://issues.apache.org/jira/browse/SOLR-6492

Thanks, Rishi.
Re: Basic Multilingual search capability
Hi Rishi,

I don't generally recommend a language-insensitive approach except for really simple multilingual use cases (for most of the reasons Walter mentioned), but the ICUTokenizer is probably the best bet you're going to have if you really want to go that route and only need exact match on the tokens that are parsed. It won't work that well for all languages (CJK languages, for example), but it will work fine for many.

It is also possible to handle multilingual content in a more intelligent (i.e. per-language configuration) way in your search index, of course. There are three primary strategies (i.e. ways that actually work in the real world) to do this:
1) Create a separate field for each language and search across all of them at query time
2) Create a separate core per language combination and search across all of them at query time
3) Invoke multiple language-specific analyzers within a single field's analyzer and index/query using one or more of those languages' analyzers for each document/query.

These are listed in ascending order of complexity, and each can be valid based upon your use case. For at least the first and third cases, you can use index-time language detection to map to the appropriate fields/analyzers if you are otherwise unaware of the languages of the content from your application layer.

The third option requires custom code (included in the large Multilingual Search chapter of Solr in Action http://solrinaction.com and soon to be contributed back to Solr via SOLR-6492 https://issues.apache.org/jira/browse/SOLR-6492), but it enables you to index an arbitrarily large number of languages into the same field if needed, while preserving language-specific analysis for each language.
I presented in detail on the above strategies at Lucene/Solr Revolution last November, so you may consider checking out the presentation and/or slides to assess if one of these strategies will work for your use case: http://www.treygrainger.com/posts/presentations/semantic-multilingual-strategies-in-lucenesolr/

For the record, I'd highly recommend going with the first strategy (a separate field per language) if you can, as it is certainly the simplest of the approaches (albeit the one that scales the least well after you add more than a few languages to your queries). If you want to stay simple and stick with the ICUTokenizer, then it will work to a point, but some of the problems Walter mentioned may eventually bite you if you are supporting certain groups of languages.

All the best,

Trey Grainger
Co-author, Solr in Action
Director of Engineering, Search Recommendations @ CareerBuilder

On Mon, Feb 23, 2015 at 11:14 PM, Walter Underwood wun...@wunderwood.org wrote:

It isn’t just complicated, it can be impossible. Do you have content in Chinese or Japanese? Those languages (and some others) do not separate words with spaces. You cannot even do word search without a language-specific, dictionary-based parser. German is space-separated, except many noun compounds are not space-separated. Do you have Finnish content? Entire prepositional phrases turn into word endings. Do you have Arabic content? That is even harder.

If all your content is in space-separated languages that are not heavily inflected, you can kind of do OK with a language-insensitive approach. But it hits the wall pretty fast. One thing that does work pretty well is trademarked names (LaserJet, Coke, etc.). Those are spelled the same in all languages and usually not inflected.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

On Feb 23, 2015, at 8:00 PM, Rishi Easwaran rishi.easwa...@aol.com wrote:

Hi Alex,

There is no specific language list.
For example: the documents that need to be indexed are emails or any messages for a global customer base. The messages back and forth could be in any language or a mix of languages.

I understand relevancy, stemming, etc. becomes extremely complicated with multilingual support, but our first goal is to be able to tokenize and provide basic search capability for any language. Ex: when the document contains "hello" or "здравствуйте", the analyzer creates tokens and provides exact-match search results. Now it would be great if it had the capability to tokenize email addresses (ex: he...@aol.com - I think StandardTokenizer already does this), filenames (здравствуйте.pdf), but maybe we can use filters to accomplish that.

Thanks, Rishi.

-Original Message-
From: Alexandre Rafalovitch arafa...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Mon, Feb 23, 2015 5:49 pm
Subject: Re: Basic Multilingual search capability

Which languages are you expecting to deal with? Multilingual support is a complex issue. Even if you think you don't need much, it is usually a lot more complex than expected, especially around relevancy.

Regards, Alex.
Sign up for my Solr resources