Hi Alex,

There is no specific language list. For example, the documents that need to be indexed are emails or other messages for a global customer base. The messages going back and forth could be in any language or a mix of languages.

I understand that relevancy, stemming, etc. become extremely complicated with multilingual support, but our first goal is to be able to tokenize and provide basic search capability for any language. For example, when a document contains "hello" or "здравствуйте", the analyzer creates tokens and search returns exact-match results.
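To make the goal above concrete, here is a minimal Python sketch of language-agnostic tokenization with exact-match lookup. It is not Solr or Lucene code and does not reproduce the behaviour of any Solr tokenizer; the names `tokenize`, `build_index` and `search`, and the sample documents, are made up for illustration. It assumes that splitting on Unicode word characters and case-folding is an acceptable approximation of "basic" tokenization.

```python
import re
import unicodedata
from collections import defaultdict

# For str patterns in Python 3, \w is Unicode-aware: it matches letters and
# digits in any script (Latin, Cyrillic, ...), plus the underscore.
TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Split text into case-folded word tokens, without language detection."""
    # NFKC normalization makes compatibility-equivalent strings compare equal.
    normalized = unicodedata.normalize("NFKC", text)
    # casefold() is a Unicode-aware, more aggressive form of lower().
    return [tok.casefold() for tok in TOKEN_RE.findall(normalized)]


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in tokenize(text):
            index[tok].add(doc_id)
    return index


def search(index, term):
    """Return the sorted ids of documents containing the exact term."""
    tokens = tokenize(term)
    if len(tokens) != 1:
        return []  # this sketch only handles single-term queries
    return sorted(index.get(tokens[0], set()))


# Hypothetical sample documents, including one that mixes languages.
docs = {
    1: "Hello, please see the attached report.",
    2: "Здравствуйте! Отчёт во вложении.",
    3: "hello и здравствуйте in one message",
}
index = build_index(docs)
print(search(index, "hello"))         # [1, 3]
print(search(index, "здравствуйте"))  # [2, 3]
```

One limitation of this sketch: text in scripts written without spaces between words would come out as one long token per run of characters, so a real analyzer needs more than a `\w+` split for those languages.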
It would also be great if it could tokenize email addresses (e.g. he...@aol.com; I think StandardTokenizer already does this) and filenames (здравствуйте.pdf), but maybe we can use filters to accomplish that.

Thanks,
Rishi.

-----Original Message-----
From: Alexandre Rafalovitch <arafa...@gmail.com>
To: solr-user <solr-user@lucene.apache.org>
Sent: Mon, Feb 23, 2015 5:49 pm
Subject: Re: Basic Multilingual search capability

Which languages are you expecting to deal with? Multilingual support is a complex issue. Even if you think you don't need much, it is usually a lot more complex than expected, especially around relevancy.

Regards,
Alex.
----
Sign up for my Solr resources newsletter at http://www.solr-start.com/

On 23 February 2015 at 16:19, Rishi Easwaran <rishi.easwa...@aol.com> wrote:
> Hi All,
>
> For our use case we don't really need to do a lot of manipulation of incoming text during index time. At most, removal of common stop words and tokenization of emails, filenames, etc. if possible. We get text documents from our end users, which can be in any language (sometimes a combination), and we cannot determine the language of the incoming text. Language detection at index time is not necessary.
>
> Which analyzer is recommended to achieve basic multilingual search capability for a use case like this?
> I have read a bunch of posts about using a combination of StandardTokenizer or ICUTokenizer, LowerCaseFilter and ReversedWildcardFilterFactory, but I am looking for ideas, suggestions, and best practices.
>
> http://lucene.472066.n3.nabble.com/ICUTokenizer-or-StandardTokenizer-or-for-quot-text-all-quot-type-field-that-might-include-non-whitess-td4142727.html#a4144236
> http://lucene.472066.n3.nabble.com/How-to-implement-multilingual-word-components-fields-schema-td4157140.html#a4158923
> https://issues.apache.org/jira/browse/SOLR-6492
>
> Thanks,
> Rishi.