Re: Basic Multilingual search capability

2015-02-26 Thread Rishi Easwaran
Hi Tom,

Thanks for your input. 
I was planning to use a stopword filter, but will definitely make sure the stopwords 
are unique and do not step over each other.  I think for our system even going with a 
length of 50-75 should be fine, and I will definitely raise that number after doing some 
analysis on our input.
Just one clarification: when you say ICUFilterFactory, am I correct in thinking 
it's ICUFoldingFilterFactory?
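For reference, that length cap would just be a one-line tweak at the end of the 
analyzer chain, something like (assuming solr.LengthFilterFactory and its min/max 
attributes):

    <filter class="solr.LengthFilterFactory" min="1" max="75"/>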
 
Thanks,
Rishi.

 

 

-Original Message-
From: Tom Burton-West tburt...@umich.edu
To: solr-user solr-user@lucene.apache.org
Sent: Wed, Feb 25, 2015 4:33 pm
Subject: Re: Basic Multilingual search capability


Hi Rishi,

As others have indicated, multilingual search is very difficult to do well.

At HathiTrust we've been using the ICUTokenizer and ICUFilterFactory to
deal with having materials in 400 languages.  We also added the
CJKBigramFilter to get better precision on CJK queries.  We don't use stop
words because stop words in one language are content words in another.  For
example, "die" in German is a stopword, but it is a content word in English.

Putting multiple languages in one index can affect word frequency
statistics, which makes relevance ranking less accurate.  So, for example, for
the English query "Die Hard" the word "die" would get a low idf score
because it occurs so frequently in German.  We realize that our approach
does not produce the best results, but given the 400 languages and limited
resources, we do our best to make search not suck for non-English
languages.  When we have the resources, we are thinking about doing special
processing for a small fraction of the top 20 languages.  We plan to select
those languages that most need special processing and are relatively easy to
disambiguate from other languages.


If you plan on identifying languages (rather than scripts), you should be
aware that most language detection libraries don't work well on short texts
such as queries.

If you know that you have scripts for which you have content in only one
language, you can use script detection instead of language detection.


If you have German, a filter length of 25 might be too low (because of
compounding). You might want to analyze a sample of your German text to
find a good length.

Tom

http://www.hathitrust.org/blogs/Large-scale-Search


On Wed, Feb 25, 2015 at 10:31 AM, Rishi Easwaran rishi.easwa...@aol.com
wrote:

 Hi Alex,

 Thanks for the suggestions. These steps will definitely help out with our
 use case.
 Thanks for the idea about the lengthFilter to protect our system.

 Thanks,
 Rishi.







 -Original Message-
 From: Alexandre Rafalovitch arafa...@gmail.com
 To: solr-user solr-user@lucene.apache.org
 Sent: Tue, Feb 24, 2015 8:50 am
 Subject: Re: Basic Multilingual search capability


 Given the limited needs, I would probably do something like this:

 1) Put a language identifier in the UpdateRequestProcessor chain
 during indexing and route out at least known problematic languages,
 such as Chinese, Japanese, Arabic into individual fields
 2) Put everything else together into one field with ICUTokenizer,
 maybe also ICUFoldingFilter
 3) At the very end of that joint filter, stick in LengthFilter with
 some high number, e.g. 25 characters max. This will ensure that
 super-long words from non-space languages and edge conditions do not
 break the rest of your system.


 Regards,
   Alex.
 Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
 http://www.solr-start.com/


 On 23 February 2015 at 23:14, Walter Underwood wun...@wunderwood.org
 wrote:
  I understand relevancy, stemming etc becomes extremely complicated with
 multilingual support, but our first goal is to be able to tokenize and provide
 basic search capability for any language. Ex: When the document contains "hello"
 or "здравствуйте", the analyzer creates tokens and provides exact match search
 results.






Re: Basic Multilingual search capability

2015-02-25 Thread Rishi Easwaran


Hi Trey,

Thanks for the detailed response and the link to the talk, it was very 
informative.
Yes, looking at the current system requirements, the ICUTokenizer might be the best 
bet for our use case.
The MultiTextField mentioned in the jira SOLR-6492 has some cool features, and I'm 
definitely looking forward to trying it out once it's integrated into main.

 
Thanks,
Rishi.

 

 

-Original Message-
From: Trey Grainger solrt...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Tue, Feb 24, 2015 1:40 am
Subject: Re: Basic Multilingual search capability


Hi Rishi,

I don't generally recommend a language-insensitive approach except for
really simple multilingual use cases (for most of the reasons Walter
mentioned), but the ICUTokenizer is probably the best bet you're going to
have if you really want to go that route and only need exact-match on the
tokens that are parsed. It won't work that well for all languages (CJK
languages, for example), but it will work fine for many.

It is also possible to handle multi-lingual content in a more intelligent
(i.e. per-language configuration) way in your search index, of course.
There are three primary strategies (i.e. ways that actually work in the
real world) to do this:
1) create a separate field for each language and search across all of them
at query time
2) create a separate core per language-combination and search across all of
them at query time
3) invoke multiple language-specific analyzers within a single field's
analyzer and index/query using one or more of those languages' analyzers
for each document/query.

These are listed in ascending order of complexity, and each can be valid
based upon your use case. For at least the first and third cases, you can
use index-time language detection to map to the appropriate
fields/analyzers if you are otherwise unaware of the languages of the
content from your application layer. The third option requires custom code
(included in the large Multilingual Search chapter of Solr in Action
http://solrinaction.com and soon to be contributed back to Solr via
SOLR-6492 https://issues.apache.org/jira/browse/SOLR-6492), but it
enables you to index an arbitrarily large number of languages into the same
field if needed, while preserving language-specific analysis for each
language.

I presented in detail on the above strategies at Lucene/Solr Revolution
last November, so you may consider checking out the presentation and/or
slides to assess if one of these strategies will work for your use case:
http://www.treygrainger.com/posts/presentations/semantic-multilingual-strategies-in-lucenesolr/

For the record, I'd highly recommend going with the first strategy (a
separate field per language) if you can, as it is certainly the simplest of
the approaches (albeit the one that scales the least well after you add
more than a few languages to your queries). If you want to stay simple and
stick with the ICUTokenizer then it will work to a point, but some of the
problems Walter mentioned may eventually bite you if you are supporting
certain groups of languages.

All the best,

Trey Grainger
Co-author, Solr in Action
Director of Engineering, Search & Recommendations @ CareerBuilder

On Mon, Feb 23, 2015 at 11:14 PM, Walter Underwood wun...@wunderwood.org
wrote:

 It isn’t just complicated, it can be impossible.

 Do you have content in Chinese or Japanese? Those languages (and some
 others) do not separate words with spaces. You cannot even do word search
 without a language-specific, dictionary-based parser.

 German is space separated, except many noun compounds are not
 space-separated.

 Do you have Finnish content? Entire prepositional phrases turn into word
 endings.

 Do you have Arabic content? That is even harder.

 If all your content is in space-separated languages that are not heavily
 inflected, you can kind of do OK with a language-insensitive approach. But
 it hits the wall pretty fast.

 One thing that does work pretty well is trademarked names (LaserJet, Coke,
 etc). Those are spelled the same in all languages and usually not inflected.

 wunder
 Walter Underwood
 wun...@wunderwood.org
 http://observer.wunderwood.org/  (my blog)

 On Feb 23, 2015, at 8:00 PM, Rishi Easwaran rishi.easwa...@aol.com
 wrote:

  Hi Alex,
 
  There is no specific language list.
  For example: the documents that needs to be indexed are emails or any
 messages for a global customer base. The messages back and forth could be
 in any language or mix of languages.
 
  I understand relevancy, stemming etc becomes extremely complicated with
 multilingual support, but our first goal is to be able to tokenize and
 provide basic search capability for any language. Ex: When the document
 contains hello or здравствуйте, the analyzer creates tokens and provides
 exact match search results.
 
  Now it would be great if it had capability to tokenize email addresses
 (ex:he...@aol.com- i think standardTokenizer already does this),
 filenames (здравствуйте.pdf

Re: Basic Multilingual search capability

2015-02-25 Thread Rishi Easwaran
Hi Alex,

Thanks for the suggestions. These steps will definitely help out with our use 
case.
Thanks for the idea about the lengthFilter to protect our system.

Thanks,
Rishi.

 

 

 

-Original Message-
From: Alexandre Rafalovitch arafa...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Tue, Feb 24, 2015 8:50 am
Subject: Re: Basic Multilingual search capability


Given the limited needs, I would probably do something like this:

1) Put a language identifier in the UpdateRequestProcessor chain
during indexing and route out at least known problematic languages,
such as Chinese, Japanese, Arabic into individual fields
2) Put everything else together into one field with ICUTokenizer,
maybe also ICUFoldingFilter
3) At the very end of that joint filter, stick in LengthFilter with
some high number, e.g. 25 characters max. This will ensure that
super-long words from non-space languages and edge conditions do not
break the rest of your system.


Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 23 February 2015 at 23:14, Walter Underwood wun...@wunderwood.org wrote:
 I understand relevancy, stemming etc becomes extremely complicated with 
multilingual support, but our first goal is to be able to tokenize and provide 
basic search capability for any language. Ex: When the document contains hello 
or здравствуйте, the analyzer creates tokens and provides exact match search 
results.



Re: Basic Multilingual search capability

2015-02-25 Thread Tom Burton-West
Hi Rishi,

As others have indicated, multilingual search is very difficult to do well.

At HathiTrust we've been using the ICUTokenizer and ICUFilterFactory to
deal with having materials in 400 languages.  We also added the
CJKBigramFilter to get better precision on CJK queries.  We don't use stop
words because stop words in one language are content words in another.  For
example, "die" in German is a stopword, but it is a content word in English.

Putting multiple languages in one index can affect word frequency
statistics, which makes relevance ranking less accurate.  So, for example, for
the English query "Die Hard" the word "die" would get a low idf score
because it occurs so frequently in German.  We realize that our approach
does not produce the best results, but given the 400 languages and limited
resources, we do our best to make search not suck for non-English
languages.  When we have the resources, we are thinking about doing special
processing for a small fraction of the top 20 languages.  We plan to select
those languages that most need special processing and are relatively easy to
disambiguate from other languages.
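For illustration, a field type roughly along those lines might look like the
sketch below (a sketch only, not our exact configuration, and it assumes the
folding filter in question is ICUFoldingFilterFactory; adjust the filters and
options to your own needs):

    <fieldType name="text_multi" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <!-- script-aware tokenization across languages -->
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <!-- fold case, accents and compatibility forms -->
        <filter class="solr.ICUFoldingFilterFactory"/>
        <!-- index CJK text as overlapping bigrams for better precision -->
        <filter class="solr.CJKBigramFilterFactory"
                han="true" hiragana="true" katakana="true" hangul="true"/>
      </analyzer>
    </fieldType>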


If you plan on identifying languages (rather than scripts), you should be
aware that most language detection libraries don't work well on short texts
such as queries.

If you know that you have scripts for which you have content in only one
language, you can use script detection instead of language detection.


If you have German, a filter length of 25 might be too low (because of
compounding). You might want to analyze a sample of your German text to
find a good length.

Tom

http://www.hathitrust.org/blogs/Large-scale-Search


On Wed, Feb 25, 2015 at 10:31 AM, Rishi Easwaran rishi.easwa...@aol.com
wrote:

 Hi Alex,

 Thanks for the suggestions. These steps will definitely help out with our
 use case.
 Thanks for the idea about the lengthFilter to protect our system.

 Thanks,
 Rishi.







 -Original Message-
 From: Alexandre Rafalovitch arafa...@gmail.com
 To: solr-user solr-user@lucene.apache.org
 Sent: Tue, Feb 24, 2015 8:50 am
 Subject: Re: Basic Multilingual search capability


 Given the limited needs, I would probably do something like this:

 1) Put a language identifier in the UpdateRequestProcessor chain
 during indexing and route out at least known problematic languages,
 such as Chinese, Japanese, Arabic into individual fields
 2) Put everything else together into one field with ICUTokenizer,
 maybe also ICUFoldingFilter
 3) At the very end of that joint filter, stick in LengthFilter with
 some high number, e.g. 25 characters max. This will ensure that
 super-long words from non-space languages and edge conditions do not
 break the rest of your system.


 Regards,
   Alex.
 Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
 http://www.solr-start.com/


 On 23 February 2015 at 23:14, Walter Underwood wun...@wunderwood.org
 wrote:
  I understand relevancy, stemming etc becomes extremely complicated with
 multilingual support, but our first goal is to be able to tokenize and provide
 basic search capability for any language. Ex: When the document contains "hello"
 or "здравствуйте", the analyzer creates tokens and provides exact match search
 results.





Re: Basic Multilingual search capability

2015-02-24 Thread Alexandre Rafalovitch
Given the limited needs, I would probably do something like this:

1) Put a language identifier in the UpdateRequestProcessor chain
during indexing and route out at least known problematic languages,
such as Chinese, Japanese, Arabic into individual fields
2) Put everything else together into one field with ICUTokenizer,
maybe also ICUFoldingFilter
3) At the very end of that joint filter, stick in LengthFilter with
some high number, e.g. 25 characters max. This will ensure that
super-long words from non-space languages and edge conditions do not
break the rest of your system.
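
A rough sketch of what that could look like in configuration (the langid
processor choice, field names, and language list below are illustrative only;
adjust to your schema):

In solrconfig.xml, a langid chain along these lines:

    <updateRequestProcessorChain name="langid">
      <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
        <str name="langid.fl">body</str>
        <str name="langid.langField">language_s</str>
        <!-- route detected zh/ja/ar into body_zh / body_ja / body_ar,
             everything else falls back to the shared body_general field -->
        <bool name="langid.map">true</bool>
        <str name="langid.whitelist">zh,ja,ar</str>
        <str name="langid.fallback">general</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>

And in the schema, a catch-all type for the shared field (steps 2 and 3):

    <fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <filter class="solr.ICUFoldingFilterFactory"/>
        <!-- drop pathological tokens from non-space languages / edge cases -->
        <filter class="solr.LengthFilterFactory" min="1" max="25"/>
      </analyzer>
    </fieldType>
    <field name="body_general" type="text_icu" indexed="true" stored="true"/>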


Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 23 February 2015 at 23:14, Walter Underwood wun...@wunderwood.org wrote:
 I understand relevancy, stemming etc becomes extremely complicated with 
 multilingual support, but our first goal is to be able to tokenize and 
 provide basic search capability for any language. Ex: When the document 
 contains hello or здравствуйте, the analyzer creates tokens and provides 
 exact match search results.


Basic Multilingual search capability

2015-02-23 Thread Rishi Easwaran
Hi All,

For our use case we don't really need to do a lot of manipulation of incoming 
text during index time. At most, removal of common stop words and tokenizing of 
emails/filenames etc. if possible. We get text documents from our end users, which can 
be in any language (sometimes a combination), and we cannot determine the language 
of the incoming text. Language detection at index time is not necessary.

Which analyzer is recommended to achieve basic multilingual search capability 
for a use case like this?
I have read a bunch of posts about using a combination of StandardTokenizer or 
ICUTokenizer, LowerCaseFilter and ReversedWildcardFilterFactory, but I am looking for 
ideas, suggestions, best practices.

http://lucene.472066.n3.nabble.com/ICUTokenizer-or-StandardTokenizer-or-for-quot-text-all-quot-type-field-that-might-include-non-whitess-td4142727.html#a4144236
http://lucene.472066.n3.nabble.com/How-to-implement-multilingual-word-components-fields-schema-td4157140.html#a4158923
https://issues.apache.org/jira/browse/SOLR-6492  
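For reference, the kind of combination those posts describe would look roughly 
like this (just my rough reading of those threads, not something I have validated; 
the reversed-wildcard filter would go on the index side only):

    <fieldType name="text_any" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- index reversed copies of tokens so leading-wildcard queries stay cheap -->
        <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>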

 
Thanks,
Rishi.
 


Re: Basic Multilingual search capability

2015-02-23 Thread Alexandre Rafalovitch
Which languages are you expecting to deal with? Multilingual support
is a complex issue. Even if you think you don't need much, it is
usually a lot more complex than expected, especially around relevancy.

Regards,
   Alex.

Sign up for my Solr resources newsletter at http://www.solr-start.com/


On 23 February 2015 at 16:19, Rishi Easwaran rishi.easwa...@aol.com wrote:
 Hi All,

 For our use case we don't really need to do a lot of manipulation of incoming 
 text during index time. At most removal of common stop words, tokenize 
 emails/ filenames etc if possible. We get text documents from our end users, 
 which can be in any language (sometimes combination) and we cannot determine 
 the language of the incoming text. Language detection at index time is not 
 necessary.

 Which analyzer is recommended to achieve basic multilingual search capability 
 for a use case like this?
 I have read a bunch of posts about using a combination of StandardTokenizer or 
 ICUTokenizer, LowerCaseFilter and ReversedWildcardFilterFactory, but looking 
 for ideas, suggestions, best practices.

 http://lucene.472066.n3.nabble.com/ICUTokenizer-or-StandardTokenizer-or-for-quot-text-all-quot-type-field-that-might-include-non-whitess-td4142727.html#a4144236
 http://lucene.472066.n3.nabble.com/How-to-implement-multilingual-word-components-fields-schema-td4157140.html#a4158923
 https://issues.apache.org/jira/browse/SOLR-6492


 Thanks,
 Rishi.



Re: Basic Multilingual search capability

2015-02-23 Thread Walter Underwood
It isn’t just complicated, it can be impossible.

Do you have content in Chinese or Japanese? Those languages (and some others) 
do not separate words with spaces. You cannot even do word search without a 
language-specific, dictionary-based parser.

German is space separated, except many noun compounds are not space-separated.

Do you have Finnish content? Entire prepositional phrases turn into word 
endings.

Do you have Arabic content? That is even harder.

If all your content is in space-separated languages that are not heavily 
inflected, you can kind of do OK with a language-insensitive approach. But it 
hits the wall pretty fast.

One thing that does work pretty well is trademarked names (LaserJet, Coke, 
etc). Those are spelled the same in all languages and usually not inflected.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

On Feb 23, 2015, at 8:00 PM, Rishi Easwaran rishi.easwa...@aol.com wrote:

 Hi Alex,
 
 There is no specific language list.  
 For example: the documents that needs to be indexed are emails or any 
 messages for a global customer base. The messages back and forth could be in 
 any language or mix of languages.
 
 I understand relevancy, stemming etc becomes extremely complicated with 
 multilingual support, but our first goal is to be able to tokenize and 
 provide basic search capability for any language. Ex: When the document 
 contains hello or здравствуйте, the analyzer creates tokens and provides 
 exact match search results.
 
 Now it would be great if it had capability to tokenize email addresses 
 (ex:he...@aol.com- i think standardTokenizer already does this),  filenames 
 (здравствуйте.pdf), but maybe we can use filters to accomplish that. 
 
 Thanks,
 Rishi.
 
 -Original Message-
 From: Alexandre Rafalovitch arafa...@gmail.com
 To: solr-user solr-user@lucene.apache.org
 Sent: Mon, Feb 23, 2015 5:49 pm
 Subject: Re: Basic Multilingual search capability
 
 
 Which languages are you expecting to deal with? Multilingual support
 is a complex issue. Even if you think you don't need much, it is
 usually a lot more complex than expected, especially around relevancy.
 
 Regards,
   Alex.
 
 Sign up for my Solr resources newsletter at http://www.solr-start.com/
 
 
 On 23 February 2015 at 16:19, Rishi Easwaran rishi.easwa...@aol.com wrote:
 Hi All,
 
 For our use case we don't really need to do a lot of manipulation of 
 incoming 
 text during index time. At most removal of common stop words, tokenize 
 emails/ 
 filenames etc if possible. We get text documents from our end users, which 
 can 
 be in any language (sometimes combination) and we cannot determine the 
 language 
 of the incoming text. Language detection at index time is not necessary.
 
 Which analyzer is recommended to achieve basic multilingual search capability 
 for a use case like this?
 I have read a bunch of posts about using a combination of StandardTokenizer or 
 ICUTokenizer, LowerCaseFilter and ReversedWildcardFilterFactory, but looking 
 for ideas, suggestions, best practices.
 
 http://lucene.472066.n3.nabble.com/ICUTokenizer-or-StandardTokenizer-or-for-quot-text-all-quot-type-field-that-might-include-non-whitess-td4142727.html#a4144236
 http://lucene.472066.n3.nabble.com/How-to-implement-multilingual-word-components-fields-schema-td4157140.html#a4158923
 https://issues.apache.org/jira/browse/SOLR-6492
 
 
 Thanks,
 Rishi.
 
 
 



Re: Basic Multilingual search capability

2015-02-23 Thread Rishi Easwaran
Hi Alex,

There is no specific language list.
For example: the documents that need to be indexed are emails or any messages 
for a global customer base. The messages back and forth could be in any 
language or mix of languages.

I understand relevancy, stemming etc. becomes extremely complicated with 
multilingual support, but our first goal is to be able to tokenize and provide 
basic search capability for any language. Ex: when the document contains "hello" 
or "здравствуйте", the analyzer creates tokens and provides exact match search 
results.

Now it would be great if it had the capability to tokenize email addresses 
(ex: he...@aol.com - I think StandardTokenizer already does this) and filenames 
(здравствуйте.pdf), but maybe we can use filters to accomplish that.

Thanks,
Rishi.
 
 
-Original Message-
From: Alexandre Rafalovitch arafa...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Mon, Feb 23, 2015 5:49 pm
Subject: Re: Basic Multilingual search capability


Which languages are you expecting to deal with? Multilingual support
is a complex issue. Even if you think you don't need much, it is
usually a lot more complex than expected, especially around relevancy.

Regards,
   Alex.

Sign up for my Solr resources newsletter at http://www.solr-start.com/


On 23 February 2015 at 16:19, Rishi Easwaran rishi.easwa...@aol.com wrote:
 Hi All,

 For our use case we don't really need to do a lot of manipulation of incoming 
text during index time. At most removal of common stop words, tokenize emails/ 
filenames etc if possible. We get text documents from our end users, which can 
be in any language (sometimes combination) and we cannot determine the language 
of the incoming text. Language detection at index time is not necessary.

 Which analyzer is recommended to achieve basic multilingual search capability 
for a use case like this?
 I have read a bunch of posts about using a combination of StandardTokenizer or 
ICUTokenizer, LowerCaseFilter and ReversedWildcardFilterFactory, but looking for 
ideas, suggestions, best practices.

 http://lucene.472066.n3.nabble.com/ICUTokenizer-or-StandardTokenizer-or-for-quot-text-all-quot-type-field-that-might-include-non-whitess-td4142727.html#a4144236
 http://lucene.472066.n3.nabble.com/How-to-implement-multilingual-word-components-fields-schema-td4157140.html#a4158923
 https://issues.apache.org/jira/browse/SOLR-6492


 Thanks,
 Rishi.


 


Re: Basic Multilingual search capability

2015-02-23 Thread Rishi Easwaran
Hi Wunder,

Yes, we do expect incoming documents to contain Chinese/Japanese/Arabic 
languages.

From what you have mentioned, it looks like we need to auto-detect the incoming 
content language and tokenize/filter after that.
But I thought the ICU tokenizer had the capability to do that 
(https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-ICUTokenizer): 
"This tokenizer processes multilingual text and tokenizes it appropriately 
based on its script attribute."
Or am I missing something?

Thanks,
Rishi.

 

 

-Original Message-
From: Walter Underwood wun...@wunderwood.org
To: solr-user solr-user@lucene.apache.org
Sent: Mon, Feb 23, 2015 11:17 pm
Subject: Re: Basic Multilingual search capability


It isn’t just complicated, it can be impossible.

Do you have content in Chinese or Japanese? Those languages (and some others) 
do 
not separate words with spaces. You cannot even do word search without a 
language-specific, dictionary-based parser.

German is space separated, except many noun compounds are not space-separated.

Do you have Finnish content? Entire prepositional phrases turn into word 
endings.

Do you have Arabic content? That is even harder.

If all your content is in space-separated languages that are not heavily 
inflected, you can kind of do OK with a language-insensitive approach. But it 
hits the wall pretty fast.

One thing that does work pretty well is trademarked names (LaserJet, Coke, 
etc). 
Those are spelled the same in all languages and usually not inflected.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

On Feb 23, 2015, at 8:00 PM, Rishi Easwaran rishi.easwa...@aol.com wrote:

 Hi Alex,
 
 There is no specific language list.  
 For example: the documents that needs to be indexed are emails or any 
 messages 
for a global customer base. The messages back and forth could be in any 
language 
or mix of languages.
 
 I understand relevancy, stemming etc becomes extremely complicated with 
multilingual support, but our first goal is to be able to tokenize and provide 
basic search capability for any language. Ex: When the document contains hello 
or здравствуйте, the analyzer creates tokens and provides exact match search 
results.
 
 Now it would be great if it had capability to tokenize email addresses 
(ex:he...@aol.com- i think standardTokenizer already does this),  filenames 
(здравствуйте.pdf), but maybe we can use filters to accomplish that. 
 
 Thanks,
 Rishi.
 
 -Original Message-
 From: Alexandre Rafalovitch arafa...@gmail.com
 To: solr-user solr-user@lucene.apache.org
 Sent: Mon, Feb 23, 2015 5:49 pm
 Subject: Re: Basic Multilingual search capability
 
 
 Which languages are you expecting to deal with? Multilingual support
 is a complex issue. Even if you think you don't need much, it is
 usually a lot more complex than expected, especially around relevancy.
 
 Regards,
    Alex.
 
 Sign up for my Solr resources newsletter at http://www.solr-start.com/
 
 
 On 23 February 2015 at 16:19, Rishi Easwaran rishi.easwa...@aol.com wrote:
 Hi All,

 For our use case we don't really need to do a lot of manipulation of incoming
 text during index time. At most removal of common stop words, tokenize emails/
 filenames etc if possible. We get text documents from our end users, which can
 be in any language (sometimes combination) and we cannot determine the language
 of the incoming text. Language detection at index time is not necessary.

 Which analyzer is recommended to achieve basic multilingual search capability
 for a use case like this?
 I have read a bunch of posts about using a combination of StandardTokenizer or
 ICUTokenizer, LowerCaseFilter and ReversedWildcardFilterFactory, but looking for
 ideas, suggestions, best practices.

 http://lucene.472066.n3.nabble.com/ICUTokenizer-or-StandardTokenizer-or-for-quot-text-all-quot-type-field-that-might-include-non-whitess-td4142727.html#a4144236
 http://lucene.472066.n3.nabble.com/How-to-implement-multilingual-word-components-fields-schema-td4157140.html#a4158923
 https://issues.apache.org/jira/browse/SOLR-6492


 Thanks,
 Rishi.


Re: Basic Multilingual search capability

2015-02-23 Thread Trey Grainger
Hi Rishi,

I don't generally recommend a language-insensitive approach except for
really simple multilingual use cases (for most of the reasons Walter
mentioned), but the ICUTokenizer is probably the best bet you're going to
have if you really want to go that route and only need exact-match on the
tokens that are parsed. It won't work that well for all languages (CJK
languages, for example), but it will work fine for many.

It is also possible to handle multi-lingual content in a more intelligent
(i.e. per-language configuration) way in your search index, of course.
There are three primary strategies (i.e. ways that actually work in the
real world) to do this:
1) create a separate field for each language and search across all of them
at query time
2) create a separate core per language-combination and search across all of
them at query time
3) invoke multiple language-specific analyzers within a single field's
analyzer and index/query using one or more of those languages' analyzers
for each document/query.

These are listed in ascending order of complexity, and each can be valid
based upon your use case. For at least the first and third cases, you can
use index-time language detection to map to the appropriate
fields/analyzers if you are otherwise unaware of the languages of the
content from your application layer. The third option requires custom code
(included in the large Multilingual Search chapter of Solr in Action
http://solrinaction.com and soon to be contributed back to Solr via
SOLR-6492 https://issues.apache.org/jira/browse/SOLR-6492), but it
enables you to index an arbitrarily large number of languages into the same
field if needed, while preserving language-specific analysis for each
language.

I presented in detail on the above strategies at Lucene/Solr Revolution
last November, so you may consider checking out the presentation and/or
slides to assess if one of these strategies will work for your use case:
http://www.treygrainger.com/posts/presentations/semantic-multilingual-strategies-in-lucenesolr/

For the record, I'd highly recommend going with the first strategy (a
separate field per language) if you can, as it is certainly the simplest of
the approaches (albeit the one that scales the least well after you add
more than a few languages to your queries). If you want to stay simple and
stick with the ICUTokenizer then it will work to a point, but some of the
problems Walter mentioned may eventually bite you if you are supporting
certain groups of languages.
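
As a very rough illustration of that first strategy (the field and type names
below are just placeholders for whatever per-language analysis chains you
define in your own schema):

    <field name="content_en" type="text_en"  indexed="true" stored="true"/>
    <field name="content_de" type="text_de"  indexed="true" stored="true"/>
    <field name="content_ja" type="text_cjk" indexed="true" stored="true"/>

and then at query time search across all of them, e.g. with edismax:

    q=hello&defType=edismax&qf=content_en content_de content_ja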

All the best,

Trey Grainger
Co-author, Solr in Action
Director of Engineering, Search & Recommendations @ CareerBuilder

On Mon, Feb 23, 2015 at 11:14 PM, Walter Underwood wun...@wunderwood.org
wrote:

 It isn’t just complicated, it can be impossible.

 Do you have content in Chinese or Japanese? Those languages (and some
 others) do not separate words with spaces. You cannot even do word search
 without a language-specific, dictionary-based parser.

 German is space separated, except many noun compounds are not
 space-separated.

 Do you have Finnish content? Entire prepositional phrases turn into word
 endings.

 Do you have Arabic content? That is even harder.

 If all your content is in space-separated languages that are not heavily
 inflected, you can kind of do OK with a language-insensitive approach. But
 it hits the wall pretty fast.

 One thing that does work pretty well is trademarked names (LaserJet, Coke,
 etc). Those are spelled the same in all languages and usually not inflected.

 wunder
 Walter Underwood
 wun...@wunderwood.org
 http://observer.wunderwood.org/  (my blog)

 On Feb 23, 2015, at 8:00 PM, Rishi Easwaran rishi.easwa...@aol.com
 wrote:

  Hi Alex,
 
  There is no specific language list.
  For example: the documents that needs to be indexed are emails or any
 messages for a global customer base. The messages back and forth could be
 in any language or mix of languages.
 
  I understand relevancy, stemming etc becomes extremely complicated with
 multilingual support, but our first goal is to be able to tokenize and
 provide basic search capability for any language. Ex: When the document
 contains hello or здравствуйте, the analyzer creates tokens and provides
 exact match search results.
 
  Now it would be great if it had capability to tokenize email addresses
 (ex:he...@aol.com- i think standardTokenizer already does this),
 filenames (здравствуйте.pdf), but maybe we can use filters to accomplish
 that.
 
  Thanks,
  Rishi.
 
  -Original Message-
  From: Alexandre Rafalovitch arafa...@gmail.com
  To: solr-user solr-user@lucene.apache.org
  Sent: Mon, Feb 23, 2015 5:49 pm
  Subject: Re: Basic Multilingual search capability
 
 
  Which languages are you expecting to deal with? Multilingual support
  is a complex issue. Even if you think you don't need much, it is
  usually a lot more complex than expected, especially around relevancy.
 
  Regards,
    Alex.

  Sign up for my Solr resources newsletter at http://www.solr-start.com/