The ICU project ( http://site.icu-project.org/ ) has analyzers for Lucene, and 
they have been ported to ElasticSearch.  Maybe those integrate better.
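
For what it's worth, here is a minimal sketch of wiring those ICU components 
into an Analyzer. It assumes the lucene-analyzers-icu module and the Lucene 
4.x API (createComponents still takes a Reader); treat it as an untested 
illustration, not the canonical setup:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.ICUFoldingFilter;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;

public final class IcuAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // ICUTokenizer segments on UAX#29 word boundaries, with per-script
    // handling (e.g. dictionary-based segmentation where ICU provides it).
    Tokenizer source = new ICUTokenizer(reader);
    // ICUFoldingFilter applies Unicode case folding plus accent/width folding.
    TokenStream result = new ICUFoldingFilter(source);
    return new TokenStreamComponents(source, result);
  }
}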

As for suppressing some of that tokenization, I would think an extra filter 
stage in your chain would be just the thing (a Lucene analysis chain is one 
Tokenizer followed by any number of TokenFilters); see the sketch below.
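
Purely as a hypothetical sketch (Lucene 4.x API, names invented here): a 
TokenFilter that re-splits tokens on underscores, which StandardTokenizer 
leaves joined because '_' has Word_Break=ExtendNumLet under UAX#29:

import java.io.IOException;
import java.util.LinkedList;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

/**
 * Hypothetical example: re-splits tokens on '_'. Offsets are left pointing
 * at the whole original token; a production version would adjust
 * OffsetAttribute as well.
 */
public final class UnderscoreSplitFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt =
      addAttribute(PositionIncrementAttribute.class);
  private final LinkedList<String> pending = new LinkedList<String>();

  public UnderscoreSplitFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!pending.isEmpty()) {
      // Emit the remaining pieces of a previously split token.
      termAtt.setEmpty().append(pending.removeFirst());
      posIncAtt.setPositionIncrement(1);
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    String term = termAtt.toString();
    if (term.indexOf('_') >= 0) {
      for (String part : term.split("_")) {
        if (part.length() > 0) {
          pending.add(part);
        }
      }
      if (pending.isEmpty()) {
        return incrementToken(); // token was nothing but underscores
      }
      // The first piece replaces the original token in place.
      termAtt.setEmpty().append(pending.removeFirst());
    }
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pending.clear();
  }
}

Dropping it into an Analyzer is the same TokenStreamComponents dance as 
above, with new UnderscoreSplitFilter(tokenizer) as the result stream.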

-Paul

> -----Original Message-----
> From: Trejkaz [mailto:trej...@trypticon.org]
> Sent: Tuesday, January 08, 2013 3:44 PM
> To: java-user@lucene.apache.org
> Subject: Re: Is StandardAnalyzer good enough for multi languages...
> 
> On Wed, Jan 9, 2013 at 6:30 AM, saisantoshi <saisantosh...@gmail.com> wrote:
> > Does Lucene StandardAnalyzer work for all the languages for tokenizing
> > before indexing (since we are using Java, I think the content is
> > converted to UTF-8 before tokenizing/indexing)?
> 
> No. There are multiple cases where it chooses not to break something which
> it should break. Some of these cases even result in undesirable behaviour
> for English, so I would be surprised if there were even a single language
> which it handles acceptably.
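
For anyone who wants to see this for themselves, a quick harness along these 
lines prints what StandardAnalyzer actually emits (Lucene 4.x API assumed). 
With the underscore case mentioned above, "foo_bar" comes through as a single 
token:

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class ShowTokens {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
    TokenStream ts = analyzer.tokenStream("f", new StringReader("foo_bar baz"));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term);  // prints "foo_bar", then "baz"
    }
    ts.end();
    ts.close();
  }
}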
