The ICU project (http://site.icu-project.org/) has analyzers for Lucene, and they have been ported to ElasticSearch. Maybe those integrate better.
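For example, a minimal, untested sketch, assuming the Lucene 4.x API with lucene-analyzers-icu on the classpath (the IcuAnalyzer name is mine):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.icu.ICUFoldingFilter;
    import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;

    // Segments text with ICU's script-aware UAX#29 word-break rules,
    // then case-folds and normalizes the resulting tokens.
    public class IcuAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            Tokenizer source = new ICUTokenizer(reader);
            TokenStream stream = new ICUFoldingFilter(source);
            return new TokenStreamComponents(source, stream);
        }
    }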
As to not doing some tokenization, I would think an extra tokenizer in your chain would be just the thing (a sketch of what I mean follows at the end of this message).

-Paul

> -----Original Message-----
> From: Trejkaz [mailto:trej...@trypticon.org]
> Sent: Tuesday, January 08, 2013 3:44 PM
> To: java-user@lucene.apache.org
> Subject: Re: Is StandardAnalyzer good enough for multi languages...
>
> On Wed, Jan 9, 2013 at 6:30 AM, saisantoshi <saisantosh...@gmail.com> wrote:
> > Does Lucene StandardAnalyzer work for all the languages for tokenizing
> > before indexing (since we are using java, I think the content is
> > converted to UTF-8 before tokenizing/indexing)?
>
> No. There are multiple cases where it chooses not to break something
> which it should break. Some of these cases even result in undesirable
> behaviour for English, so I would be surprised if there were even a
> single language which it handles acceptably.
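Here is the sketch mentioned above. In Lucene a chain has exactly one Tokenizer followed by TokenFilters, so an "extra tokenizer" in practice means an extra filter that re-splits tokens the tokenizer leaves whole. Untested, Lucene 4.x API assumed; the SplittingAnalyzer name and the underscore example are mine, not from this thread:

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;

    // Keeps StandardTokenizer but adds a WordDelimiterFilter to break
    // apart tokens it declines to split, e.g. "foo_bar" -> "foo", "bar".
    public class SplittingAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            Tokenizer source = new StandardTokenizer(Version.LUCENE_41, reader);
            TokenStream stream = new LowerCaseFilter(Version.LUCENE_41, source);
            // The "extra tokenizer": a filter that re-splits tokens.
            stream = new WordDelimiterFilter(stream,
                    WordDelimiterFilter.GENERATE_WORD_PARTS, null);
            return new TokenStreamComponents(source, stream);
        }
    }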