RE: Lucene standard analyzer internationalization

Steven A Rowe Tue, 22 Apr 2008 12:03:53 -0700

Hi Prashant,

On 04/22/2008 at 2:23 PM, Prashant Malik wrote:
>     We have been observing the following problem while
> tokenizing using lucene's StandardAnalyzer. Tokens that we get is
> different on different machines. I am suspecting it has something to do
> with the Locale settings on individual machines?
> 
> For example
> the word 'CÃ(c)sar'   is split as  'CÃ(c)sar'   on machine 1
> 
> while it is split into [cã, sar]  on machine 2 .
> 
> Could someone please tell me what might be going on?


Which version of Lucene are you using?  Is it the same on both machines?

I ask because Lucene recently switched StandardTokenizer lexer generation from 
JavaCC to JFlex, for performance reasons (increased throughput).

Also, my email viewer displays the word in question as the following sequence 
of characters:

 1. Capital "C"
 2. Capital "A" with a tilde ("~") above it
 3. Left parenthesis
 4. Lowercase "c"
 5. Right parenthesis
 6. Lowercase "s"
 7. Lowercase "a"
 8. Lowercase "r"

Is this the correct character sequence? (Sometimes UTF-8 can look similar to 
this when it's interpreted as Latin-1.)
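
To illustrate that parenthetical, here is a minimal sketch of how UTF-8 bytes
read as Latin-1 could produce a sequence like the one listed above. It assumes
Java 8 or later and that the intended word was "César" (a guess from context,
not something confirmed in this thread). The class and method names are
hypothetical and are not part of Lucene. The second byte of the UTF-8 encoding
of "é" decodes to U+00A9, the copyright sign, which some software may render
as "(c)".

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // Encode the text as UTF-8, then (wrongly) decode those bytes as Latin-1.
    public static String misdecode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Undo the mistake: recover the raw bytes via Latin-1, then decode as UTF-8.
    public static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    // List the code points of a string so the exact sequence can be compared.
    public static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Unicode escape avoids depending on the source file's own encoding.
        String original = "C\u00e9sar"; // assumed intended word
        String garbled = misdecode(original);

        // prints: U+0043 U+00E9 U+0073 U+0061 U+0072
        System.out.println(describe(original));
        // prints: U+0043 U+00C3 U+00A9 U+0073 U+0061 U+0072
        System.out.println(describe(garbled));
        // prints: true
        System.out.println(repair(garbled).equals(original));
    }
}
```

Running describe() on the input text on each machine, just before it reaches
the analyzer, would show whether both machines are actually handing the same
characters to the tokenizer.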

Steve

