Re: Lucene standard analyzer internationalization

Prashant Malik Tue, 22 Apr 2008 13:45:35 -0700

Yes the version of lucene and java are exactly the same on the different
machines.
Infact we unjared lucene and jared it with our jar and are running from the
same nfs mounts on both the machines


Also we have tried with lucene2.2.0 and 2.3.1. with the same result .

also about the actual string u have it right till 2 .

        3,4,5 are a single character

Thx
PM

On Tue, Apr 22, 2008 at 12:01 PM, Steven A Rowe <[EMAIL PROTECTED]> wrote:

> Hi Prashant,
>
> On 04/22/2008 at 2:23 PM, Prashant Malik wrote:
> >     We have been observing the following problem while
> > tokenizing using lucene's StandardAnalyzer. Tokens that we get is
> > different on different machines. I am suspecting it has something to do
> > with the Locale settings on individual machines?
> >
> > For example
> > the word 'CÃ(c)sar'   is split as  'CÃ(c)sar'   on machine 1
> >
> > while it is split into [cã, sar]  on machine 2 .
> >
> > Could someone please tell me what might be going on?
>
> Which version of Lucene are you using?  Is it the same on both machines?
>
> I ask because Lucene recently switched StandardTokenizer lexer generation
> from JavaCC to JFlex, for performance reasons (increased throughput).
>
> Also, my email viewer displays the word in question as the following
> sequence of characters:
>
>  1. Capital "C"
>  2. Capital "A" with a tilda ("~") above it
>  3. Left parenthesis
>  4. Lowercase "c"
>  5. Right parenthesis
>  6. Lowercase "s"
>  7. Lowercase "a"
>  8. Lowercase "r"
>
> Is this the correct character sequence? (Sometimes UTF-8 can look similar
> to this when it's interpreted as Latin-1.)
>
> Steve
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>

Re: Lucene standard analyzer internationalization

Reply via email to