Re: Problems indexing Japanese with CJKAnalyzer

Steven Rowe Tue, 06 Jul 2004 10:23:32 -0700

Hi Jon,

It sounds to me like you have a character encoding problem. The native2ascii tool is designed to produce input for the Java compiler; the "\u7aef" notation you're seeing is understood by the Java compiler (and by a few other tools, such as the properties file loader) to mean the Unicode code point with that hexadecimal value. Other Java programs, however, depending on their implementation, may not understand this notation, and will treat it as literal text.

Alternatively, maybe the notation is understood, but the conversion from Shift-JIS to Java Unicode format is not being performed properly. If you don't tell native2ascii the source encoding, it will assume the "native" encoding for the platform. On Windows, depending on which localized version you've got, this is likely to be the so-called code page 1252 (ISO-8859-1 with a few modifications). Converting from one character encoding to another with incorrect assumptions about the source encoding can only lead to sorrow and confusion.
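To make the second failure mode concrete, here is a small sketch (not native2ascii itself) that decodes the same Shift-JIS bytes twice, once with the right source encoding and once as code page 1252, and prints each result in backslash-u notation. The class name EncodingDemo and the helper toUnicodeEscapes are made up for illustration, and the sketch assumes your JVM ships the Shift_JIS and windows-1252 charsets.

```java
import java.nio.charset.Charset;

class EncodingDemo {
    // Hypothetical stand-in for the escaping native2ascii performs:
    // ASCII passes through, everything else becomes a backslash-u escape.
    static String toUnicodeEscapes(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    // Decode raw bytes using the named source encoding.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The Shift-JIS bytes for one kanji, as they would sit in the HTML file.
        byte[] sjis = "\u7aef".getBytes(Charset.forName("Shift_JIS"));

        // Correct assumption about the source encoding.
        System.out.println(toUnicodeEscapes(decode(sjis, "Shift_JIS")));

        // Wrong assumption: each byte is treated as a separate Western character.
        System.out.println(toUnicodeEscapes(decode(sjis, "windows-1252")));
    }
}
```

The first line printed is the escape for the original character; the second is whatever code page 1252 makes of the two bytes individually, which is the kind of garbage that ends up in an index when the source encoding is guessed wrong.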

I think you can use the native2ascii tool to do what you want (untested), but it will take two passes:

1. Use native2ascii to convert your file(s) to Java Unicode format, but tell it the source encoding:

   native2ascii -encoding SJIS inputfile outputfile1

2. Run it again with -reverse to convert from Java Unicode format to UTF-8:

   native2ascii -reverse -encoding UTF8 outputfile1 finaloutput
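If a few lines of Java are more convenient than two native2ascii passes, the same Shift-JIS to UTF-8 conversion can be done in one step by decoding the bytes with the source encoding and re-encoding them. This is only a sketch: the class name SjisToUtf8 and the method transcode are my own inventions, it reads the whole file into memory, and it assumes your JVM provides the Shift_JIS charset.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class SjisToUtf8 {
    // Decode the bytes with the source encoding into a Java (Unicode) string,
    // then encode that string with the target encoding.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        return new String(input, from).getBytes(to);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java SjisToUtf8 inputfile outputfile");
            return;
        }
        byte[] sjis = Files.readAllBytes(Paths.get(args[0]));
        byte[] utf8 = transcode(sjis, Charset.forName("Shift_JIS"), StandardCharsets.UTF_8);
        Files.write(Paths.get(args[1]), utf8);
    }
}
```

Run it as "java SjisToUtf8 inputfile finaloutput". Whatever reads the converted files afterwards still has to be told they are UTF-8 rather than the platform default.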

Here's a web page with more information on native2ascii:

<URL:http://java.sun.com/j2se/1.4.2/docs/tooldocs/windows/native2ascii.html>

Hope it helps,
Steve Rowe

Jon Schuster wrote:

I've gone through all of the past messages regarding the CJKAnalyzer but I
still must be doing something wrong because my searches don't work.

I'm using the IndexHTML application from the org.apache.lucene.demo package
to do the indexing, and I've changed the analyzer to use the CJKAnalyzer.
I've also tried with and without setting the file.encoding to Shift-JIS.
I've tried indexing the HTML files, which contain Shift-JIS, without
conversion to Unicode and I get assorted "Parse Aborted: Lexical error..."
messages. I've also tried converting the Shift-JIS HTML files to Unicode by
first running them through the native2ascii tool.

When the files are converted via native2ascii, they index without errors,
but the index appears to contain the Unicode characters as literal strings
such as "u7aef", "u7af6", etc. Searching for an English word produces
results that have text like "code \u5c5e\u6027".

Since others have gotten Japanese indexing to work, what's the secret I'm
missing?

Thanks,
Jon
