Problems indexing Japanese with CJKAnalyzer

Jon Schuster Fri, 02 Jul 2004 13:49:51 -0700

Hi,

I've gone through all of the past messages about the CJKAnalyzer, but I
must still be doing something wrong, because my searches don't work.


I'm using the IndexHTML application from the org.apache.lucene.demo package
to do the indexing, and I've changed the analyzer to use the CJKAnalyzer.
I've tried both with and without setting the file.encoding system property to Shift-JIS.
I've tried indexing the HTML files, which contain Shift-JIS, without
conversion to Unicode and I get assorted "Parse Aborted: Lexical error..."
messages. I've also tried converting the Shift-JIS HTML files to Unicode by
first running them through the native2ascii tool.

When the files are converted via native2ascii, they index without errors,
but the index appears to contain the Unicode characters as literal strings
such as "u7aef", "u7af6", etc. Searching for an English word produces
results that have text like "code \u5c5e\u6027".
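
To make the two cases above concrete, here is a minimal sketch (not the
IndexHTML code itself) of what differs between them. It assumes the problem is
how the bytes become Java characters before the analyzer sees them, which is a
guess rather than a confirmed diagnosis. The class name SjisDecodeSketch and
the decode helper are hypothetical names made up for this example; "Shift_JIS"
is the charset name used by the JDK.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, for illustration only; not part of the Lucene demo.
class SjisDecodeSketch {

    // Decode raw bytes with an explicitly named charset instead of relying
    // on the platform default (the file.encoding system property).
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) throws IOException {
        // args[0] is assumed to be the path of a Shift-JIS encoded file.
        byte[] raw = Files.readAllBytes(Paths.get(args[0]));
        String text = decode(raw, "Shift_JIS");
        System.out.println("decoded " + text.length() + " chars from "
                + raw.length + " bytes");

        // native2ascii output is different: each Japanese character becomes
        // six ASCII characters (backslash, 'u', four hex digits). Read back
        // as plain text, that is ordinary ASCII, not a single kanji.
        String escaped = "\\u7aef";
        System.out.println("escaped form has " + escaped.length() + " chars");
    }
}
```

If this assumption holds, text decoded the first way reaches the analyzer as
real Japanese characters, while native2ascii output reaches it as ASCII
letters and digits, which would be consistent with terms like "u7aef" showing
up in the index.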

Since others have gotten Japanese indexing to work, what's the secret I'm
missing?

Thanks,
Jon

