Re: Character encoding per index.

Marvin Humphrey Mon, 12 Dec 2005 10:14:43 -0800

On Dec 12, 2005, at 10:04 AM, karl wettin wrote:

On 12 Dec 2005, at 16:40, karl wettin wrote:
Hello list,
I'm looking for a way to change character encoding per index. It feels silly to store Chinese characters in 3 bytes using UTF-8 when it is possible to do it with 2 bytes using UTF-16. By just hacking the IndexInput and IndexOutput I got it all running in UTF-16, quick and dirty, but this is not good enough since I have other indexes that are better optimized when encoded in UTF-8.
The character encoding of Lucene today is quite static. In order to select an encoding, it seems to me I have to do some major refactoring of the project, passing a character codec from my analyzer (or perhaps IndexWriter/Reader) all the way down to the IndexInput/Output via TermVector/Info, etc.

On a side note, this is another issue that I believe can be addressed by using a byte count instead of a char count at the head of Lucene's Strings.


A byte-based TermBuffer needn't care what encoding the Strings are in.
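
To illustrate the idea, here is a minimal sketch in Python rather than Lucene's Java. It assumes a simple layout of a fixed 4-byte byte count followed by the encoded bytes. The helper names write_string, read_raw and skip_string are hypothetical and are not part of Lucene's API, and the fixed-width prefix is chosen only to keep the example short.

```python
import io
import struct

def write_string(out, text, encoding):
    """Write text as a 4-byte big-endian byte count followed by the encoded bytes."""
    data = text.encode(encoding)
    out.write(struct.pack(">I", len(data)))
    out.write(data)

def read_raw(inp):
    """Read one length-prefixed string as raw bytes, without decoding it."""
    (count,) = struct.unpack(">I", inp.read(4))
    return inp.read(count)

def skip_string(inp):
    """Skip one length-prefixed string without looking at its contents."""
    (count,) = struct.unpack(">I", inp.read(4))
    inp.seek(count, io.SEEK_CUR)

# The same two Chinese characters, stored in two different encodings.
buf = io.BytesIO()
write_string(buf, "中文", "utf-8")      # 3 bytes per character
write_string(buf, "中文", "utf-16-be")  # 2 bytes per character
write_string(buf, "tail", "utf-8")

buf.seek(0)
utf8_bytes = read_raw(buf)   # reader does not need to know the encoding
skip_string(buf)             # skipping works the same way for UTF-16
tail_bytes = read_raw(buf)
```

Because the prefix counts bytes, the reader can copy, compare or skip a string without decoding it. Only the code that finally turns the bytes into text has to know which encoding the index was written with.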

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

