On 12 Dec 2005, at 16:40, karl wettin wrote:

Hello list,

I'm looking for a way to change the character encoding per index. It feels silly to store Chinese characters in 3 bytes each using UTF-8 when it is possible to do it with 2 bytes using UTF-16. With a quick and dirty hack of IndexInput and IndexOutput I got it all running in UTF-16, but this is not good enough, since I have other indexes that are stored more efficiently in UTF-8.
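To make the size difference concrete, here is a small sketch that uses only the JDK's standard charsets. It is not Lucene's own string serialization, and the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Lucene: compares encoded sizes of a string.
class EncodingSize {
    // Number of bytes the text occupies in standard UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes in UTF-16. The big-endian variant is used so that
    // no byte order mark is prepended to the output.
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d\u6587"; // two CJK characters
        String latin = "lucene";         // six ASCII characters
        System.out.println(utf8Bytes(chinese) + " vs " + utf16Bytes(chinese));
        System.out.println(utf8Bytes(latin) + " vs " + utf16Bytes(latin));
    }
}
```

The two CJK characters take 6 bytes in UTF-8 and 4 in UTF-16, while the ASCII word takes 6 bytes in UTF-8 and 12 in UTF-16, which is why neither encoding suits every index.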

The character encoding in Lucene today is quite static. To make the encoding selectable, it seems to me I would have to do some major refactoring of the project, passing a character codec from my analyzer (or perhaps IndexWriter/Reader) all the way down to the IndexInput/Output via TermVector/Info, etc.

Can someone think of a better way to set character encoding per index? Or perhaps some other thought?

My current thought is to extend Directory (CharacterEncodingAwareDirectory or so) and all its implementations so that they intercept the create/openFile methods and attach a character encoding strategy to the IndexInput/Output they return.
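Roughly, the idea could look like the sketch below. It is self-contained and does not use the real Lucene classes: EncodedOutput and EncodedInput are hypothetical stand-ins for IndexOutput and IndexInput, the in-memory map stands in for a real Directory implementation, and the length-prefixed string format is an assumption made only for this example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical directory that hands its charset to every stream it creates or opens.
class CharacterEncodingAwareDirectory {
    private final Charset charset;
    private final Map<String, EncodedOutput> files = new HashMap<>();

    CharacterEncodingAwareDirectory(Charset charset) { this.charset = charset; }

    // Intercepts file creation and injects the encoding strategy into the output.
    EncodedOutput createFile(String name) {
        EncodedOutput out = new EncodedOutput(charset);
        files.put(name, out);
        return out;
    }

    // Intercepts file opening and injects the same strategy into the input.
    EncodedInput openFile(String name) {
        return new EncodedInput(files.get(name).toByteArray(), charset);
    }

    long fileLength(String name) { return files.get(name).toByteArray().length; }

    public static void main(String[] args) {
        CharacterEncodingAwareDirectory dir =
                new CharacterEncodingAwareDirectory(StandardCharsets.UTF_16BE);
        dir.createFile("terms").writeString("\u4e2d\u6587");
        System.out.println(dir.openFile("terms").readString().length());
        System.out.println(dir.fileLength("terms"));
    }
}

// Stand-in for IndexOutput: encodes strings with the injected charset.
class EncodedOutput {
    private final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    private final Charset charset;

    EncodedOutput(Charset charset) { this.charset = charset; }

    // Writes a 4-byte big-endian length prefix followed by the encoded bytes.
    void writeString(String s) {
        byte[] encoded = s.getBytes(charset);
        bytes.write(encoded.length >>> 24);
        bytes.write(encoded.length >>> 16);
        bytes.write(encoded.length >>> 8);
        bytes.write(encoded.length);
        bytes.write(encoded, 0, encoded.length);
    }

    byte[] toByteArray() { return bytes.toByteArray(); }
}

// Stand-in for IndexInput: decodes strings with the injected charset.
class EncodedInput {
    private final ByteBuffer buffer;
    private final Charset charset;

    EncodedInput(byte[] data, Charset charset) {
        this.buffer = ByteBuffer.wrap(data); // big-endian by default
        this.charset = charset;
    }

    // Reads the length prefix, then decodes that many bytes.
    String readString() {
        byte[] encoded = new byte[buffer.getInt()];
        buffer.get(encoded);
        return new String(encoded, charset);
    }
}
```

With this shape the analyzer and the writer/reader never see the codec; only the directory and the streams it returns know which encoding the index uses.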

Is there a reason for the write/readCharacters in IndexInput/Output to be final?

--
karl
