Re: Character encoding per index.

karl wettin Mon, 12 Dec 2005 10:04:04 -0800


On 12 Dec 2005 at 16:40, karl wettin wrote:

Hello list,
I'm looking for a way to change character encoding per index. It feels silly to store Chinese characters in 3 bytes using UTF-8 when it is possible to do it with 2 bytes using UTF-16. By just hacking the IndexInput and IndexOutput I got everything running in UTF-16, quick and dirty, but this is not good enough since I have other indexes that are better served by UTF-8.
The character encoding of Lucene today is quite static. In order to select an encoding, it seems to me I have to do some major refactoring of the project, passing a character codec from my analyzer (or perhaps IndexWriter/Reader) all the way down to the IndexInput/Output via TermVector/Info, etc.
Can someone think of a better way to set character encoding per index? Or perhaps some other thought?
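
To put numbers on the size difference described above, here is a minimal sketch using only the standard Java charset classes. It counts raw encoded bytes and nothing else: it does not reproduce Lucene's on-disk format, and the class and method names (EncodingSize, utf8Bytes, utf16Bytes) are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class EncodingSize {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the string occupies when encoded as UTF-16
    // (big-endian, no byte order mark).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String chinese = "中文";   // two CJK characters
        String latin = "index";    // five ASCII characters
        System.out.println("chinese: utf8=" + utf8Bytes(chinese)
                + " utf16=" + utf16Bytes(chinese));
        System.out.println("latin:   utf8=" + utf8Bytes(latin)
                + " utf16=" + utf16Bytes(latin));
    }
}
```

The two CJK characters take 6 bytes in UTF-8 and 4 in UTF-16, while the ASCII word takes 5 bytes in UTF-8 and 10 in UTF-16, which is why a single fixed encoding cannot suit both kinds of index.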

My current thought is to extend Directory (CharacterEncodingAwareDirectory or so) and all implementations of it to intercept the create/openFile methods and add a character encoding strategy to the IndexInput/Output.
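
A rough sketch of what such an encoding strategy could look like, written against plain java.io rather than the Lucene classes so that it runs on its own. Everything here is an assumption for illustration: CharacterCodec, CharsetCodec and the length-prefixed layout are invented names and choices, not existing Lucene API, and wiring this into Directory and IndexInput/Output is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CodecSketch {
    /** Hypothetical strategy interface; not part of Lucene. */
    interface CharacterCodec {
        void writeString(DataOutput out, String s) throws IOException;
        String readString(DataInput in) throws IOException;
    }

    /** Codec backed by a java.nio Charset, chosen once per index. */
    static final class CharsetCodec implements CharacterCodec {
        private final Charset charset;

        CharsetCodec(Charset charset) {
            this.charset = charset;
        }

        public void writeString(DataOutput out, String s) throws IOException {
            byte[] bytes = s.getBytes(charset);
            out.writeInt(bytes.length);   // 4-byte length prefix (assumed layout)
            out.write(bytes);
        }

        public String readString(DataInput in) throws IOException {
            byte[] bytes = new byte[in.readInt()];
            in.readFully(bytes);
            return new String(bytes, charset);
        }
    }

    // Encode a string to a byte array with the given codec.
    static byte[] encode(CharacterCodec codec, String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            codec.writeString(new DataOutputStream(buf), s);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decode a string previously written by the same codec.
    static String decode(CharacterCodec codec, byte[] data) {
        try {
            return codec.readString(
                    new DataInputStream(new ByteArrayInputStream(data)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        CharacterCodec utf8 = new CharsetCodec(StandardCharsets.UTF_8);
        CharacterCodec utf16 = new CharsetCodec(StandardCharsets.UTF_16BE);
        String term = "中文";
        System.out.println("utf8 bytes:  " + encode(utf8, term).length);
        System.out.println("utf16 bytes: " + encode(utf16, term).length);
        System.out.println("round trip:  " + decode(utf16, encode(utf16, term)));
    }
}
```

The point of the strategy object is that the reader and writer for one index share a single codec instance, so the choice is made once where the files are created or opened instead of being threaded through every layer.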

Is there a reason for the write/readCharacters methods in IndexInput/Output to be final?


--
karl
