Catching up on my holiday email, I don't think there were any replies to this question yet.
The low-level file formats used by Lucene are an area I don't have the time/expertise to follow carefully, but if I'm remembering correctly the consensus is/was to move more towards pure (byte[] data, int start, int end) based APIs for efficiency, with "String" based APIs provided as syntactic sugar via a facade, and to deprecate the existing "internal" gzip compression in favor of similar "external" compression facades.

So something like you describe could be done as is using the byte[] interfaces *and* be generally useful to others.

Taking a step back to look at the broader picture, this is the kind of thing that in Solr could be implemented as a new FieldType.

: Date: Fri, 26 Dec 2008 19:00:11 -0500
: From: Robert Muir
: Subject: stored fields / unicode compression
:
: Has there been any thoughts of using SCSU or BOCU-1 instead of UTF-8 for
: stored fields?
: Personally I don't put huge amounts of text in stored fields but these
: encodings/compression work extremely well on short strings like titles, etc.
: Removing the unicode penalty for non-latin text (i.e. cut in half) is
: nothing to sneeze at since with lots of docs my stored fields still become
: pretty huge, biggest part of the index.
:
: I know I could use one of these schemes right now and store everything as
: bytes... but just thinking it might be something of more general use. The
: GZIP compression that is supported isn't very useful as it typically makes
: short snippets bigger...
:
: Performance compared to UTF-8 is here... seems like a general win to me (but
: maybe I am missing something)
: http://unicode.org/notes/tn6/#Performance

-Hoss
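
For anyone who wants to see the size effects discussed above, here is a rough sketch using only the JDK. It is not Lucene or Solr code: the class name StoredFieldSizes and both gzip methods are made up for illustration. It mimics the shape described in the thread, with a (byte[] data, int start, int end) method doing the work and a String overload as a thin facade, and it compares the UTF-8 size of a short Cyrillic title with its gzipped size. It does not implement SCSU or BOCU-1.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class StoredFieldSizes {

    /** Hypothetical byte-oriented core: gzip the bytes in data[start, end). */
    static byte[] gzip(byte[] data, int start, int end) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Closing the GZIPOutputStream writes the gzip trailer (CRC and length).
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data, start, end - start);
        }
        return out.toByteArray();
    }

    /** Hypothetical String facade: syntactic sugar over the byte[] method. */
    static byte[] gzip(String value) throws IOException {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        return gzip(utf8, 0, utf8.length);
    }

    public static void main(String[] args) throws IOException {
        // "War and Peace" in Russian, written with escapes so the source stays ASCII.
        // 11 chars: 9 Cyrillic letters (2 bytes each in UTF-8) plus 2 spaces.
        String title = "\u0412\u043e\u0439\u043d\u0430 \u0438 \u043c\u0438\u0440";

        byte[] utf8 = title.getBytes(StandardCharsets.UTF_8);
        byte[] zipped = gzip(title);

        System.out.println("chars:         " + title.length());
        System.out.println("UTF-8 bytes:   " + utf8.length);
        System.out.println("gzipped bytes: " + zipped.length);
    }
}
```

The gzip container adds a fixed header and trailer of 18 bytes before any compressed data, which is why a title this short comes out larger than its plain UTF-8 form. Keeping the String overload as a wrapper means callers that already hold bytes never pay for an extra String conversion.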