Have there been any thoughts of using SCSU or BOCU-1 instead of UTF-8 for stored fields? Personally I don't put huge amounts of text in stored fields, but these encoding/compression schemes work extremely well on short strings like titles. Removing the Unicode penalty for non-Latin text (i.e. cutting its size roughly in half) is nothing to sneeze at: even with short values, with lots of docs my stored fields still become pretty huge and end up being the biggest part of the index.
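To make the size penalty concrete, here is a rough sketch in Python. The title string is a made-up example, and cp1251 is only a stand-in for "about one byte per character", since Python's standard library has no SCSU or BOCU-1 codec; actual SCSU or BOCU-1 output would differ somewhat in size.

```python
# Illustrative only: cp1251 is a stand-in for a one-byte-per-character
# encoding. It is NOT SCSU or BOCU-1, which the stdlib does not provide.
title = "Война и мир"  # hypothetical stored-field value, 11 characters

# UTF-8 spends 2 bytes on each Cyrillic letter and 1 byte on each space.
utf8_size = len(title.encode("utf-8"))

# A single-byte encoding spends 1 byte on every character.
single_byte_size = len(title.encode("cp1251"))

print(f"chars={len(title)} utf8={utf8_size} single_byte={single_byte_size}")
```

For this title the UTF-8 form takes 20 bytes against 11 for the single-byte form, which is the kind of gap I am hoping a Unicode compression scheme could close.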
I know I could use one of these schemes right now and store everything as bytes, but I'm thinking it might be something of more general use. The GZIP compression that is currently supported isn't very useful here, as it typically makes short snippets bigger. A performance comparison against UTF-8 is at http://unicode.org/notes/tn6/#Performance and it seems like a general win to me (but maybe I am missing something).

--
Robert Muir
rcm...@gmail.com