Thanks for the response, this sounds great. Some way to plug in arbitrary schemes would be helpful.
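
A rough sketch of what such a plug-in point might look like, built on the byte[]-based style of API discussed in the quoted message. `FieldCodec`, `Utf8Codec`, and `GzipCodec` are hypothetical names made up for illustration, not existing Lucene classes, and only `java.util.zip` from the JDK is used. The sample title is short Cyrillic text, the kind of value where gzip's fixed header and trailer outweigh any savings:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical plug-in point (not a Lucene API): turns a stored field
// value into bytes and back, using (byte[] data, int start, int end).
interface FieldCodec {
    byte[] encode(String value);
    String decode(byte[] data, int start, int end);
}

// Baseline scheme: plain UTF-8.
class Utf8Codec implements FieldCodec {
    public byte[] encode(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }
    public String decode(byte[] data, int start, int end) {
        return new String(data, start, end - start, StandardCharsets.UTF_8);
    }
}

// UTF-8 wrapped in gzip via the JDK's java.util.zip streams.
class GzipCodec implements FieldCodec {
    public byte[] encode(String value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(value.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
    public String decode(byte[] data, int start, int end) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(
                new ByteArrayInputStream(data, start, end - start))) {
            byte[] buf = new byte[256];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}

class FieldCodecSketch {
    public static void main(String[] args) {
        // A short Cyrillic title ("War and Peace"), written with escapes
        // so the source file encoding does not matter.
        String title = "\u0412\u043e\u0439\u043d\u0430 \u0438 \u043c\u0438\u0440";
        FieldCodec[] codecs = { new Utf8Codec(), new GzipCodec() };
        for (FieldCodec codec : codecs) {
            byte[] bytes = codec.encode(title);
            boolean roundTrips = codec.decode(bytes, 0, bytes.length).equals(title);
            System.out.println(codec.getClass().getSimpleName() + ": "
                    + bytes.length + " bytes, round-trips: " + roundTrips);
        }
    }
}
```

An SCSU or BOCU-1 implementation could be dropped in behind the same interface; none is shown here because the JDK does not ship one.
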
I've experimented with a few for my case, and Unicode compression gave the best bang for the buck, but I remember that some of the other schemes, such as arithmetic coding, seemed to provide wins for reasonably short fields where gzip was still making them bigger...

On Thu, Jan 8, 2009 at 8:26 PM, Chris Hostetter <hossman_luc...@fucit.org> wrote:

>
> Catching up on my holiday email, I don't think there were any replies to
> this question yet.
>
> The low level file formats used by Lucene are an area I don't have
> time/expertise to follow carefully, but if I'm remembering correctly the
> consensus is/was to move more towards pure (byte[] data, int start, int
> end) based APIs for efficiency, with "String" based APIs provided as
> syntactic sugar via a facade, and deprecating the existing "internal" gzip
> compression in favor of similar "external" compression facades. So
> something like you describe could be done as is using the byte[]
> interfaces *and* be generally useful to others.
>
> Taking a step back to look at the broader picture, this is the kind of
> thing that in Solr could be implemented as a new FieldType
>
> : Date: Fri, 26 Dec 2008 19:00:11 -0500
> : From: Robert Muir
> : Subject: stored fields / unicode compression
> :
> : Has there been any thoughts of using SCSU or BOCU-1 instead of UTF-8 for
> : stored fields?
> : Personally I don't put huge amounts of text in stored fields but these
> : encodings/compression work extremely well on short strings like titles, etc.
> : Removing the unicode penalty for non-latin text (i.e. cut in half) is
> : nothing to sneeze at since with lots of docs my stored fields still become
> : pretty huge, biggest part of the index.
> :
> : I know I could use one of these schemes right now and store everything as
> : bytes... but just thinking it might be something of more general use. The
> : GZIP compression that is supported isn't very useful as it typically makes
> : short snippets bigger...
> :
> : Performance compared to UTF-8 is here... seems like a general win to me (but
> : maybe I am missing something)
> : http://unicode.org/notes/tn6/#Performance
>
>
> -Hoss
>

--
Robert Muir
rcm...@gmail.com