Re: stored fields / unicode compression

Chris Hostetter Thu, 08 Jan 2009 17:26:57 -0800

Catching up on my holiday email, I don't think there were any replies to 
this question yet.


The low level file formats used by Lucene are an area I don't have the 
time/expertise to follow carefully, but if I'm remembering correctly the 
consensus is/was to move more towards pure (byte[] data, int start, int 
end) based APIs for efficiency, with "String" based APIs provided as 
syntactic sugar via a facade, and to deprecate the existing "internal" gzip 
compression in favor of similar "external" compression facades.  So 
something like you describe could be done as is using the byte[] 
interfaces *and* be generally useful to others.
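
As a rough illustration of that layering (a hypothetical sketch, not actual 
Lucene code: the class, interface and method names below are all made up), 
a byte-oriented codec with a thin String facade on top might look something 
like this.  GZIP is used only because it ships with the JDK; an SCSU or 
BOCU-1 codec would presumably plug into the same interface.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch; none of these names exist in Lucene. */
final class StoredTextCodecSketch {

    /** Byte-level codec using (data, start, end) ranges; end is exclusive. */
    interface ByteCodec {
        byte[] encode(byte[] data, int start, int end);
        byte[] decode(byte[] data, int start, int end);
    }

    /** "External" compression: GZIP applied outside the index format. */
    static final ByteCodec GZIP = new ByteCodec() {
        public byte[] encode(byte[] data, int start, int end) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
                gz.write(data, start, end - start);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return out.toByteArray();
        }

        public byte[] decode(byte[] data, int start, int end) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (GZIPInputStream gz = new GZIPInputStream(
                    new ByteArrayInputStream(data, start, end - start))) {
                byte[] buf = new byte[256];
                int n;
                while ((n = gz.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return out.toByteArray();
        }
    };

    /** String facade: syntactic sugar over the byte[] API. */
    static byte[] encode(String text, ByteCodec codec) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return codec.encode(utf8, 0, utf8.length);
    }

    static String decode(byte[] stored, ByteCodec codec) {
        byte[] utf8 = codec.decode(stored, 0, stored.length);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A short Cyrillic title (9 characters, 2 bytes each in UTF-8).
        String title = "\u0417\u0430\u0433\u043e\u043b\u043e\u0432\u043e\u043a";
        byte[] utf8 = title.getBytes(StandardCharsets.UTF_8);
        byte[] gzipped = encode(title, GZIP);
        System.out.println("UTF-8 bytes: " + utf8.length);
        System.out.println("GZIP bytes:  " + gzipped.length);
        System.out.println("round trip ok: " + decode(gzipped, GZIP).equals(title));
    }
}
```

On a short title like the one in main, the gzip form should come out larger 
than the plain UTF-8 bytes, since the gzip header and trailer alone take 18 
bytes.  That matches the point below about GZIP not helping short snippets.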

Taking a step back to look at the broader picture, this is the kind of 
thing that in Solr could be implemented as a new FieldType.

: Date: Fri, 26 Dec 2008 19:00:11 -0500
: From: Robert Muir
: Subject: stored fields / unicode compression
: 
: Has there been any thoughts of using SCSU or BOCU-1 instead of UTF-8 for
: stored fields?
: Personally I don't put huge amounts of text in stored fields but these
: encodings/compression work extremely well on short strings like titles, etc.
: Removing the unicode penalty for non-latin text (i.e. cut in half) is
: nothing to sneeze at since with lots of docs my stored fields still become
: pretty huge, biggest part of the index.
: 
: I know I could use one of these schemes right now and store everything as
: bytes... but just thinking it might be something of more general use. The
: GZIP compression that is supported isn't very useful as it typically makes
: short snippets bigger...
: 
: Performance compared to UTF-8 is here... seems like a general win to me (but
: maybe I am missing something)
: http://unicode.org/notes/tn6/#Performance


-Hoss

