Re: stored fields / unicode compression

Robert Muir Thu, 08 Jan 2009 18:45:50 -0800

thanks for the response, this sounds great. some way to plug in arbitrary
schemes would be helpful.


I've experimented with a few for my case, and unicode compression gave the
best bang for the buck, but I remember some of the other schemes, such as
arithmetic coding, seemed to provide wins for reasonably short fields where
gzip was still making them bigger...
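
To make the gzip point concrete, here is a rough sketch using only the JDK
(nothing Lucene-specific; the class and method names are made up for
illustration). It compares the plain UTF-8 size of a short title with its
gzipped size. gzip's fixed header and trailer alone are 18 bytes, so very
short fields should always come out larger:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration only; not part of Lucene.
class ShortFieldSizes {

    /** Size in bytes of the string encoded as plain UTF-8. */
    static int utf8Size(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Size in bytes of the gzipped UTF-8 encoding of the string. */
    static int gzipSize(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(s.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
        return buf.size();
    }

    public static void main(String[] args) {
        String[] titles = {
            "Lucene in Action",
            // Russian title "War and Peace", written with escapes to stay ASCII-safe.
            "\u0412\u043e\u0439\u043d\u0430 \u0438 \u043c\u0438\u0440"
        };
        for (String t : titles) {
            System.out.println(t + ": utf8=" + utf8Size(t) + " bytes, gzip=" + gzipSize(t) + " bytes");
        }
    }
}
```

The Cyrillic title is 11 characters but 20 bytes in UTF-8, which is the
"unicode penalty" from my original mail; gzip only adds to that at this length.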

On Thu, Jan 8, 2009 at 8:26 PM, Chris Hostetter <[email protected]> wrote:

>
> Catching up on my holiday email, I don't think there were any replies to
> this question yet.
>
> The low level file formats used by Lucene are an area I don't have
> time/expertise to follow carefully, but if I'm remembering correctly the
> consensus is/was to move more towards pure (byte[] data, int start, int
> end) based APIs for efficiency, with "String" based APIs provided as
> syntactic sugar via a facade, and deprecating the existing "internal" gzip
> compression in favor of similar "external" compression facades.  So
> something like you describe could be done as is using the byte[]
> interfaces *and* be generally useful to others.
>
> Taking a step back to look at the broader picture, this is the kind of
> thing that in Solr could be implemented as a new FieldType
>
> : Date: Fri, 26 Dec 2008 19:00:11 -0500
> : From: Robert Muir
> : Subject: stored fields / unicode compression
> :
> : Have there been any thoughts of using SCSU or BOCU-1 instead of UTF-8 for
> : stored fields?
> : Personally I don't put huge amounts of text in stored fields but these
> : encodings/compression work extremely well on short strings like titles,
> : etc.
> : Removing the unicode penalty for non-latin text (i.e. cut in half) is
> : nothing to sneeze at since with lots of docs my stored fields still
> : become pretty huge, biggest part of the index.
> :
> : I know I could use one of these schemes right now and store everything as
> : bytes... but just thinking it might be something of more general use. The
> : GZIP compression that is supported isn't very useful as it typically
> : makes short snippets bigger...
> :
> : Performance compared to UTF-8 is here... seems like a general win to me
> : (but maybe I am missing something)
> : http://unicode.org/notes/tn6/#Performance
>
>
> -Hoss

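For what it's worth, the byte[]-range API with a String facade that Hoss
describes might look roughly like the sketch below. All of the names
(FieldCodec, IdentityCodec, StringFieldFacade) are hypothetical and are not
existing Lucene classes; the identity codec just stands in for where an SCSU
or BOCU-1 implementation could be plugged in:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical byte-range codec interface; an assumption, not a Lucene API.
interface FieldCodec {
    /** Encodes the bytes in data[start, end) and returns the stored form. */
    byte[] encode(byte[] data, int start, int end);

    /** Decodes the stored bytes in data[start, end) back to the raw form. */
    byte[] decode(byte[] data, int start, int end);
}

// Placeholder codec that copies the range unchanged.
class IdentityCodec implements FieldCodec {
    public byte[] encode(byte[] data, int start, int end) {
        return Arrays.copyOfRange(data, start, end);
    }

    public byte[] decode(byte[] data, int start, int end) {
        return Arrays.copyOfRange(data, start, end);
    }
}

// String-based "syntactic sugar" layered on top of the byte-level codec.
class StringFieldFacade {
    private final FieldCodec codec;

    StringFieldFacade(FieldCodec codec) {
        this.codec = codec;
    }

    /** Converts a String to the bytes that would be written as the stored field. */
    byte[] toStored(String value) {
        byte[] raw = value.getBytes(StandardCharsets.UTF_8);
        return codec.encode(raw, 0, raw.length);
    }

    /** Converts stored bytes back into the original String. */
    String fromStored(byte[] stored) {
        byte[] raw = codec.decode(stored, 0, stored.length);
        return new String(raw, StandardCharsets.UTF_8);
    }
}
```

With something like this, the compression scheme stays external to the index
format: the index only ever sees bytes, and the facade decides how Strings map
to them.
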

-- 
Robert Muir
[email protected]
