[ https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adrien Grand updated LUCENE-5914:
---------------------------------

    Attachment: LUCENE-5914.patch

Here is a new patch that iterates on Robert's:
 - improved compression for numerics:
   - floats and doubles representing small integers take 1 byte
   - other positive floats and doubles take 4 / 8 bytes
   - other (negative) floats and doubles take 5 / 9 bytes
   - doubles that are actually casted floats take 5 bytes
   - longs are compressed if they represent a timestamp (2 bits are used to encode the fact that the number is a multiple of a second, hour, or day, or is uncompressed)
 - cleaned up the checkFooter calls in the reader
 - slightly better encoding of the offsets with the BEST_SPEED option by using monotonic encoding: this allows just slurping a sequence of bytes and then decoding a single value, instead of having to decode lengths and sum them up in order to compute offsets (the BEST_COMPRESSION option still does the latter, however)
 - fixed some javadoc errors

> More options for stored fields compression
> ------------------------------------------
>
>                 Key: LUCENE-5914
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5914
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>             Fix For: 5.0
>
>         Attachments: LUCENE-5914.patch, LUCENE-5914.patch, LUCENE-5914.patch, LUCENE-5914.patch, LUCENE-5914.patch
>
>
> Since we added codec-level compression in Lucene 4.1, I think I have had about as many users complain that compression is too aggressive as complain that it is too light.
> I think this is because users are doing very different things with Lucene. For example, if you have a small index that fits in the filesystem cache (or close to it), then you might never pay for actual disk seeks, and in that case the fact that the current stored fields format needs to over-decompress data can noticeably slow search down on cheap queries.
> On the other hand, it is more and more common to use Lucene for things like log analytics, where you have huge amounts of data and don't care much about stored fields performance. However, it is very frustrating to notice that the data you store takes several times less space when you gzip it than it does in your index, even though Lucene claims to compress stored fields.
> For that reason, I think it would be nice to have some options in the default codec that allow trading speed for compression.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
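The numeric byte-cost and timestamp-tag rules listed in the patch comment can be sketched roughly as follows. This is an illustrative model of the sizes described above, not Lucene's actual codec code; the class, method names, and the exact "small integer" range are invented for illustration:

```java
// Hypothetical sketch of the numeric compression rules described in the
// patch comment. Not the actual Lucene stored-fields writer; names and the
// small-integer cutoff (0..127) are assumptions.
public class NumericCompressionSketch {

  static final long SECOND = 1000L;           // millisecond timestamps
  static final long HOUR = 60 * 60 * SECOND;
  static final long DAY = 24 * HOUR;

  // 2-bit tag for a long: is it a multiple of a day, an hour, a second,
  // or does it have to be stored uncompressed?
  static int longTag(long v) {
    if (v % DAY == 0) return 3;
    if (v % HOUR == 0) return 2;
    if (v % SECOND == 0) return 1;
    return 0; // uncompressed
  }

  // Bytes needed for a double under the scheme described in the comment:
  // small integers -> 1, casted floats -> 5, other positive -> 8, negative -> 9.
  static int doubleCost(double d) {
    if (d >= 0 && d <= 127 && d == Math.floor(d) && !Double.isInfinite(d)) {
      return 1;                       // small integer
    }
    if ((double) (float) d == d) {
      return 5;                       // a float cast to double: tag + 4 bytes
    }
    return d >= 0 ? 8 : 9;            // raw double, +1 tag byte if negative
  }

  // Same idea for floats: small integers -> 1, positive -> 4, negative -> 5.
  static int floatCost(float f) {
    if (f >= 0 && f <= 127 && f == Math.floor(f) && !Float.isInfinite(f)) {
      return 1;
    }
    return f >= 0 ? 4 : 5;
  }
}
```

For example, a stored timestamp rounded to midnight (a multiple of a day) would get tag 3 and compress well, while an arbitrary millisecond value falls back to the uncompressed representation.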