[ https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-5914:
---------------------------------
    Attachment: LUCENE-5914.patch

Here is a new patch. A quick reminder of what it brings:
 - the ability to use shared dictionaries when compressing documents: you get 
the benefit of compressing several documents in a single block while keeping 
the ability to decompress a single document at a time (rather than the entire 
block). This can make decompressing a single document significantly faster if 
your documents are small.
 - you can now trade speed for compression: in that case the format falls back 
to compressing all documents of a block at once with deflate, which makes 
retrieval slower but the compression ratio better.
 - the stored fields reader keeps state so that iterating over documents in 
order is fast and does not decompress the same data twice (useful for merging 
and for exporting the content of the index)
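To illustrate the shared-dictionary idea, here is a minimal, self-contained sketch using java.util.zip's deflate with a preset dictionary (the patch itself does not use java.util.zip, and the dictionary and document bytes below are made-up placeholders): once a block's shared dictionary is available, any single document can be decompressed on its own without touching the other documents of the block.

```java
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class SharedDictionaryDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical shared dictionary: bytes common to many small documents.
        byte[] dict = "user= status=200 path=/index msg=".getBytes("UTF-8");
        byte[] doc = "user=jdoe status=200 path=/index msg=ok".getBytes("UTF-8");

        // Compress one document against the preset dictionary.
        Deflater def = new Deflater();
        def.setDictionary(dict);
        def.setInput(doc);
        def.finish();
        byte[] compressed = new byte[1024];
        int clen = def.deflate(compressed);
        def.end();

        // Decompress just this document; the inflater asks for the
        // dictionary before it can produce any output.
        Inflater inf = new Inflater();
        inf.setInput(compressed, 0, clen);
        byte[] restored = new byte[doc.length];
        int n = inf.inflate(restored);
        if (n == 0 && inf.needsDictionary()) {
            inf.setDictionary(dict);
            n = inf.inflate(restored);
        }
        inf.end();

        System.out.println(new String(restored, 0, n, "UTF-8")
            .equals(new String(doc, "UTF-8")));
    }
}
```

Because the dictionary is shared across the block, each document still benefits from cross-document redundancy, yet retrieval does not have to decompress its neighbors.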

Compared to the previous version of the patch, there are now two standalone 
stored fields formats, one per option, which makes testing easier, as well as a 
main Lucene50StoredFieldsFormat that delegates to these formats based on the 
result of Lucene50Codec.getStoredFieldsCompression.
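A minimal sketch of that delegation pattern (the class and method names below are illustrative placeholders, not Lucene's actual API): a front-end format picks one of two concrete formats based on the codec's compression option.

```java
// Hypothetical names throughout; this only illustrates the delegation shape.
enum StoredFieldsCompression { BEST_SPEED, BEST_COMPRESSION }

interface StoredFieldsFormatSketch {
    String description();
}

class FastFormat implements StoredFieldsFormatSketch {
    public String description() { return "per-document decompression"; }
}

class HighCompressionFormat implements StoredFieldsFormatSketch {
    public String description() { return "whole-block deflate"; }
}

public class DelegatingStoredFieldsFormat implements StoredFieldsFormatSketch {
    private final StoredFieldsFormatSketch fast = new FastFormat();
    private final StoredFieldsFormatSketch high = new HighCompressionFormat();
    private final StoredFieldsCompression mode;

    DelegatingStoredFieldsFormat(StoredFieldsCompression mode) {
        this.mode = mode;
    }

    // Every operation is forwarded to the format selected by the option.
    public String description() {
        StoredFieldsFormatSketch delegate =
            (mode == StoredFieldsCompression.BEST_SPEED) ? fast : high;
        return delegate.description();
    }

    public static void main(String[] args) {
        System.out.println(new DelegatingStoredFieldsFormat(
            StoredFieldsCompression.BEST_COMPRESSION).description());
    }
}
```

Keeping the two concrete formats standalone means each can be exercised directly by tests, while the delegating format stays a thin switch.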

> More options for stored fields compression
> ------------------------------------------
>
>                 Key: LUCENE-5914
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5914
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>             Fix For: 5.0
>
>         Attachments: LUCENE-5914.patch, LUCENE-5914.patch
>
>
> Since we added codec-level compression in Lucene 4.1, I have received roughly 
> equal numbers of complaints that compression is too aggressive and that it is 
> too light.
> I think this is because users do very different things with Lucene. For 
> example, if you have a small index that fits in the filesystem cache (or 
> nearly does), you might never pay for actual disk seeks, and in that case the 
> fact that the current stored fields format needs to over-decompress data can 
> noticeably slow down cheap queries.
> On the other hand, it is more and more common to use Lucene for things like 
> log analytics, where you have huge amounts of data and don't care much about 
> stored fields performance. In that case it is very frustrating to notice that 
> your raw data takes several times less space when you gzip it than it does in 
> your index, even though Lucene claims to compress stored fields.
> For these reasons, I think it would be nice to have options in the default 
> codec that allow trading speed for compression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
