[ https://issues.apache.org/jira/browse/LUCENE-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970860#action_12970860 ]
Shai Erera commented on LUCENE-2810:
------------------------------------

bq. in any event, its useless to add any compression that doesn't beat what filesystems can already do on average.

I'm not sure it's *useless* ... consider an application like Google Desktop Search built on top of Lucene. You cannot force users to compress the installation folder, yet it would still be valuable for Lucene to compress data on its own, especially data it chooses to store. Such applications are special in that they offer a service that is installed on the user's machine, outside the control of the developer who actually built it. Therefore I find myself tuning my Lucene-based app as much as I can, and I often don't rely on users enabling certain OS features (and who knows whether those features will still exist one day?).

Today I handle compressed fields with Lucene's CompressionTools, and I'm generally happy with it. If, however, a compressed store improves the performance of my application by compressing the stored fields for me, achieving a better compression ratio, etc., it could be useful -- especially if integrating it is a no-brainer. I think, though, that we'd want to differentiate between fields: not all of them should be compressed, because compressed fields must be decompressed at retrieval time, which might be expensive for some apps.

> Stored Fields Compression
> -------------------------
>
>          Key: LUCENE-2810
>          URL: https://issues.apache.org/jira/browse/LUCENE-2810
>      Project: Lucene - Java
>   Issue Type: Improvement
>   Components: Store
>     Reporter: Grant Ingersoll
>     Assignee: Grant Ingersoll
>
> In some cases (logs, HTML pages with boilerplate, etc.), the stored fields for
> documents contain a lot of redundant information and end up wasting a lot of
> space across a large collection of documents. For instance, simply compressing
> a typical log file often yields compression rates above 75%.
> We should explore mechanisms for applying compression across all the documents
> for a field (or fields) while still maintaining relatively fast lookup (that
> said, in most logging applications, fast retrieval of a given event is not
> always critical). For instance, perhaps part of the storage could hold the set
> of unique values across all the fields, with each document field value holding
> only a reference (possibly as small as a few bits, depending on the number of
> unique items) to that value instead of a full copy. Extending this, perhaps we
> can also leverage existing compression capabilities in Java to provide this.
> It may make sense to implement this as a Directory, but it might also make
> sense as a Codec, if and when we have support for changing storage Codecs.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
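The per-field compression the comment describes can be sketched with the JDK alone: Lucene's CompressionTools is a thin wrapper around java.util.zip's DEFLATE classes, and a minimal stand-in using Deflater/Inflater directly looks like the following. The class and method names here are illustrative, not part of Lucene's API.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/**
 * Illustrative sketch of DEFLATE-based field compression, in the spirit of
 * Lucene's CompressionTools (which wraps java.util.zip.Deflater). Not a
 * Lucene class -- the names here are hypothetical.
 */
public class FieldCompression {

  /** Compress a stored field value; redundant text (logs, HTML
   *  boilerplate) typically shrinks substantially under DEFLATE. */
  public static byte[] compress(byte[] value) {
    Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
    deflater.setInput(value);
    deflater.finish();
    ByteArrayOutputStream out = new ByteArrayOutputStream(value.length);
    byte[] buf = new byte[4096];
    while (!deflater.finished()) {
      out.write(buf, 0, deflater.deflate(buf));
    }
    deflater.end();
    return out.toByteArray();
  }

  /** Decompress a value produced by compress(); this is the cost paid at
   *  retrieval time, which is why not every field should be compressed. */
  public static byte[] decompress(byte[] value) throws DataFormatException {
    Inflater inflater = new Inflater();
    inflater.setInput(value);
    ByteArrayOutputStream out = new ByteArrayOutputStream(value.length * 2);
    byte[] buf = new byte[4096];
    while (!inflater.finished()) {
      out.write(buf, 0, inflater.inflate(buf));
    }
    inflater.end();
    return out.toByteArray();
  }
}
```

The decompress step is exactly the per-retrieval cost the comment warns about, which is the argument for marking compression per field rather than applying it globally.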
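The issue description's other idea, storing each unique field value once and keeping only a small integer reference per document, is ordinary dictionary encoding. A hypothetical in-memory sketch (the class and method names are illustrative, not a Lucene API) might look like:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch of the unique-values idea from the issue description:
 * each distinct field value is stored once, and documents keep only its
 * ordinal. With few unique values, the ordinal can be packed into a few bits.
 */
public class ValueDictionary {
  private final Map<String, Integer> ids = new HashMap<>();
  private final List<String> values = new ArrayList<>();

  /** Returns the ordinal for a value, adding it on first sight. */
  public int add(String value) {
    Integer id = ids.get(value);
    if (id == null) {
      id = values.size();
      ids.put(value, id);
      values.add(value);
    }
    return id;
  }

  /** Resolves an ordinal back to its value at retrieval time. */
  public String lookup(int id) {
    return values.get(id);
  }

  /** Number of distinct values; determines the bits needed per reference. */
  public int uniqueCount() {
    return values.size();
  }
}
```

For a field like a log level with a handful of distinct values, each document's stored value collapses to an ordinal needing only a few bits, which is the compression-ratio win the description anticipates.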