[jira] Commented: (LUCENE-2810) Explore Alternate Stored Field approaches for highly redundant data

Robert Muir (JIRA) Mon, 13 Dec 2010 08:48:28 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970889#action_12970889
 ]


Robert Muir commented on LUCENE-2810:
-------------------------------------

bq. and the ability to reorder/change storage would be beneficial.

Right, I agree with the "general ability". What I am concerned with is any 
concrete implementation, as I believe that to be very app-specific.

In other words, we should make the storage flexible in general, definitely! 
This is completely unrelated to data redundancy; it's just something we should 
do so that users can more easily do what makes sense for their app.

But I'm not certain we should even provide the fundamental building blocks for 
"compression/duplication". This gets complicated fast (e.g. patented algorithms 
and cryptographic hash functions), let alone some concrete implementation 
that puts these together in anything close to a general way.

Other libraries likely provide this support better than we ever could. For 
Lucene, I think the focus shouldn't be on data redundancy in particular, but 
on making the storage API general enough that everyone's needs are met, not 
just your log file needs.


> Explore Alternate Stored Field approaches for highly redundant data
> -------------------------------------------------------------------
>
>                 Key: LUCENE-2810
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2810
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>
> In some cases (logs, HTML pages w/ boilerplate, etc.), the stored fields for 
> documents contain a lot of redundant information and end up wasting a lot of 
> space across a large collection of documents.  For instance, simply 
> compressing a typical log file often results in > 75% compression rates.  We 
> should explore mechanisms for applying compression across all the documents 
> for a field (or fields) while still maintaining relatively fast lookup (that 
> being said, in most logging applications, fast retrieval of a given event is 
> not always critical.)  For instance, perhaps it is possible to have a part of 
> storage that contains the set of unique values for all the fields and the 
> document field value simply contains a reference (could be as small as a few 
> bits depending on the number of unique items) to that value instead of having 
> a full copy.  Extending this, perhaps we can leverage some existing 
> compression capabilities in Java to provide this as well.  
> It may make sense to implement this as a Directory, but it might also make 
> sense as a Codec, if and when we have support for changing storage Codecs.
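
To make the "set of unique values plus per-document reference" idea in the 
description concrete, here is a minimal in-memory sketch in Java. It is not 
Lucene code and does not use any Lucene API: the class DictionaryFieldStore 
and all of its method names are hypothetical, chosen only for illustration. 
It assumes a single field with one value per document and keeps everything 
on the heap; a real implementation would have to deal with on-disk layout, 
segment merging, and deletions, none of which is shown here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not part of Lucene. */
class DictionaryFieldStore {
    // Maps each distinct value to its ordinal in uniqueValues.
    private final Map<String, Integer> ordinals = new HashMap<>();
    // Each distinct value is stored exactly once.
    private final List<String> uniqueValues = new ArrayList<>();
    // Per-document reference (ordinal) into uniqueValues.
    private final List<Integer> docToOrdinal = new ArrayList<>();

    /** Stores the field value for the next document and returns its doc id. */
    int addDocument(String value) {
        Integer ord = ordinals.get(value);
        if (ord == null) {
            // First time this value is seen: add it to the dictionary.
            ord = uniqueValues.size();
            ordinals.put(value, ord);
            uniqueValues.add(value);
        }
        docToOrdinal.add(ord);
        return docToOrdinal.size() - 1;
    }

    /** Resolves a document's reference back to the full stored value. */
    String value(int docId) {
        return uniqueValues.get(docToOrdinal.get(docId));
    }

    /** Number of distinct values held in the dictionary. */
    int uniqueCount() {
        return uniqueValues.size();
    }

    /** Minimum bits a packed reference would need for the current dictionary. */
    int bitsPerReference() {
        int n = uniqueValues.size();
        return n <= 1 ? 1 : 32 - Integer.numberOfLeadingZeros(n - 1);
    }
}
```

With three documents of which two share the same log line, the dictionary 
holds two entries and each reference would fit in a single bit if the 
ordinals were bit-packed. The sketch stores ordinals as boxed integers for 
readability, so it shows the layout rather than the actual space savings.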

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


