[ https://issues.apache.org/jira/browse/LUCENE-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970860#action_12970860 ]
Shai Erera commented on LUCENE-2810:
------------------------------------

bq. in any event, its useless to add any compression that doesn't beat what filesystems can already do on average.

I'm not sure it's *useless* ... consider an application like Google Desktop Search built on top of Lucene. You cannot force users to compress the installation folder, yet it would still be valuable for Lucene to compress data on its own, especially data it chooses to store. Such applications are special in that they offer a service that is installed on the user's machine, outside the control of the developer who actually built it. Therefore I find myself tuning my Lucene-based app as much as I can, and I often don't rely on users enabling certain OS features (and who knows whether those features will still exist one day?).

Today I handle compressed fields with Lucene's CompressionTools, and I'm generally happy with it. If, however, a compressed store improves the performance of my application by compressing the stored fields for me, achieving a better compression ratio, etc., it could be useful -- especially if integrating it is a no-brainer. I think, though, that we'd want to differentiate between fields: not all of them should be compressed, because compressed fields must be decompressed at retrieval time, which might be expensive for some apps.

> Stored Fields Compression
> -------------------------
>
>          Key: LUCENE-2810
>          URL: https://issues.apache.org/jira/browse/LUCENE-2810
>      Project: Lucene - Java
>   Issue Type: Improvement
>   Components: Store
>     Reporter: Grant Ingersoll
>     Assignee: Grant Ingersoll
>
> In some cases (logs, HTML pages with boilerplate, etc.), the stored fields for
> documents contain a lot of redundant information and end up wasting a lot of
> space across a large collection of documents. For instance, simply compressing
> a typical log file often yields compression rates above 75%.
> We should explore mechanisms for applying compression across all the documents
> for a field (or fields) while still maintaining relatively fast lookup (that
> said, in most logging applications, fast retrieval of a given event is not
> always critical). For instance, perhaps part of the storage could hold the set
> of unique values across all the fields, with each document field value holding
> only a reference (possibly as small as a few bits, depending on the number of
> unique items) to that value instead of a full copy. Extending this, perhaps we
> can also leverage existing compression capabilities in Java to provide this.
> It may make sense to implement this as a Directory, but it might also make
> sense as a Codec, if and when we have support for changing storage Codecs.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
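The per-field compression the comment describes can be sketched with the JDK alone: Lucene's CompressionTools is a thin wrapper around java.util.zip's DEFLATE classes, and a minimal stand-in using Deflater/Inflater directly looks like the following. The class and method names here are illustrative, not part of Lucene's API.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/**
 * Illustrative sketch of DEFLATE-based field compression, in the spirit of
 * Lucene's CompressionTools (which wraps java.util.zip.Deflater). Not a
 * Lucene class -- the names here are hypothetical.
 */
public class FieldCompression {

  /** Compress a stored field value; redundant text (logs, HTML
   *  boilerplate) typically shrinks substantially under DEFLATE. */
  public static byte[] compress(byte[] value) {
    Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
    deflater.setInput(value);
    deflater.finish();
    ByteArrayOutputStream out = new ByteArrayOutputStream(value.length);
    byte[] buf = new byte[4096];
    while (!deflater.finished()) {
      out.write(buf, 0, deflater.deflate(buf));
    }
    deflater.end();
    return out.toByteArray();
  }

  /** Decompress a value produced by compress(); this is the cost paid at
   *  retrieval time, which is why not every field should be compressed. */
  public static byte[] decompress(byte[] value) throws DataFormatException {
    Inflater inflater = new Inflater();
    inflater.setInput(value);
    ByteArrayOutputStream out = new ByteArrayOutputStream(value.length * 2);
    byte[] buf = new byte[4096];
    while (!inflater.finished()) {
      out.write(buf, 0, inflater.inflate(buf));
    }
    inflater.end();
    return out.toByteArray();
  }
}
```

The decompress step is exactly the per-retrieval cost the comment warns about, which is the argument for marking compression per field rather than applying it globally.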
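The issue description's other idea, storing each unique field value once and keeping only a small integer reference per document, is ordinary dictionary encoding. A hypothetical in-memory sketch (the class and method names are illustrative, not a Lucene API) might look like:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch of the unique-values idea from the issue description:
 * each distinct field value is stored once, and documents keep only its
 * ordinal. With few unique values, the ordinal can be packed into a few bits.
 */
public class ValueDictionary {
  private final Map<String, Integer> ids = new HashMap<>();
  private final List<String> values = new ArrayList<>();

  /** Returns the ordinal for a value, adding it on first sight. */
  public int add(String value) {
    Integer id = ids.get(value);
    if (id == null) {
      id = values.size();
      ids.put(value, id);
      values.add(value);
    }
    return id;
  }

  /** Resolves an ordinal back to its value at retrieval time. */
  public String lookup(int id) {
    return values.get(id);
  }

  /** Number of distinct values; determines the bits needed per reference. */
  public int uniqueCount() {
    return values.size();
  }
}
```

For a field like a log level with a handful of distinct values, each document's stored value collapses to an ordinal needing only a few bits, which is the compression-ratio win the description anticipates.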