[jira] [Commented] (LUCENE-2810) Explore Alternate Stored Field approaches for highly redundant data
[ https://issues.apache.org/jira/browse/LUCENE-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13471473#comment-13471473 ]

Adrien Grand commented on LUCENE-2810:
--------------------------------------

[~gsingers], I think we can close this issue given that LUCENE-4226 just got committed. Are you OK with that?

Explore Alternate Stored Field approaches for highly redundant data
-------------------------------------------------------------------

Key: LUCENE-2810
URL: https://issues.apache.org/jira/browse/LUCENE-2810
Project: Lucene - Core
Issue Type: Improvement
Components: core/store
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll

In some cases (logs, HTML pages w/ boilerplate, etc.), the stored fields for documents contain a lot of redundant information and end up wasting a lot of space across a large collection of documents. For instance, simply compressing a typical log file often results in 75% compression rates.

We should explore mechanisms for applying compression across all the documents for a field (or fields) while still maintaining relatively fast lookup (that being said, in most logging applications, fast retrieval of a given event is not always critical.)

For instance, perhaps it is possible to have a part of storage that contains the set of unique values for all the fields and the document field value simply contains a reference (could be as small as a few bits depending on the number of uniq. items) to that value instead of having a full copy. Extending this, perhaps we can leverage some existing compression capabilities in Java to provide this as well.

It may make sense to implement this as a Directory, but it might also make sense as a Codec, if and when we have support for changing storage Codecs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

----------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
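The issue description above suggests storing the set of unique values once and letting each document hold only a small reference to its value. A minimal sketch of that idea follows. The class `DictionaryStoredField` and its methods are hypothetical names made up for illustration; nothing here is a Lucene API, and everything is kept in memory rather than written to an index file.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: keep each distinct field value once, plus one small ordinal per document. */
public class DictionaryStoredField {
    private final Map<String, Integer> ordinals = new HashMap<>(); // value -> ordinal
    private final List<String> values = new ArrayList<>();         // ordinal -> value
    private final List<Integer> docToOrdinal = new ArrayList<>();  // docId -> ordinal

    /** Records the field value of the next document and returns its docId. */
    public int add(String value) {
        Integer ord = ordinals.get(value);
        if (ord == null) {
            // First time this value is seen: assign the next free ordinal.
            ord = values.size();
            ordinals.put(value, ord);
            values.add(value);
        }
        docToOrdinal.add(ord);
        return docToOrdinal.size() - 1;
    }

    /** Looks up the stored value of a document through its reference. */
    public String get(int docId) {
        return values.get(docToOrdinal.get(docId));
    }

    /** Number of distinct values held in the dictionary. */
    public int uniqueValues() {
        return values.size();
    }

    /** Bits each reference would need if the ordinals were bit-packed. */
    public int bitsPerReference() {
        int n = values.size();
        return n <= 1 ? 1 : 32 - Integer.numberOfLeadingZeros(n - 1);
    }

    public static void main(String[] args) {
        DictionaryStoredField field = new DictionaryStoredField();
        String[] levels = {"INFO", "WARN", "INFO", "ERROR", "INFO", "INFO"};
        for (String level : levels) {
            field.add(level);
        }
        System.out.println("unique values: " + field.uniqueValues());
        System.out.println("bits per reference: " + field.bitsPerReference());
        System.out.println("doc 3: " + field.get(3));
    }
}
```

The saving depends entirely on how many documents share a value: with nearly unique values the dictionary ends up as large as the original data and the references are pure overhead.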
[jira] [Commented] (LUCENE-2810) Explore Alternate Stored Field approaches for highly redundant data
[ https://issues.apache.org/jira/browse/LUCENE-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443679#comment-13443679 ]

Adrien Grand commented on LUCENE-2810:
--------------------------------------

Oops, I didn't know of this issue when I opened LUCENE-4226. It tries to solve a very similar issue I think!
[jira] Commented: (LUCENE-2810) Explore Alternate Stored Field approaches for highly redundant data
[ https://issues.apache.org/jira/browse/LUCENE-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12970868#action_12970868 ]

Grant Ingersoll commented on LUCENE-2810:
-----------------------------------------

{quote}
Providing a general purpose compression with reasonable random access seems redundant, modern filesystems will do this for you transparently (e.g. NTFS you just tell it that the .fdt should be compressed).
{quote}

This may be valid for some people, but not everyone has the ability to tell their admin (or their downstream users) to turn on compression for a particular file.
[jira] Commented: (LUCENE-2810) Explore Alternate Stored Field approaches for highly redundant data
[ https://issues.apache.org/jira/browse/LUCENE-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12970870#action_12970870 ]

Robert Muir commented on LUCENE-2810:
-------------------------------------

bq. Where in my email did I say that users had to use it?

We didn't force users to use the old compression either. But there are even emails on the user lists of someone asking 'where did compressed fields go', and we gave the reasons why, and then sure enough they reported back that it only made their data larger and slower.

So, I'm not sure we should add something so app-dependent to Lucene's core, as it depends very heavily on the content you are indexing. If people see compression in the core APIs they are going to assume that it works well in the general purpose case, but I'm trying to say that's very tricky to do.

A trivial example. Case 1: perhaps your documents have many fields, all redundant with each other. Case 2: this is very different from documents that have only 1 field that's heavily redundant while the rest are not, e.g. nearly unique metadata.

For these two use cases you need to implement the 'compression'/layout completely differently or you only introduce waste. In the case of many fields and the wrong block size you just make things bigger, and it acts like Compression 1.0 all over again.
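The point about granularity in the comment above can be checked with the JDK's own `java.util.zip.Deflater`. The sketch below compares compressing each value on its own against compressing many similar values as one block. The class name, the helper methods and the sample log lines are all made up for illustration, and this is not how Lucene lays out stored fields.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

/** Hypothetical sketch: per-value compression versus one shared block. */
public class CompressionGranularity {

    /** Returns how many bytes Deflater (zlib format) produces for the input. */
    public static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[1024];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer); // only the byte count matters here
        }
        deflater.end();
        return total;
    }

    /** Builds n similar, made-up log lines. */
    public static String[] logLines(int n) {
        String[] lines = new String[n];
        for (int i = 0; i < n; i++) {
            lines[i] = "2010-12-13 10:00:00 INFO request served status=200 id=" + i + "\n";
        }
        return lines;
    }

    /** Total size when every value is compressed separately. */
    public static int sizeCompressedIndividually(String[] values) {
        int total = 0;
        for (String value : values) {
            total += deflatedSize(value.getBytes(StandardCharsets.UTF_8));
        }
        return total;
    }

    /** Size when all values are concatenated and compressed together. */
    public static int sizeCompressedAsOneBlock(String[] values) {
        return deflatedSize(String.join("", values).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String[] lines = logLines(1000);
        int raw = String.join("", lines).getBytes(StandardCharsets.UTF_8).length;
        System.out.println("raw bytes:               " + raw);
        System.out.println("compressed individually: " + sizeCompressedIndividually(lines));
        System.out.println("compressed as one block: " + sizeCompressedAsOneBlock(lines));

        byte[] tiny = "id=7f3a9c".getBytes(StandardCharsets.UTF_8);
        System.out.println("tiny value: " + tiny.length + " bytes in, "
                + deflatedSize(tiny) + " bytes out");
    }
}
```

A very short value grows when compressed alone because the zlib header and checksum cost more than the compressor can save, while the shared block can exploit the redundancy across lines. The price of the larger block is that reading one value means decompressing its whole block.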
[jira] Commented: (LUCENE-2810) Explore Alternate Stored Field approaches for highly redundant data
[ https://issues.apache.org/jira/browse/LUCENE-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12970872#action_12970872 ]

Robert Muir commented on LUCENE-2810:
-------------------------------------

bq. This may be valid for some people, but not everyone has the ability to tell their admin (or their downstream users) to turn on compression for a particular file.

In this case you can do it automatically via an API as a normal user (e.g. from your Directory)
[jira] Commented: (LUCENE-2810) Explore Alternate Stored Field approaches for highly redundant data
[ https://issues.apache.org/jira/browse/LUCENE-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12970882#action_12970882 ]

Robert Muir commented on LUCENE-2810:
-------------------------------------

bq. Again, you seem to be hung up on the word compression, so let's stop using it. I'm not necessarily talking about compression here,

OK

To me it's compression either way, you can call it data deduplication if you want (modern filesystems do this too!).

bq. especially since retrieving stored fields is almost always one of the biggest performance killers in real world applications.

I haven't had this experience, please don't try to generalize for everyone.
[jira] Commented: (LUCENE-2810) Explore Alternate Stored Field approaches for highly redundant data
[ https://issues.apache.org/jira/browse/LUCENE-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12970889#action_12970889 ]

Robert Muir commented on LUCENE-2810:
-------------------------------------

bq. and the ability to reorder/change storage would be beneficial.

Right, I agree with the general ability. What I am concerned with is any concrete implementation, as I believe that to be very app-specific.

In other words, we should make the storage flexible in general, definitely! This is completely unrelated to data redundancy, it's just something we should do so that users can more easily do what makes sense for their app.

But I'm not certain we should even provide the fundamental building blocks for compression/duplication. This gets complicated fast (e.g. patented algorithms and cryptographic hash functions), forget about some concrete implementation that puts these together in anything close to a general way.

Other libraries likely provide this support better than we ever could. For Lucene, I think the focus shouldn't have anything to do with data redundancy in particular but just making the storage API general so that everyone's needs are met, not just your log file needs.
[jira] Commented: (LUCENE-2810) Explore Alternate Stored Field approaches for highly redundant data
[ https://issues.apache.org/jira/browse/LUCENE-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12970892#action_12970892 ]

Grant Ingersoll commented on LUCENE-2810:
-----------------------------------------

bq. Other libraries likely provide this support better than we ever could, for lucene i think the focus shouldn't have anything to do with data redundancy in particular but just making the storage API in general so that everyone's needs are met, not just your log file needs.

Totally agree. I haven't talked too much on implementation yet. Perhaps this alternate implementation is merely to bake in, via a contrib/module, one of these libraries underneath the hood. As with all of this, documentation is the key.

This isn't just limited to log file needs, though, although that is probably one of the most extreme use cases where an alternate storage mechanism would be beneficial.