[jira] [Commented] (LUCENE-2810) Explore Alternate Stored Field approaches for highly redundant data

2012-10-08 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471473#comment-13471473
 ] 

Adrien Grand commented on LUCENE-2810:
--

[~gsingers], I think we can close this issue given that LUCENE-4226 just got 
committed. Are you OK with that?

 Explore Alternate Stored Field approaches for highly redundant data
 ---

 Key: LUCENE-2810
 URL: https://issues.apache.org/jira/browse/LUCENE-2810
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/store
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll

 In some cases (logs, HTML pages w/ boilerplate, etc.), the stored fields for 
 documents contain a lot of redundant information and end up wasting a lot of 
 space across a large collection of documents.  For instance, simply 
 compressing a typical log file often results in  75% compression rates.  We 
 should explore mechanisms for applying compression across all the documents 
 for a field (or fields) while still maintaining relatively fast lookup (that 
 being said, in most logging applications, fast retrieval of a given event is 
 not always critical.)  For instance, perhaps it is possible to have a part of 
 storage that contains the set of unique values for all the fields and the 
 document field value simply contains a reference (could be as small as a few 
 bits depending on the number of uniq. items) to that value instead of having 
 a full copy.  Extending this, perhaps we can leverage some existing 
 compression capabilities in Java to provide this as well.  
 It may make sense to implement this as a Directory, but it might also make 
 sense as a Codec, if and when we have support for changing storage Codecs.
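
A minimal sketch of the unique-value idea in the description above, assuming a purely in-memory structure rather than any actual Lucene storage format. The class DictionaryEncodedField and all of its methods are hypothetical names introduced only for illustration; nothing here is taken from the Lucene code base. Each distinct value is stored once, and every document keeps only a small integer reference to it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration: one copy per distinct value, one reference per document. */
final class DictionaryEncodedField {
    // Maps each distinct value to its position in uniqueValues.
    private final Map<String, Integer> ordinals = new HashMap<>();
    // The set of unique values, stored exactly once each.
    private final List<String> uniqueValues = new ArrayList<>();
    // Per-document reference into uniqueValues (index = document id).
    private final List<Integer> docToOrdinal = new ArrayList<>();

    /** Records the value for the next document and returns that document's id. */
    int add(String value) {
        Integer ord = ordinals.get(value);
        if (ord == null) {
            ord = uniqueValues.size();
            ordinals.put(value, ord);
            uniqueValues.add(value);
        }
        docToOrdinal.add(ord);
        return docToOrdinal.size() - 1;
    }

    /** Resolves a document's reference back to the full value. */
    String get(int docId) {
        return uniqueValues.get(docToOrdinal.get(docId));
    }

    int uniqueCount() {
        return uniqueValues.size();
    }

    /** Bits a packed encoding would need per reference, given the number of unique values. */
    int bitsPerReference() {
        int n = uniqueValues.size();
        return n <= 1 ? 1 : 32 - Integer.numberOfLeadingZeros(n - 1);
    }

    public static void main(String[] args) {
        DictionaryEncodedField field = new DictionaryEncodedField();
        field.add("GET /index.html 200");
        field.add("GET /missing 404");
        int last = field.add("GET /index.html 200");
        System.out.println(field.get(last));
        System.out.println(field.uniqueCount() + " unique values, "
                + field.bitsPerReference() + " bit(s) per reference");
    }
}
```

The sketch keeps references as boxed integers for readability; an on-disk layout would pack them to bitsPerReference() bits each, which is where the space saving for highly redundant fields would come from.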

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2810) Explore Alternate Stored Field approaches for highly redundant data

2012-08-28 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443679#comment-13443679
 ] 

Adrien Grand commented on LUCENE-2810:
--

Oops, I didn't know of this issue when I opened LUCENE-4226. It tries to solve 
a very similar issue I think!



[jira] Commented: (LUCENE-2810) Explore Alternate Stored Field approaches for highly redundant data

2010-12-13 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970868#action_12970868
 ] 

Grant Ingersoll commented on LUCENE-2810:
-

{quote}
Providing a general purpose compression with reasonable random access seems 
redundant;
modern filesystems will do this for you transparently (e.g. on NTFS you just 
tell it that the .fdt file should be compressed).{quote}

This may be valid for some people, but not everyone has the ability to tell 
their admin (or their downstream users) to turn on compression for a particular 
file.



[jira] Commented: (LUCENE-2810) Explore Alternate Stored Field approaches for highly redundant data

2010-12-13 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970870#action_12970870
 ] 

Robert Muir commented on LUCENE-2810:
-

bq. Where in my email did I say that users had to use it?

We didn't force users to use the old compression either. But there are even 
emails on the user lists from someone asking 'where did compressed fields go?'. 
We explained the reasons why, and sure enough they reported back that 
compression had only made their data larger and slower.

So, I'm not sure we should add something so app-dependent to Lucene's core, as 
it depends very heavily on the content you are indexing.
If people see compression in the core APIs they are going to assume that it 
works well in the general-purpose case, but I'm trying to say
that's very tricky to do.

A trivial example: 
Case 1: perhaps your documents have many fields, all redundant with each other.
Case 2: this is very different from documents that have only one field that's 
heavily redundant while the rest are not, e.g. nearly unique metadata.

For these two use cases you need to implement the 'compression'/layout 
completely differently or you only introduce waste. With many fields and the 
wrong block size you just make things bigger, and it acts like Compression 
1.0 all over again.
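
As a rough illustration of the point above, the following sketch compares the compressed size of a short, nearly unique value with that of a highly redundant block. It assumes the JDK's java.util.zip.Deflater as a stand-in compressor; this is not Lucene's stored-fields code, and the class name SmallBlockCompression and the sample strings are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

/** Hypothetical illustration: compression helps redundant blocks and can hurt tiny unique ones. */
final class SmallBlockCompression {

    /** Returns the number of bytes produced by compressing the input with default settings. */
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[1024];
        int total = 0;
        // Drain the compressor; only the output length is of interest here.
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        // A short, nearly unique value, compressed on its own.
        byte[] unique = "a1b2c3d4".getBytes(StandardCharsets.UTF_8);

        // Many copies of the same log-like line, compressed as one block.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100; i++) {
            sb.append("INFO request handled status=200 path=/index.html\n");
        }
        byte[] redundant = sb.toString().getBytes(StandardCharsets.UTF_8);

        System.out.println("unique:    " + unique.length + " -> " + deflatedSize(unique));
        System.out.println("redundant: " + redundant.length + " -> " + deflatedSize(redundant));
    }
}
```

The short value grows because the compressed stream's fixed header and checksum outweigh any savings, while the redundant block shrinks substantially. Which of the two a given index resembles depends on its fields and on how values are grouped into blocks.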




[jira] Commented: (LUCENE-2810) Explore Alternate Stored Field approaches for highly redundant data

2010-12-13 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970872#action_12970872
 ] 

Robert Muir commented on LUCENE-2810:
-

bq. This may be valid for some people, but not everyone has the ability to tell 
their admin (or their downstream users) to turn on compression for a particular 
file.

In this case you can do it automatically via an API as a normal user (e.g. from 
your Directory).




[jira] Commented: (LUCENE-2810) Explore Alternate Stored Field approaches for highly redundant data

2010-12-13 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970882#action_12970882
 ] 

Robert Muir commented on LUCENE-2810:
-

bq. Again, you seem to be hung up on the word compression, so let's stop using 
it. I'm not necessarily talking about compression here, OK

To me it's compression either way; you can call it data deduplication if you 
want (modern filesystems do this too!).

bq. especially since retrieving stored fields is almost always one of the 
biggest performance killers in real world applications.

I haven't had this experience; please don't try to generalize for everyone.




[jira] Commented: (LUCENE-2810) Explore Alternate Stored Field approaches for highly redundant data

2010-12-13 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970889#action_12970889
 ] 

Robert Muir commented on LUCENE-2810:
-

bq. and the ability to reorder/change storage would be beneficial.

Right, I agree with the general ability. What I am concerned with is any 
concrete implementation, as I believe that to be very app-specific.

In other words, we should make the storage flexible in general, definitely! 
This is completely unrelated to data redundancy; it's just something we should 
do so that users can more easily do what makes sense for their app.

But I'm not certain we should even provide the fundamental building blocks for 
compression/deduplication. This gets complicated fast (e.g. patented algorithms 
and cryptographic hash functions), let alone some concrete implementation 
that puts these together in anything close to a general way.

Other libraries likely provide this support better than we ever could. For 
Lucene, I think the focus shouldn't have anything to do with data redundancy in 
particular but just making the storage API general so that everyone's needs 
are met, not just your log file needs.




[jira] Commented: (LUCENE-2810) Explore Alternate Stored Field approaches for highly redundant data

2010-12-13 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970892#action_12970892
 ] 

Grant Ingersoll commented on LUCENE-2810:
-

bq. Other libraries likely provide this support better than we ever could, for 
lucene i think the focus shouldn't have anything to do with data redundancy in 
particular but just making the storage API in general so that everyone's needs 
are met, not just your log file needs.

Totally agree.  I haven't talked much about implementation yet.  Perhaps this 
alternate implementation is merely to bake in, via a contrib/module, one of 
these libraries under the hood.  As with all of this, documentation is the 
key.

This isn't limited to log file needs, though that is probably one of the most 
extreme use cases where an alternate storage mechanism would be 
beneficial.
