[jira] [Commented] (LUCENE-2810) Explore Alternate Stored Field approaches for highly redundant data

2012-10-08 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471473#comment-13471473
 ] 

Adrien Grand commented on LUCENE-2810:
--

[~gsingers], I think we can close this issue given that LUCENE-4226 just got 
committed. Are you OK with that?

 Explore Alternate Stored Field approaches for highly redundant data
 -------------------------------------------------------------------

         Key: LUCENE-2810
         URL: https://issues.apache.org/jira/browse/LUCENE-2810
     Project: Lucene - Core
  Issue Type: Improvement
  Components: core/store
    Reporter: Grant Ingersoll
    Assignee: Grant Ingersoll

 In some cases (logs, HTML pages with boilerplate, etc.), the stored fields for
 documents contain a lot of redundant information and end up wasting a lot of
 space across a large collection of documents. For instance, simply
 compressing a typical log file often results in 75% compression rates. We
 should explore mechanisms for applying compression across all the documents
 for a field (or fields) while still maintaining relatively fast lookup (that
 said, in most logging applications, fast retrieval of a given event is
 not always critical). For instance, perhaps it is possible to have a part of
 storage that contains the set of unique values for all the fields, so that the
 document field value simply contains a reference (could be as small as a few
 bits, depending on the number of unique items) to that value instead of
 a full copy. Extending this, perhaps we can leverage some existing
 compression capabilities in Java to provide this as well.
 It may make sense to implement this as a Directory, but it might also make
 sense as a Codec, if and when we have support for changing storage Codecs.
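
The "set of unique values plus a small reference per document" idea in the description can be sketched roughly as follows. This is only an illustration under stated assumptions, not anything from Lucene: the class name DictionaryStoredField and all of its methods are hypothetical, and the references are kept in a plain list rather than actually bit-packed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: store each unique field value once and keep a small ordinal per document. */
final class DictionaryStoredField {
    private final Map<String, Integer> ordinals = new HashMap<>(); // value -> ordinal
    private final List<String> values = new ArrayList<>();         // ordinal -> value
    private final List<Integer> docToOrdinal = new ArrayList<>();  // docId -> ordinal

    /** Adds the field value of the next document and returns its docId. */
    int add(String value) {
        Integer ord = ordinals.get(value);
        if (ord == null) {
            // First time this value is seen: assign the next ordinal.
            ord = values.size();
            ordinals.put(value, ord);
            values.add(value);
        }
        docToOrdinal.add(ord);
        return docToOrdinal.size() - 1;
    }

    /** Looks up the stored value of a document through its reference. */
    String get(int docId) {
        return values.get(docToOrdinal.get(docId));
    }

    /** Number of distinct values stored so far. */
    int uniqueValues() {
        return values.size();
    }

    /** Bits each reference would need if the ordinals were bit-packed (at least 1). */
    int bitsPerReference() {
        int n = values.size();
        return n <= 1 ? 1 : 32 - Integer.numberOfLeadingZeros(n - 1);
    }
}
```

With a field that only ever holds a handful of distinct values (a log level, say), each document would need just a couple of bits for its reference, which is the saving the description hints at. A field with mostly unique values gains nothing from this layout and pays for the extra indirection.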

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-2810) Explore Alternate Stored Field approaches for highly redundant data

2012-08-28 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443679#comment-13443679
 ] 

Adrien Grand commented on LUCENE-2810:
--

Oops, I didn't know of this issue when I opened LUCENE-4226. It tries to solve 
a very similar issue I think!


[jira] Commented: (LUCENE-2810) Explore Alternate Stored Field approaches for highly redundant data

2010-12-13 Thread Grant Ingersoll (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970868#action_12970868
 ] 

Grant Ingersoll commented on LUCENE-2810:
-

{quote}
Providing general-purpose compression with reasonable random access seems
redundant; modern filesystems will do this for you transparently (e.g. with NTFS
you just tell it that the .fdt should be compressed).
{quote}

This may be valid for some people, but not everyone has the ability to tell 
their admin (or their downstream users) to turn on compression for a particular 
file.



[jira] Commented: (LUCENE-2810) Explore Alternate Stored Field approaches for highly redundant data

2010-12-13 Thread Robert Muir (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970870#action_12970870
]

Robert Muir commented on LUCENE-2810:
-

bq. Where in my email did I say that users had to use it?

We didn't force users to use the old compression either. But there are even
emails on the user lists from someone asking "where did compressed fields go";
we explained why they were removed, and sure enough they reported back that
compression had only made their data larger and slower.

So, I'm not sure we should add something so app-dependent to Lucene's core, as
it depends very heavily on the content you are indexing.
If people see compression in the core APIs, they are going to assume that it
works well in the general-purpose case, but I'm trying to say
that's very tricky to do.

A trivial example:
case 1: perhaps your documents have many fields, all redundant with each other.
case 2: this is very different from documents that have only one field that's
heavily redundant while the rest are not, e.g. nearly unique metadata.

For these two use cases you need to implement the "compression"/layout
completely differently or you only introduce waste. With many fields
and the wrong block size, you just make things bigger and it acts like Compression
1.0 all over again.
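
The block-size point can be illustrated with a small measurement. The sketch below is an assumption-laden example, not the old compressed-fields code or any Lucene API: the class BlockSizeSketch, its methods, and the made-up log-like values are all hypothetical. It uses only java.util.zip.Deflater to compare compressing each short value on its own against compressing all values as a single block.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

/** Hypothetical sketch: compare compressing each stored value alone with compressing one block. */
final class BlockSizeSketch {

    /** Returns the zlib-compressed length of data at the default compression level. */
    static int deflatedLength(byte[] data) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(data);
            deflater.finish();
            byte[] buffer = new byte[4096];
            int total = 0;
            // Drain the compressor; only the output length matters here.
            while (!deflater.finished()) {
                total += deflater.deflate(buffer);
            }
            return total;
        } finally {
            deflater.end();
        }
    }

    /** Returns {raw bytes, bytes with each value compressed alone, bytes with all values in one block}. */
    static long[] measure(int docCount) {
        StringBuilder block = new StringBuilder();
        long raw = 0;
        long perValue = 0;
        for (int i = 0; i < docCount; i++) {
            String value = "id=" + i + " status=OK\n"; // made-up log-like value
            byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
            raw += bytes.length;
            perValue += deflatedLength(bytes);
            block.append(value);
        }
        long oneBlock = deflatedLength(block.toString().getBytes(StandardCharsets.UTF_8));
        return new long[] {raw, perValue, oneBlock};
    }

    public static void main(String[] args) {
        long[] sizes = measure(1000);
        System.out.println("raw=" + sizes[0] + " perValue=" + sizes[1] + " oneBlock=" + sizes[2]);
    }
}
```

Each short value carries the fixed zlib header and checksum and offers no redundancy within itself, so compressing values one at a time comes out larger than the raw data, while the single block can exploit the repetition across values. The price of the large block is that reading one document means decompressing its whole block, which is why the right layout depends so heavily on the data and the access pattern.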


[jira] Commented: (LUCENE-2810) Explore Alternate Stored Field approaches for highly redundant data

2010-12-13 Thread Robert Muir (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970872#action_12970872
]

Robert Muir commented on LUCENE-2810:
-

bq. This may be valid for some people, but not everyone has the ability to tell
their admin (or their downstream users) to turn on compression for a particular
file.

In this case you can do it automatically via an API as a normal user (e.g. from
your Directory).


[jira] Commented: (LUCENE-2810) Explore Alternate Stored Field approaches for highly redundant data

2010-12-13 Thread Robert Muir (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970882#action_12970882
]

Robert Muir commented on LUCENE-2810:
-

bq. Again, you seem to be hung up on the word compression, so let's stop using
it. I'm not necessarily talking about compression here, OK

To me it's compression either way; you can call it data deduplication if you
want (modern filesystems do this too!).

bq. especially since retrieving stored fields is almost always one of the
biggest performance killers in real world applications.

I haven't had this experience; please don't try to generalize for everyone.


[jira] Commented: (LUCENE-2810) Explore Alternate Stored Field approaches for highly redundant data

2010-12-13 Thread Robert Muir (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970889#action_12970889
]

Robert Muir commented on LUCENE-2810:
-

bq. and the ability to reorder/change storage would be beneficial.

Right, I agree with the general ability. What I am concerned with is any
concrete implementation, as I believe that to be very app-specific.

In other words, we should make the storage flexible in general, definitely!
This is completely unrelated to data redundancy; it's just something we should
do so that users can more easily do what makes sense for their app.

But I'm not certain we should even provide the fundamental building blocks for
compression/deduplication. This gets complicated fast (e.g. patented algorithms
and cryptographic hash functions), never mind some concrete implementation
that puts these together in anything close to a general way.

Other libraries likely provide this support better than we ever could. For
Lucene, I think the focus shouldn't have anything to do with data redundancy in
particular, but rather on making the storage API general enough that everyone's
needs are met, not just your log file needs.


[jira] Commented: (LUCENE-2810) Explore Alternate Stored Field approaches for highly redundant data

2010-12-13 Thread Grant Ingersoll (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970892#action_12970892
]

Grant Ingersoll commented on LUCENE-2810:
-

bq. Other libraries likely provide this support better than we ever could, for
lucene i think the focus shouldn't have anything to do with data redundancy in
particular but just making the storage API in general so that everyone's needs
are met, not just your log file needs.

Totally agree. I haven't said much about implementation yet. Perhaps this
alternate implementation merely bakes in, via a contrib/module, one of
these libraries under the hood. As with all of this, documentation is the
key.

This isn't limited to log file needs, though that is probably
one of the most extreme use cases where an alternate storage mechanism would be
beneficial.

