[jira] [Commented] (LUCENE-4226) Efficient compression of small to medium stored fields

2012-10-03 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13468921#comment-13468921
 ] 

Robert Muir commented on LUCENE-4226:
-

I'm on the phone but I have some questions. Give me a few :)

> Efficient compression of small to medium stored fields
> --
>
> Key: LUCENE-4226
> URL: https://issues.apache.org/jira/browse/LUCENE-4226
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Trivial
> Fix For: 4.1, 5.0
>
> Attachments: CompressionBenchmark.java, CompressionBenchmark.java, 
> LUCENE-4226.patch, LUCENE-4226.patch, LUCENE-4226.patch, LUCENE-4226.patch, 
> LUCENE-4226.patch, SnappyCompressionAlgorithm.java
>
>
> I've been doing some experiments with stored fields lately. It is very common 
> for an index with stored fields enabled to have most of its space used by the 
> .fdt index file. To prevent this .fdt file from growing too much, one option 
> is to compress stored fields. Although compression works rather well for 
> large fields, this is not the case for small fields and the compression ratio 
> can be very close to 100%, even with efficient compression algorithms.
> In order to improve the compression ratio for small fields, I've written a 
> {{StoredFieldsFormat}} that compresses several documents in a single chunk of 
> data. To see how it behaves in terms of document deserialization speed and 
> compression ratio, I've run several tests with different index compression 
> strategies on 100,000 docs from Mike's 1K Wikipedia articles (title and text 
> were indexed and stored):
>  - no compression,
>  - docs compressed with deflate (compression level = 1),
>  - docs compressed with deflate (compression level = 9),
>  - docs compressed with Snappy,
>  - using the compressing {{StoredFieldsFormat}} with deflate (level = 1) and 
> chunks of 6 docs,
>  - using the compressing {{StoredFieldsFormat}} with deflate (level = 9) and 
> chunks of 6 docs,
>  - using the compressing {{StoredFieldsFormat}} with Snappy and chunks of 6 
> docs.
> For those who don't know Snappy, it is a compression algorithm from Google 
> that trades compression ratio for speed: it does not compress as densely as 
> deflate, but it compresses and decompresses data very quickly.
> {noformat}
> Format            Compression ratio   IndexReader.document time
> ----------------------------------------------------------------
> uncompressed            100%                   100%
> doc/deflate 1            59%                   616%
> doc/deflate 9            58%                   595%
> doc/snappy               80%                   129%
> index/deflate 1          49%                   966%
> index/deflate 9          46%                   938%
> index/snappy             65%                   264%
> {noformat}
> (doc = doc-level compression, index = index-level compression)
> I find it interesting because it allows trading speed for space (with 
> deflate, the .fdt file shrinks by a factor of 2, much better than with 
> doc-level compression). Another interesting result is that {{index/snappy}} 
> is almost as compact as {{doc/deflate}} while being more than 2x faster at 
> retrieving documents from disk.
> These tests were run with a hot OS cache, which is the worst case for 
> compressed fields (one can expect better results for formats with a high 
> compression ratio, since they probably require fewer read/write operations 
> from disk).
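
To make the chunking idea concrete, here is a minimal sketch of the approach, not the patch itself: it deflates several serialized documents as a single block so the compressor can exploit redundancy across documents (the class and method names are illustrative):

{code}
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

// Sketch of chunked stored-field compression: several small documents are
// deflated together so that they share one compression context.
final class ChunkCompressor {

  static byte[] compressChunk(byte[][] serializedDocs, int level) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    Deflater deflater = new Deflater(level);
    DeflaterOutputStream deflate = new DeflaterOutputStream(out, deflater);
    try {
      for (byte[] doc : serializedDocs) {
        deflate.write(doc); // all docs in the chunk share one context
      }
    } finally {
      deflate.close(); // finishes the deflate stream
      deflater.end();  // releases the compressor's resources
    }
    return out.toByteArray();
  }
}
{code}

A real format would additionally record each document's offset and length within the chunk so that a single document can be located after decompression.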

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4226) Efficient compression of small to medium stored fields

2012-10-03 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13468945#comment-13468945
 ] 

Adrien Grand commented on LUCENE-4226:
--

Oh, I didn't mean THAT quickly :-)




[jira] [Commented] (LUCENE-4226) Efficient compression of small to medium stored fields

2012-10-03 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13468949#comment-13468949
 ] 

Robert Muir commented on LUCENE-4226:
-

Shouldn't ByteArrayDataInput override skipBytes to just bump its 'pos'?
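
A minimal sketch of that override, using an illustrative stand-in class rather than Lucene's actual ByteArrayDataInput:

{code}
// Illustrative stand-in for an in-memory DataInput: since the bytes are
// already in a buffer, skipping is just moving the position cursor.
class ByteArrayInputSketch {
  private final byte[] bytes;
  private int pos;

  ByteArrayInputSketch(byte[] bytes) { this.bytes = bytes; }

  byte readByte() { return bytes[pos++]; }

  // O(1) skip: no per-byte reads, just bump 'pos' (assumes the skip
  // stays within the backing array).
  void skipBytes(long n) { pos += (int) n; }
}
{code}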

Can we plug in various schemes into MockRandomCodec?




[jira] [Commented] (LUCENE-4226) Efficient compression of small to medium stored fields

2012-10-03 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13468953#comment-13468953
 ] 

Robert Muir commented on LUCENE-4226:
-

OK, MockRandom is just postings now... I think we should have a MockRandomCodec 
too!




[jira] [Commented] (LUCENE-4226) Efficient compression of small to medium stored fields

2012-10-03 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13469008#comment-13469008
 ] 

Robert Muir commented on LUCENE-4226:
-

Yeah: I opened another issue to try to straighten this out. We can just bring 
these Frankenstein codecs up to speed there.




[jira] [Commented] (LUCENE-4226) Efficient compression of small to medium stored fields

2012-10-03 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13469079#comment-13469079
 ] 

Robert Muir commented on LUCENE-4226:
-

I'm not a fan of the skipBytes on DataInput. It's not actually necessary or 
used by this patch, is it?

And today DataInput is always forward-only; I don't like "may or may not be 
bidirectional depending on whether the impl throws UOE".

I removed it locally and just left it on IndexInput; I think this is cleaner.
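
On IndexInput the operation is cheap anyway, since a random-access input can skip with a relative seek. A minimal sketch, assuming only the public IndexInput API (the helper class is illustrative):

{code}
import java.io.IOException;
import org.apache.lucene.store.IndexInput;

// Sketch: a seekable input can skip bytes with a relative seek, which is
// why keeping skipBytes on IndexInput only is attractive.
final class SkipOnIndexInput {
  static void skipBytes(IndexInput in, long numBytes) throws IOException {
    in.seek(in.getFilePointer() + numBytes); // jump directly, no reads
  }
}
{code}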




[jira] [Commented] (LUCENE-4226) Efficient compression of small to medium stored fields

2012-10-04 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13469216#comment-13469216
 ] 

Adrien Grand commented on LUCENE-4226:
--

bq. I'm not a fan of the skipBytes on DataInput. It's not actually necessary 
or used by this patch, is it?

You're right! I needed it in the first versions of the patch when I reused 
Lucene40StoredFieldsFormat, but it looks like it's not needed anymore. Let's 
get rid of it!




[jira] [Commented] (LUCENE-4226) Efficient compression of small to medium stored fields

2012-10-05 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13470392#comment-13470392
 ] 

Adrien Grand commented on LUCENE-4226:
--

I just committed to trunk. I'll wait a couple of days to make sure the Jenkins 
builds pass before backporting to 4.x. By the way, would it be possible to 
have one of the Jenkins servers run the lucene-core tests with 
-Dtests.codec=Compressing for some time?




[jira] [Commented] (LUCENE-4226) Efficient compression of small to medium stored fields

2012-10-05 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13470509#comment-13470509
 ] 

Simon Willnauer commented on LUCENE-4226:
-

bq. By the way, would it be possible to have one of the Jenkins servers run 
the lucene-core tests with -Dtests.codec=Compressing for some time?

FYI - 
http://builds.flonkings.com/job/Lucene-trunk-Linux-Java6-64-test-only-compressed/




[jira] [Commented] (LUCENE-4226) Efficient compression of small to medium stored fields

2012-10-05 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13470699#comment-13470699
 ] 

Adrien Grand commented on LUCENE-4226:
--

Thanks, Simon!




[jira] [Commented] (LUCENE-4226) Efficient compression of small to medium stored fields

2012-10-16 Thread Radim Kolar (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477038#comment-13477038
 ] 

Radim Kolar commented on LUCENE-4226:
-

Is there an example config provided?




[jira] [Commented] (LUCENE-4226) Efficient compression of small to medium stored fields

2012-10-16 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477041#comment-13477041
 ] 

Simon Willnauer commented on LUCENE-4226:
-

@adrien I deleted the Jenkins job for this.




[jira] [Commented] (LUCENE-4226) Efficient compression of small to medium stored fields

2012-10-16 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477062#comment-13477062
 ] 

Adrien Grand commented on LUCENE-4226:
--

@radim You can have a look at CompressingCodec in lucene/test-framework.
@Simon OK, thanks!




[jira] [Commented] (LUCENE-4226) Efficient compression of small to medium stored fields

2012-08-29 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443866#comment-13443866
 ] 

Dawid Weiss commented on LUCENE-4226:
-

Very cool. I skimmed through the patch, though I didn't look too carefully. 
This caught my attention:
{code}
+  /**
+   * Skip over the next n bytes.
+   */
+  public void skipBytes(long n) throws IOException {
+for (long i = 0; i < n; ++i) {
+  readByte();
+}
+  }
{code}

You may want to use an array-based read here if there are a lot of skips: 
allocate a static, write-only buffer of 4 or 8 kB once and just reuse it. A 
loop over readByte() is nearly always a performance killer; I've been hit by 
this too many times to count.
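
Sketched below is what such a buffered default might look like, assuming a DataInput-style bulk readBytes(byte[], int, int); the buffer size and names are illustrative:

{code}
import java.io.IOException;

// Sketch of the buffered alternative: skip via bulk reads into a reusable
// scratch buffer instead of one readByte() call per skipped byte.
abstract class BufferedSkipDataInput {
  private static final int SKIP_BUFFER_SIZE = 8192;
  private final byte[] skipBuffer = new byte[SKIP_BUFFER_SIZE];

  abstract void readBytes(byte[] b, int offset, int len) throws IOException;

  void skipBytes(long numBytes) throws IOException {
    long remaining = numBytes;
    while (remaining > 0) {
      int step = (int) Math.min(SKIP_BUFFER_SIZE, remaining);
      readBytes(skipBuffer, 0, step); // bulk read; contents are discarded
      remaining -= step;
    }
  }
}
{code}

An instance buffer is used here rather than a truly static one; since the contents are discarded, a shared buffer would also work, but a per-instance field avoids cross-thread surprises.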

Also, 
lucene/core/src/java/org/apache/lucene/codecs/compressing/ByteArrayDataOutput.java
 -- there seems to be a class for this in
org.apache.lucene.store.ByteArrayDataOutput?





[jira] [Commented] (LUCENE-4226) Efficient compression of small to medium stored fields

2012-08-29 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443897#comment-13443897
 ] 

Eks Dev commented on LUCENE-4226:
-

bq. but I removed the ability to select the compression algorithm on a 
per-field basis in order to make the patch simpler and to handle cross-field 
compression.

Maybe it is worth keeping it there for really short fields. Those general 
compression algorithms are great for bigger amounts of data, but for really 
short fields there is nothing like per-field compression.
Think about database usage: e.g. fields with low cardinality, or fields with 
a restricted symbol set (only digits in a long UID field, for example). Say a 
zip code or a product color... such a field is perfectly compressed with a 
static-dictionary approach (a static Huffman coder with escape symbols, at 
the bit level, or plain-vanilla dictionary lookup), and both of those are 
insanely fast and compress heavily.

Even a trivial utility for users is easily doable: index the data without 
compression, get the frequencies from the term dictionary, estimate e.g. a 
static Huffman code table, and reindex with this dictionary.
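
For the plain-vanilla dictionary-lookup variant, a minimal sketch of the idea (illustrative names, not part of the patch):

{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of dictionary compression for a low-cardinality field (zip code,
// product color, ...): store a small integer id instead of the string.
final class DictionaryFieldCodec {
  private final Map<String, Integer> ids = new HashMap<String, Integer>();
  private final List<String> values = new ArrayList<String>();

  int encode(String value) {
    Integer id = ids.get(value);
    if (id == null) {
      id = values.size();
      ids.put(value, id);
      values.add(value);
    }
    return id; // with at most 256 distinct values, this fits in one byte
  }

  String decode(int id) {
    return values.get(id);
  }
}
{code}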





[jira] [Commented] (LUCENE-4226) Efficient compression of small to medium stored fields

2012-08-29 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443939#comment-13443939
 ] 

Adrien Grand commented on LUCENE-4226:
--

Thanks Dawid and Eks for your feedback!

bq. allocate a static, write-only buffer of 4 or 8kb once and just reuse it

Right, sounds like a better default impl!

bq. ByteArrayDataOutput.java – there seems to be a class for this in 
org.apache.lucene.store.ByteArrayDataOutput?

I wanted to reuse this class, but I needed something that would grow when 
necessary (oal.store.BADO just throws an exception when you try to write past 
the end of the buffer). I could have managed growth externally, based on checks 
on the buffer length and calls to ArrayUtil.grow and BADO.reset, but it was 
just as simple to rewrite a ByteArrayDataOutput that manages growth internally...
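
For illustration, a minimal sketch of what such a growing buffer could look 
like (hypothetical code, not the class from the patch; Lucene's ArrayUtil.grow 
could replace the hand-rolled doubling below):

{code:java}
import java.util.Arrays;

// Hypothetical sketch of a byte buffer with DataOutput-like methods that
// grows on demand instead of throwing when writing past the end.
public class GrowableByteArrayDataOutput {
  private byte[] bytes;
  private int length;

  public GrowableByteArrayDataOutput(int initialCapacity) {
    this.bytes = new byte[initialCapacity];
  }

  public void writeByte(byte b) {
    ensureCapacity(length + 1);
    bytes[length++] = b;
  }

  public void writeBytes(byte[] src, int off, int len) {
    ensureCapacity(length + len);
    System.arraycopy(src, off, bytes, length, len);
    length += len;
  }

  // Doubles the backing array when needed (ArrayUtil.grow would do this
  // with a smarter growth policy).
  private void ensureCapacity(int minCapacity) {
    if (minCapacity > bytes.length) {
      bytes = Arrays.copyOf(bytes, Math.max(minCapacity, bytes.length * 2));
    }
  }

  public int getPosition() {
    return length;
  }

  // Reuse the same backing buffer across documents.
  public void reset() {
    length = 0;
  }
}
{code}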

bq. Maybe it is worth keeping it there for really short fields. General-purpose 
compression algorithms are great for larger amounts of data, but for really 
short fields there is nothing like per-field compression. Think about 
database-style usage, e.g. fields with low cardinality or with a restricted 
symbol set (only digits in a long UID field, for example). Something like a zip 
code or a product color is compressed perfectly by a static dictionary approach 
(a static Huffman coder with escape symbols, at bit level, or a plain vanilla 
dictionary lookup), and both are insanely fast and compress heavily.

Right, this is exactly why I implemented per-field compression first. Both 
per-field and cross-field compression have pros and cons. Cross-field 
compression allows less fine-grained tuning, but on the other hand it would 
probably be a better default since the compression ratio would be better out of 
the box. Maybe we should implement both?

I was also thinking that some codecs (this kind of per-field compression, but 
maybe also the bloom, memory, direct and pulsing postings formats) might 
deserve a separate "codecs" module where we could put these non-default 
"expert" codecs.


[jira] [Commented] (LUCENE-4226) Efficient compression of small to medium stored fields

2012-08-29 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443962#comment-13443962
 ] 

Robert Muir commented on LUCENE-4226:
-

{quote}
I was also thinking that some codecs (this kind of per-field compression, but 
maybe also the bloom, memory, direct and pulsing postings formats) might 
deserve a separate "codecs" module where we could put these non-default 
"expert" codecs.
{quote}

We have to do something about this soon!

Do you want to open a separate issue for that (it need not block this issue)?

I think we would try to get everything concrete we can out of core immediately
(maybe keeping only the default codec for that release), but use the other
ones for testing. Still, we should think about it.





[jira] [Commented] (LUCENE-4226) Efficient compression of small to medium stored fields

2012-08-29 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443996#comment-13443996
 ] 

Adrien Grand commented on LUCENE-4226:
--

bq. Do you want to open a separate issue for that (it need not block this 
issue)?

I created LUCENE-4340.




[jira] [Commented] (LUCENE-4226) Efficient compression of small to medium stored fields

2012-08-30 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445022#comment-13445022
 ] 

David Smiley commented on LUCENE-4226:
--

I just have a word of encouragement -- this is awesome!  Keep up the good work, 
Adrien.




[jira] [Commented] (LUCENE-4226) Efficient compression of small to medium stored fields

2012-09-10 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13451867#comment-13451867
 ] 

Adrien Grand commented on LUCENE-4226:
--

Otis shared a link to this issue on Twitter 
(https://twitter.com/otisg/status/244996292743405571), and some people seem to 
wonder how it compares to ElasticSearch's block compression.

ElasticSearch's block compression uses a similar idea: data is compressed into 
blocks (with fixed sizes that are independent of document sizes). It is based 
on a CompressedIndexInput/CompressedIndexOutput pair: upon closing, 
CompressedIndexOutput writes a metadata table at the end of the wrapped output 
that contains the start offset of every compressed block. Upon creation, a 
CompressedIndexInput first loads this metadata table into memory and can then 
use it whenever it needs to seek. This is probably the best way to compress 
small docs with Lucene 3.x.

With this patch, the size of blocks is not completely independent of document 
sizes: I make sure that documents don't spread across compressed blocks, so 
that reading a document never requires more than one block to be uncompressed. 
Moreover, the LZ4 uncompressor (used by FAST and FAST_UNCOMPRESSION) can stop 
uncompressing as soon as it has uncompressed enough data. So unless you need 
the last document of a compressed block, it is very likely that the 
uncompressor won't uncompress the whole block before returning.

Therefore I expect this StoredFieldsFormat to have a similar compression ratio 
to ElasticSearch's block compression (provided that similar compression 
algorithms are used) but to be a little faster at loading documents from disk.
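
To make the comparison concrete, here is a schematic sketch of the two ideas 
(hypothetical names, not the actual classes from the patch or from 
ElasticSearch): a seek table that records where each compressed chunk starts, 
and chunks that are only flushed at document boundaries so that decompressing 
one chunk is always enough to read any document stored in it.

{code:java}
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;

// Schematic sketch with hypothetical names: documents are buffered and a
// chunk is flushed only at a document boundary, so a document never spans
// two compressed chunks. Chunk start offsets go into a seek table.
public class ChunkedStoredFieldsWriter {
  private static final int CHUNK_SIZE = 16 * 1024; // flush threshold in bytes

  private final ByteArrayOutputStream pending = new ByteArrayOutputStream();
  private final ByteArrayOutputStream out = new ByteArrayOutputStream();
  private final List<Integer> chunkOffsets = new ArrayList<>(); // seek table

  // Buffers one serialized document; it is never split across chunks.
  public void addDocument(byte[] serializedDoc) throws IOException {
    pending.write(serializedDoc);
    if (pending.size() >= CHUNK_SIZE) {
      flushChunk(); // chunk boundaries always coincide with doc boundaries
    }
  }

  // Compresses everything buffered so far as a single chunk.
  private void flushChunk() {
    chunkOffsets.add(out.size()); // where this chunk starts in the output
    Deflater deflater = new Deflater(Deflater.BEST_SPEED);
    deflater.setInput(pending.toByteArray());
    deflater.finish();
    byte[] buf = new byte[4096];
    while (!deflater.finished()) {
      out.write(buf, 0, deflater.deflate(buf));
    }
    deflater.end();
    pending.reset();
  }

  // Flushes trailing documents; a real implementation would also append
  // the seek table (chunkOffsets) at the end of the output.
  public void close() {
    if (pending.size() > 0) {
      flushChunk();
    }
  }
}
{code}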


[jira] [Commented] (LUCENE-4226) Efficient compression of small to medium stored fields

2013-03-22 Thread Commit Tag Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13610701#comment-13610701
 ] 

Commit Tag Bot commented on LUCENE-4226:


[branch_4x commit] Adrien Grand
http://svn.apache.org/viewvc?view=revision&revision=1395491

LUCENE-4226: Efficient stored fields compression (merged from r1394578).


