[
https://issues.apache.org/jira/browse/LUCENE-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adrien Grand updated LUCENE-4226:
---------------------------------
Attachment: LUCENE-4226.patch
New version of the patch. It contains a few enhancements:
- Merge optimization: whenever possible the StoredFieldsFormat tries to copy
compressed data instead of decompressing it into a buffer before compressing
it back to an index output,
- New options for the stored fields index: there are 3 strategies that allow
different memory/perf trade-offs:
** leaving it fully on disk (same as Lucene40, relying on the O/S cache),
** loading the position of the start of the chunk for every document into
memory (requires up to 8 * numDocs bytes, no disk access),
** loading the position of the start of the chunk and the first doc ID it
contains for every chunk (requires up to 12 * numChunks bytes, no disk access,
interesting if you have large chunks of compressed data).
- Improved memory usage and compression ratio (but a little slower) for
CompressionMode.FAST (using packed ints).
- Try to save 1 byte per field by storing the field number and the field type
bits together.
- More tests.
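As a rough illustration of the "field number and bits together" point (a sketch only, not the code from the patch: the class name, the method names and the choice of 3 type bits are assumptions made for the example), the two values can share one variable-length integer:

```java
// Hypothetical helper, for illustration only.
final class FieldHeader {
    // Assumption: 3 bits are enough to encode the field type.
    static final int TYPE_BITS = 3;
    static final long TYPE_MASK = (1L << TYPE_BITS) - 1;

    // Stores the field number in the high bits and the type in the low bits.
    static long pack(int fieldNumber, int typeBits) {
        if (fieldNumber < 0 || typeBits < 0 || typeBits > TYPE_MASK) {
            throw new IllegalArgumentException("value out of range");
        }
        return ((long) fieldNumber << TYPE_BITS) | typeBits;
    }

    static int fieldNumber(long packed) {
        return (int) (packed >>> TYPE_BITS);
    }

    static int typeBits(long packed) {
        return (int) (packed & TYPE_MASK);
    }

    // Number of bytes needed by a variable-length encoding that stores
    // 7 payload bits per byte.
    static int vLongSize(long value) {
        int size = 1;
        while ((value & ~0x7FL) != 0) {
            value >>>= 7;
            size++;
        }
        return size;
    }
}
```

Under these assumptions, any field number below 16 packs together with its type into a single variable-length byte, whereas writing the number and the type separately takes at least two bytes.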
So in the end, this StoredFieldsFormat tries to make disk seeks less likely by:
- giving the ability to load the stored fields index into memory (you never
need to seek to find the position of the chunk that contains your document),
- reducing the size of the fields data file (.fdt) so that the O/S cache can
cache more documents.
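To make the in-memory index idea more concrete, here is a minimal sketch of the per-chunk strategy (first doc ID plus start position for every chunk). It is not the code from the patch: the class and method names are hypothetical, and it simply binary-searches two parallel arrays.

```java
import java.util.Arrays;

// Hypothetical in-memory chunk index, for illustration only.
final class ChunkIndex {
    private final int[] docBases;       // first doc ID of each chunk, strictly ascending (4 bytes per chunk)
    private final long[] startPointers; // offset of each chunk in the fields data file (8 bytes per chunk)

    ChunkIndex(int[] docBases, long[] startPointers) {
        if (docBases.length != startPointers.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        this.docBases = docBases.clone();
        this.startPointers = startPointers.clone();
    }

    // Returns the start position of the chunk that contains docId,
    // without touching the disk.
    long startPointer(int docId) {
        if (docBases.length == 0 || docId < docBases[0]) {
            throw new IllegalArgumentException("no chunk contains doc " + docId);
        }
        int idx = Arrays.binarySearch(docBases, docId);
        if (idx < 0) {
            // Not a chunk boundary: binarySearch returned -(insertionPoint) - 1,
            // and the chunk we want is the one just before the insertion point.
            idx = -idx - 2;
        }
        return startPointers[idx];
    }

    // Payload size only (array headers ignored): 12 bytes per chunk.
    long payloadBytes() {
        return 12L * docBases.length;
    }
}
```

With chunks starting at doc IDs 0, 6 and 12, looking up doc 7 resolves to the second chunk's start position with a single binary search and no seek in the index file.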
Out of curiosity, I tested whether it could be faster for LZ4 to use
intermediate buffers for compression and/or decompression, and it is slower
than accessing the index input/output directly (at least with MMapDirectory).
I hope I'll have something committable soon.
> Efficient compression of small to medium stored fields
> ------------------------------------------------------
>
> Key: LUCENE-4226
> URL: https://issues.apache.org/jira/browse/LUCENE-4226
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Trivial
> Fix For: 4.1, 5.0
>
> Attachments: CompressionBenchmark.java, CompressionBenchmark.java,
> LUCENE-4226.patch, LUCENE-4226.patch, LUCENE-4226.patch, LUCENE-4226.patch,
> SnappyCompressionAlgorithm.java
>
>
> I've been doing some experiments with stored fields lately. It is very common
> for an index with stored fields enabled to have most of its space used by the
> .fdt index file. To prevent this .fdt file from growing too much, one option
> is to compress stored fields. Although compression works rather well for
> large fields, this is not the case for small fields and the compression ratio
> can be very close to 100%, even with efficient compression algorithms.
> In order to improve the compression ratio for small fields, I've written a
> {{StoredFieldsFormat}} that compresses several documents in a single chunk of
> data. To see how it behaves in terms of document deserialization speed and
> compression ratio, I've run several tests with different index compression
> strategies on 100,000 docs from Mike's 1K Wikipedia articles (title and text
> were indexed and stored):
> - no compression,
> - docs compressed with deflate (compression level = 1),
> - docs compressed with deflate (compression level = 9),
> - docs compressed with Snappy,
> - using the compressing {{StoredFieldsFormat}} with deflate (level = 1) and
> chunks of 6 docs,
> - using the compressing {{StoredFieldsFormat}} with deflate (level = 9) and
> chunks of 6 docs,
> - using the compressing {{StoredFieldsFormat}} with Snappy and chunks of 6
> docs.
> For those who don't know Snappy, it is a compression algorithm from Google
> whose compression ratios are fairly high (its output is larger than
> deflate's), but which compresses and decompresses data very quickly.
> {noformat}
> Format           Compression ratio     IndexReader.document time
> ----------------------------------------------------------------
> uncompressed     100%                  100%
> doc/deflate 1     59%                  616%
> doc/deflate 9     58%                  595%
> doc/snappy        80%                  129%
> index/deflate 1   49%                  966%
> index/deflate 9   46%                  938%
> index/snappy      65%                  264%
> {noformat}
> (doc = doc-level compression, index = index-level compression)
> I find it interesting because it allows trading speed for space (with
> deflate, the .fdt file shrinks by a factor of 2, much better than with
> doc-level compression). One other interesting thing is that {{index/snappy}}
> is almost as compact as {{doc/deflate}} while it is more than 2x faster at
> retrieving documents from disk.
> These tests have been done on a hot OS cache, which is the worst case for
> compressed fields (one can expect better results for formats that have a high
> compression ratio since they probably require fewer read/write operations
> from disk).
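
The chunking idea from the description above (compressing several small documents together so that they can share redundancy) can be tried out with the JDK's own deflate implementation. This is only a sketch under stated assumptions: it uses java.util.zip.Deflater rather than the code from the patch or the attached benchmark, and the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

// Hypothetical demo class, for illustration only.
final class ChunkCompressionDemo {
    // Returns the number of bytes produced by deflating data at the given level.
    static int deflatedSize(byte[] data, int level) {
        Deflater deflater = new Deflater(level);
        try {
            deflater.setInput(data);
            deflater.finish();
            byte[] buffer = new byte[1024];
            int total = 0;
            while (!deflater.finished()) {
                total += deflater.deflate(buffer);
            }
            return total;
        } finally {
            deflater.end(); // release native resources
        }
    }

    // Doc-level compression: every document is compressed on its own.
    static int perDocSize(byte[][] docs, int level) {
        int total = 0;
        for (byte[] doc : docs) {
            total += deflatedSize(doc, level);
        }
        return total;
    }

    // Chunk-level compression: docsPerChunk consecutive documents are
    // concatenated and compressed together.
    static int chunkedSize(byte[][] docs, int docsPerChunk, int level) {
        if (docsPerChunk <= 0) {
            throw new IllegalArgumentException("docsPerChunk must be positive");
        }
        int total = 0;
        for (int start = 0; start < docs.length; start += docsPerChunk) {
            int end = Math.min(docs.length, start + docsPerChunk);
            ByteArrayOutputStream chunk = new ByteArrayOutputStream();
            for (int i = start; i < end; i++) {
                chunk.write(docs[i], 0, docs[i].length);
            }
            total += deflatedSize(chunk.toByteArray(), level);
        }
        return total;
    }
}
```

Comparing perDocSize(docs, 1) with chunkedSize(docs, 6, 1) on a set of short, similar documents gives a feel for how much per-document overhead chunking removes. It says nothing about retrieval speed, which is the other half of the trade-off measured in the table.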