[ https://issues.apache.org/jira/browse/LUCENE-6115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adrien Grand updated LUCENE-6115:
---------------------------------
    Attachment: LUCENE-6115.patch

Here is a patch. Here are the differences with today:
- when doing random access, the header is always completely decoded (while we previously sometimes stopped eagerly)
- when merging, we make sure to decompress a given block of documents only a single time
- checksums are verified up-front instead of on-the-fly (like most other formats, actually)
- I made the logic for large blocks more robust. Up to now, if you stored a document so large that the block grew beyond 2x the chunk size, it was split into 16KB (exact, this time) slices in order to make decompression a bit more memory-efficient. The reader used to duplicate the writer's logic (if block_len >= 2 * chunk_size), but this information is now encoded in the stream.

I did some benchmarking and there were no significant differences in terms of indexing speed or read speed, so having to read the whole header every time does not seem to hurt (the bottleneck is likely decompressing documents). I tried removing the specialized merging to see if it was still needed, and unfortunately it seems so: I got merging times that were about 20% slower without it. (In that case, specialized merging still decompresses and recompresses all the time; it only saves some decoding and directly reuses the serialized bytes of each document.)

> Add getMergeInstance to CompressingStoredFieldsReader
> -----------------------------------------------------
>
>                 Key: LUCENE-6115
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6115
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-6115.patch
>
>
> CompressingStoredFieldsReader is currently terrible at merging with different
> codecs or wrapped readers since it does not keep state.
> So if you want to get 5 documents that come from the same block, you will
> have to decode the block header and decompress the block 5 times. It has some
> optimizations so that if you want to get the 2nd doc of the block, it will
> stop decompressing soon after the 2nd document, but that doesn't help much
> with merging since we want all documents.
> We should implement getMergeInstance and have a different behaviour when
> merging, by decompressing everything up-front and then reusing it for all
> documents of the block.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
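A minimal sketch of the "encode the slicing decision in the stream" fix described in the comment above. The `writeHeader`/`readHeader` helpers and the token layout are hypothetical illustrations, not Lucene's actual on-disk format: the point is only that the writer records the flag next to the document count, so the reader no longer has to re-derive it from `block_len >= 2 * chunk_size` and can never disagree with the writer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class SlicedFlagSketch {

    // Writer side: pack the doc count and a "this block was sliced" bit
    // into one token, so the decision travels with the data.
    static void writeHeader(DataOutputStream out, int docCount, boolean sliced)
            throws IOException {
        out.writeInt((docCount << 1) | (sliced ? 1 : 0));
    }

    // Reader side: decode the flag directly instead of duplicating the
    // writer's size-based heuristic. Returns {docCount, slicedBit}.
    static int[] readHeader(DataInputStream in) throws IOException {
        int token = in.readInt();
        return new int[] { token >>> 1, token & 1 };
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        writeHeader(new DataOutputStream(buf), 42, true);
        DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
        int[] header = readHeader(in);
        System.out.println("docCount=" + header[0] + " sliced=" + (header[1] == 1));
        // prints: docCount=42 sliced=true
    }
}
```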
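The getMergeInstance idea from the quoted description can be sketched as a toy simulation. This is not Lucene's API; the class, method names, and use of `java.util.zip` are illustrative assumptions. A stateless reader pays for a full block decompression on every document request, while a merge instance keeps the last decompressed block and reuses it for every document of that block:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class MergeInstanceSketch {
    private final byte[][] compressedBlocks;
    // Merge-instance state: the last block we decompressed.
    private int cachedBlock = -1;
    private byte[] cachedBytes;
    int decompressions = 0; // how many times we actually inflated a block

    MergeInstanceSketch(byte[][] compressedBlocks) {
        this.compressedBlocks = compressedBlocks;
    }

    // Stateless path: every document access decompresses its whole block.
    byte[] documentNoState(int block) throws Exception {
        decompressions++;
        return inflate(compressedBlocks[block]);
    }

    // Merge path: decompress each block once up-front, then reuse the buffer
    // for all documents of that block.
    byte[] documentForMerge(int block) throws Exception {
        if (block != cachedBlock) {
            cachedBytes = inflate(compressedBlocks[block]);
            cachedBlock = block;
            decompressions++;
        }
        return cachedBytes;
    }

    static byte[] deflate(byte[] data) {
        Deflater d = new Deflater();
        d.setInput(data);
        d.finish();
        byte[] out = new byte[data.length + 64];
        int n = d.deflate(out);
        d.end();
        return Arrays.copyOf(out, n);
    }

    static byte[] inflate(byte[] data) throws Exception {
        Inflater inf = new Inflater();
        inf.setInput(data);
        byte[] out = new byte[1 << 16];
        int n = inf.inflate(out);
        inf.end();
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) throws Exception {
        byte[] block = "doc0 doc1 doc2 doc3 doc4".getBytes(StandardCharsets.UTF_8);
        byte[][] blocks = { deflate(block) };

        MergeInstanceSketch stateless = new MergeInstanceSketch(blocks);
        for (int doc = 0; doc < 5; doc++) stateless.documentNoState(0);
        System.out.println("stateless decompressions: " + stateless.decompressions);
        // prints: stateless decompressions: 5

        MergeInstanceSketch merge = new MergeInstanceSketch(blocks);
        for (int doc = 0; doc < 5; doc++) merge.documentForMerge(0);
        System.out.println("merge-instance decompressions: " + merge.decompressions);
        // prints: merge-instance decompressions: 1
    }
}
```

The merge instance trades a little per-reader state for a 5x reduction in decompression work in this toy case, which is the behaviour the issue asks for.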