[ https://issues.apache.org/jira/browse/LUCENE-6115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-6115:
---------------------------------
    Attachment: LUCENE-6115.patch

Here is a patch. Here are the differences with today:
 - when doing random access, the header is always decoded completely (while we 
previously sometimes stopped early)
 - when merging, we make sure to decompress a given block of documents only 
once (see the sketch after this list)
 - checksums are verified up-front instead of on-the-fly (which is actually 
what most other formats do)
 - I made the logic for large blocks more robust. Up to now, if you stored a 
document so large that the block would grow beyond 2x the chunk size, the 
block was split into slices of exactly 16KB in order to make decompression a 
bit more memory-efficient. The reader used to duplicate the writer's logic 
(checking whether block_len >= 2 * chunk_size), but this information is now 
encoded in the stream instead.
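
To make the merging point concrete, here is a minimal, self-contained sketch 
of the idea (all names here are hypothetical, this is not the actual 
CompressingStoredFieldsReader code): the merge instance keeps the last decoded 
block as state, so the sequential docID access pattern of a merge decompresses 
each block exactly once.

{code:java}
/**
 * Hedged sketch of the getMergeInstance idea, not Lucene's actual code:
 * a stateful per-merge reader that caches the last decoded block so that
 * each block is decompressed once even when every document is requested.
 */
public class SketchMergeReader {

  /** One decoded block: its first docID and the per-doc slices. */
  static final class Block {
    final int firstDoc;
    final byte[][] docs; // decompressed documents of the block

    Block(int firstDoc, byte[][] docs) {
      this.firstDoc = firstDoc;
      this.docs = docs;
    }

    boolean contains(int docID) {
      return docID >= firstDoc && docID < firstDoc + docs.length;
    }
  }

  private final byte[][][] compressedBlocks; // stand-in for the on-disk chunks
  private final int docsPerBlock;
  private Block current;  // state kept between calls: the cached decoded block
  int decompressions = 0; // just to show the effect in main()

  SketchMergeReader(byte[][][] compressedBlocks, int docsPerBlock) {
    this.compressedBlocks = compressedBlocks;
    this.docsPerBlock = docsPerBlock;
  }

  /** During a merge docIDs arrive in order, so each block is decoded once. */
  byte[] document(int docID) {
    if (current == null || !current.contains(docID)) {
      int blockID = docID / docsPerBlock;
      decompressions++;
      // Stand-in for decoding the header and decompressing the whole chunk.
      byte[][] docs = compressedBlocks[blockID].clone();
      current = new Block(blockID * docsPerBlock, docs);
    }
    return current.docs[docID - current.firstDoc];
  }

  public static void main(String[] args) {
    byte[][][] blocks = {
      { "doc0".getBytes(), "doc1".getBytes(), "doc2".getBytes() },
      { "doc3".getBytes(), "doc4".getBytes(), "doc5".getBytes() },
    };
    SketchMergeReader reader = new SketchMergeReader(blocks, 3);
    for (int doc = 0; doc < 6; doc++) {
      reader.document(doc); // sequential access, as in a merge
    }
    // 6 documents read, but only 2 "decompressions": one per block.
    System.out.println(reader.decompressions + " decompressions for 6 docs");
  }
}
{code}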

I did some benchmarking and found no significant difference in indexing or 
read speed, so having to read the whole header every time does not seem to 
hurt (the bottleneck is likely decompressing the documents). I also tried 
removing the specialized merging to see whether it is still needed, and 
unfortunately it seems so: merging times were about 20% slower without it. 
(Even in that case, specialized merging still decompresses and recompresses 
all the time; it only saves some decoding by reusing the serialized bytes of 
each document directly, as sketched below.)
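
For context, here is a tiny hedged sketch of that raw-bytes reuse (again with 
made-up names, not Lucene's stored-fields API): the generic path decodes each 
document into fields and re-encodes them, while the specialized merge path 
copies the already-serialized bytes straight into the destination block.

{code:java}
import java.nio.charset.StandardCharsets;

/**
 * Hedged sketch of why specialized merging helps even though it still
 * decompresses and recompresses every block: it copies each document's
 * serialized bytes verbatim instead of decoding fields and re-encoding
 * them. Everything here is hypothetical, not Lucene's actual code.
 */
public class RawCopySketch {

  // Generic path: parse the serialized doc into fields, then re-serialize.
  static byte[] decodeThenReencode(byte[] serializedDoc) {
    String[] fields =
        new String(serializedDoc, StandardCharsets.UTF_8).split(";");
    StringBuilder out = new StringBuilder();
    for (String field : fields) { // per-field decode + encode work
      out.append(field).append(';');
    }
    return out.toString().getBytes(StandardCharsets.UTF_8);
  }

  // Specialized-merge path: the bytes are already in the right format,
  // so just copy them into the destination block before recompression.
  static byte[] rawCopy(byte[] serializedDoc) {
    return serializedDoc.clone(); // no field decoding at all
  }

  public static void main(String[] args) {
    byte[] doc = "title=foo;body=bar;".getBytes(StandardCharsets.UTF_8);
    // Both paths produce the same bytes; the raw copy just skips the work.
    System.out.println(new String(decodeThenReencode(doc), StandardCharsets.UTF_8));
    System.out.println(new String(rawCopy(doc), StandardCharsets.UTF_8));
  }
}
{code}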

> Add getMergeInstance to CompressingStoredFieldsReader
> -----------------------------------------------------
>
>                 Key: LUCENE-6115
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6115
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-6115.patch
>
>
> CompressingStoredFieldsReader is currently terrible at merging with different 
> codecs or wrapped readers since it does not keep state. So if you want to get 
> 5 documents that come from the same block, you have to decode the block 
> header and decompress the block 5 times. It has some optimizations so that if 
> you want to get the 2nd doc of the block, it will stop decompressing soon 
> after the 2nd document, but that doesn't help much with merging since we want 
> all documents.
> We should implement getMergeInstance and behave differently when merging: 
> decompress everything up-front and then reuse it for all documents of the 
> block.


