[ 
https://issues.apache.org/jira/browse/LUCENE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-5081:
---------------------------------

    Attachment: LUCENE-5081.patch

Here is an implementation of a compressed doc ID set. Although it is immutable, 
only supports sequential access and requires doc IDs to be provided in order at 
building time, it supports fast iteration (nextDoc), skipping (advance), union 
and intersection. The detailed format is a bit complex (see javadocs), but the 
rough idea is to store large sequences of null bytes as a VInt and sequences of 
non-null bytes as-is. This implementation can compress data as soon as it can 
find more than 2 consecutive null bytes. Moreover, even incompressible sets 
only require a few bytes more than FixedBitSet.

I ran a quick benchmark to measure the size (as reported by RamUsageEstimator) 
depending on the percentage of bits set on a bit set containing 2<sup>23</sup> 
elements (FixedBitSet requires 1MB) as well as the time required to iterate 
over all document IDs compared to FixedBitSet:
||Percentage of bits set||Size||Iteration time/FixedBitSet iteration time||
|0.001%|360 bytes|0.01|
|0.01%|2.8KB|0.10|
|0.1%|23.8 KB|0.38|
|1%|187.7 KB|0.80|
|10%|864 KB|1.3|
|50%|1 MB|2.5|
|90%|1 MB|2.3|
|100%|1 MB|1.7|

Even in the worst case, memory usage exceeds the memory usage of FixedBitSet by 
a few bytes, and iteration is 2.5 times slower.

The patch includes the set implementation but it is not used anywhere yet. I 
was thinking about using it automatically instead of FixedBitSet in 
CachingWrapperFilter but it looks like ToParentBlockJoinQuery expects to get a 
FixedBitSet from the cache.
                
> Compress doc ID sets
> --------------------
>
>                 Key: LUCENE-5081
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5081
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-5081.patch
>
>
> Our filters use bit sets a lot to store document IDs. However, it is likely 
> that most of them are sparse hence easily compressible. Having efficient 
> compressed sets would allow for caching more data.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to