[jira] [Commented] (LUCENE-10233) Store docIds as bitset when leafCardinality = 1 to speed up addAll

Adrien Grand (Jira) Thu, 25 Nov 2021 09:43:08 -0800


    [ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17449273#comment-17449273
 ]


Adrien Grand commented on LUCENE-10233:
---------------------------------------

Sorry for the late response and thanks for trying hard to make things work 
with SparseFixedBitSet. The performance is indeed a bit disappointing compared 
to your initial implementation, and I can't think of an easy way to make it 
faster given that the bottleneck seems to be allocations and that there is 
no clean way of reusing the SparseFixedBitSet across leaves.

Maybe we should go back to your initial approach (sorry for the 
back-and-forth!), but in a form that doesn't introduce a new BitSet 
implementation and that adds a new DocIdSetIterator class instead of reusing 
BitSetIterator? That would avoid the trap of callers extracting a BitSet from 
a BitSetIterator and ignoring the docBase.
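
To make the docBase concern concrete, here is a minimal sketch of an iterator that stores bits relative to a docBase and never hands out the underlying bit set, so a caller cannot read the bits without applying the offset. Everything in it is an assumption for illustration: the class name OffsetBitSetIterator is hypothetical, it uses java.util.BitSet rather than a Lucene bit set, and it only mirrors the shape of the DocIdSetIterator contract (docID, nextDoc, advance, cost) without extending the real class, so that it stays self-contained.

```java
import java.util.BitSet;

/** Hypothetical sketch: bit i of the set stands for doc (docBase + i). */
final class OffsetBitSetIterator {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  private final BitSet bits;   // never exposed, so the offset cannot be forgotten
  private final int docBase;   // doc ID represented by bit 0
  private final long cost;
  private int doc = -1;

  OffsetBitSetIterator(BitSet bits, int docBase) {
    this.bits = bits;
    this.docBase = docBase;
    this.cost = bits.cardinality();
  }

  int docID() {
    return doc;
  }

  int nextDoc() {
    if (doc == NO_MORE_DOCS) {
      return doc; // already exhausted; avoid overflowing doc + 1
    }
    return advance(doc + 1);
  }

  /** Moves to the first doc >= target, or NO_MORE_DOCS if there is none. */
  int advance(int target) {
    int from = Math.max(0, target - docBase);
    int next = bits.nextSetBit(from);
    doc = (next < 0) ? NO_MORE_DOCS : docBase + next;
    return doc;
  }

  long cost() {
    return cost;
  }
}
```

With bits 0, 3 and 64 set and a docBase of 100, the iterator yields docs 100, 103 and 164. A consumer that wants a bulk path would need an explicit accessor that returns the bits together with the docBase, which is the point of using a dedicated class.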

> Store docIds as bitset when leafCardinality = 1 to speed up addAll
> ------------------------------------------------------------------
>
>                 Key: LUCENE-10233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10233
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Feng Guo
>            Priority: Major
>         Attachments: SparseFixedBitSet.png
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> In low-cardinality points cases, id blocks will usually store doc ids that 
> all have the same point value, so {{intersect}} ends up in the {{addAll}} 
> logic. If we store the ids as a bitset and give the IntersectVisitor a bulk 
> visiting ability, we can speed up addAll because we can simply OR the block 
> ids into the result.
> The optimization is triggered when all of the following conditions are met:
>  # leafCardinality = 1
>  # max(docId) - min(docId) <= 16 * pointCount (in order to avoid expanding 
> storage too much)
>  # no duplicate doc ids
> I mocked a field that has 10,000,000 docs per value and searched it with a 
> 1-term PointInSetQuery; the build scorer time decreased from 71ms to 8ms.
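
The three trigger conditions and the bulk OR from the quoted description can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the patch: the class LeafBitSetSketch and its methods eligible, encode and orInto are hypothetical names, plain long[] words stand in for Lucene's bit set classes, and the block is aligned to a 64-bit word boundary so that it can be OR-ed into the result word by word.

```java
/** Hypothetical sketch of the eligibility check and the word-wise OR. */
final class LeafBitSetSketch {

  /** True if all three conditions from the issue description hold. */
  static boolean eligible(int[] docIds, int leafCardinality) {
    if (leafCardinality != 1 || docIds.length == 0) {
      return false;
    }
    int min = docIds[0];
    int max = docIds[0];
    for (int d : docIds) {
      min = Math.min(min, d);
      max = Math.max(max, d);
    }
    long span = (long) max - min;
    if (span > 16L * docIds.length) {
      return false; // too sparse: a bitset would expand storage too much
    }
    // Duplicate check with a temporary bitset over [min, max].
    long[] seen = new long[(int) (span >>> 6) + 1];
    for (int d : docIds) {
      int bit = d - min;
      long mask = 1L << bit; // long shifts use only the low 6 bits of the count
      if ((seen[bit >>> 6] & mask) != 0) {
        return false;
      }
      seen[bit >>> 6] |= mask;
    }
    return true;
  }

  /** Encodes doc IDs as words; word i covers docs starting at (firstWord + i) * 64. */
  static long[] encode(int[] docIds, int firstWord, int numWords) {
    long[] words = new long[numWords];
    for (int d : docIds) {
      words[(d >>> 6) - firstWord] |= 1L << d;
    }
    return words;
  }

  /** Bulk "addAll": OR the block's words into the result at the right word offset. */
  static void orInto(long[] result, long[] block, int firstWord) {
    for (int i = 0; i < block.length; i++) {
      result[firstWord + i] |= block[i];
    }
  }
}
```

The design choice worth noting is the word alignment: because the block starts at a multiple of 64, orInto touches one long per 64 candidate docs instead of visiting each doc ID individually.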


