[ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10233:
------------------------------
    Description: 
In low-cardinality point fields, the doc ids in a leaf block usually all share 
the same point value, so {{intersect}} ends up in the {{addAll}} path. If we 
store those ids as a bitset and give the IntersectVisitor a bulk-visit method, 
we can speed up addAll, because it only needs to 'or' the block's ids into the 
result.
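
As a rough illustration of why the bulk 'or' helps, here is a minimal sketch 
using java.util.BitSet rather than Lucene's own classes. The class and method 
names (BulkAddAllSketch, addAllOneByOne, addAllBulk) are hypothetical and are 
not part of the patch or of the IntersectVisitor API; the sketch also assumes 
the block bitset is indexed by absolute doc id.

```java
import java.util.BitSet;

// Hypothetical sketch, not Lucene code: compares per-doc visiting with a bulk OR.
final class BulkAddAllSketch {

    // Baseline: the visitor is handed the block's doc ids one at a time.
    static void addAllOneByOne(BitSet result, int[] blockDocIds) {
        for (int docId : blockDocIds) {
            result.set(docId);
        }
    }

    // Idea from the issue: the block is already a bitset, so adding every doc
    // in it is a word-wise OR into the result instead of a per-doc loop.
    // Assumption: blockBits is indexed by absolute doc id.
    static void addAllBulk(BitSet result, BitSet blockBits) {
        result.or(blockBits);
    }

    public static void main(String[] args) {
        int[] ids = {3, 5, 64, 130};
        BitSet block = new BitSet();
        for (int id : ids) {
            block.set(id);
        }

        BitSet perDoc = new BitSet();
        addAllOneByOne(perDoc, ids);

        BitSet bulk = new BitSet();
        addAllBulk(bulk, block);

        // Both paths should mark the same set of documents.
        System.out.println("same result: " + perDoc.equals(bulk));
    }
}
```

The OR touches one 64-bit word per 64 possible doc ids, which is where the 
saving would come from when the ids in a block are dense.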

The optimization is triggered only when all of the following conditions hold:
 # leafCardinality = 1
 # max(docId) - min(docId) <= 16 * pointCount (to avoid inflating storage too 
much)
 # no duplicate doc ids
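
The three conditions above could be checked along these lines. This is only a 
sketch under assumptions: the class and method names are hypothetical, 
pointCount is taken to be the number of doc ids in the block, and the ids are 
sorted on a copy purely to make the duplicate check simple.

```java
import java.util.Arrays;

// Hypothetical sketch, not Lucene code: decides whether a block qualifies.
final class BitsetEligibilitySketch {

    static boolean canStoreAsBitset(int leafCardinality, int[] docIds) {
        // Condition 1: every point in the leaf block has the same value.
        if (leafCardinality != 1 || docIds.length == 0) {
            return false;
        }

        // Sort a copy so duplicates become adjacent and min/max sit at the ends.
        int[] sorted = docIds.clone();
        Arrays.sort(sorted);

        // Condition 3: no duplicate doc ids.
        for (int i = 1; i < sorted.length; i++) {
            if (sorted[i] == sorted[i - 1]) {
                return false;
            }
        }

        // Condition 2: max(docId) - min(docId) <= 16 * pointCount.
        // Assumption: pointCount equals the number of doc ids in the block.
        long range = (long) sorted[sorted.length - 1] - sorted[0];
        return range <= 16L * sorted.length;
    }

    public static void main(String[] args) {
        System.out.println(canStoreAsBitset(1, new int[] {10, 11, 12, 20}));
        System.out.println(canStoreAsBitset(1, new int[] {0, 1000}));
    }
}
```

With the factor of 16, a qualifying block spends at most about 16 bits per 
doc id on the bitset, which is the storage bound the second condition is 
meant to enforce.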

I mocked a field that has 10,000,000 docs per value and searched it with a 
single-term PointInSetQuery; the time to build the scorer dropped from 71ms to 
8ms.

(WIP. I am posting this early to see whether you think this optimization makes 
sense.)

  was:
In low cardinality points cases, id blocks will usually store doc ids that have 
the same point value, and intersect will get into addAll logic. If we store ids 
as bitset when the leafCardinality = 1, and give the IntersectVisitor bulk 
visit ability, we can speed up addAll because we can just execute the 'or' 
logic between the result and the block ids.

I mocked a field that has 10,000,000 docs per value and searched it with a 
PointInSetQuery with 1 term; the build scorer time decreased from 71ms to 8ms.

Concerns:
1. A bitset could occupy more disk space. (Maybe we can restrict this 
optimization to blocks where (max-min) <= n * count?)
2. MergeReader will become a bit slower because it needs to iterate docIds one 
by one.


> Store docIds as bitset when leafCardinality = 1 to speed up addAll
> ------------------------------------------------------------------
>
>                 Key: LUCENE-10233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10233
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Feng Guo
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h



