[https://issues.apache.org/jira/browse/SOLR-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709713#action_12709713]
Yonik Seeley commented on SOLR-1169:
The first results are in, and it's looking good.
Algorithm:
For intersectionSize (and intersection), if the sets are near in size, we do a
linear scan. A little micro-optimization there got us an extra 13% speedup...
just for rearranging the order of comparisons based on the fact that one was
more likely to be true than the other (based on the set sizes). When the set
sizes differ by a factor of 8 or more, we use a modified binary search. The
first modification keeps track of the lower bound rather than resetting to 0...
a big win (an obvious optimization... I didn't measure the benefit). The
second modification probes closer to the lower bound (rather than using the
midpoint) based on the relative size difference of the sets, and leads to a 40%
performance increase.
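A minimal sketch of the two strategies described above, assuming sorted int arrays of doc ids. This is not the actual SortedIntDocSet code from the patch: the class name and the simple factor-of-8 cutoff are illustrative, and it only shows the first modification (carrying the lower bound forward via java.util.Arrays.binarySearch), not the second (probing closer to the lower bound).

```java
import java.util.Arrays;

public class SortedIntersect {
    // Dispatch: linear scan when the sorted sets are near in size,
    // lower-bound-tracking binary search when they differ by 8x or more.
    public static int intersectionSize(int[] a, int[] b) {
        if (a.length > b.length) { int[] t = a; a = b; b = t; } // a = smaller
        if (b.length < a.length * 8) return linearScanSize(a, b);
        return searchSize(a, b);
    }

    // Classic merge-style scan over two sorted arrays.
    public static int linearScanSize(int[] a, int[] b) {
        int i = 0, j = 0, count = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) { count++; i++; j++; }
            else if (a[i] < b[j]) i++;
            else j++;
        }
        return count;
    }

    // For each element of the small set, binary search the large set,
    // restarting from the last position instead of resetting the low end to 0.
    public static int searchSize(int[] small, int[] big) {
        int count = 0, low = 0;
        for (int doc : small) {
            int idx = Arrays.binarySearch(big, low, big.length, doc);
            if (idx >= 0) { count++; low = idx + 1; }
            else low = -idx - 1; // insertion point becomes the new lower bound
        }
        return count;
    }

    public static void main(String[] args) {
        int[] a = {2, 5, 9};
        int[] b = {1, 2, 3, 5, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80,
                   90, 100, 110, 120, 130, 140, 150, 160, 170, 180};
        System.out.println(intersectionSize(a, b)); // prints 3
    }
}
```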
Performance results of intersectionSize:
||DocSet sizes||Advantage||Percent||Comments||
|1-200 x 1-200|Int|53%|random sized DocSets from 1-200 in size intersected with other DocSets of size 1-200|
|1-3000 x 1-3000|Int|33%|3000 is the current default upper bound of HashDocSet|
|1-3000 x 1-3000|Int|130%|-client instead of -server|
|2000-3000 x 2000-3000|Int|60%|only big sets|
|2-3 x 2-3|Int|54%|only bigger sets|
|100-200 x 1000-2000|Hash|87%|small sets with big sets|
|1-1 x 2-3|Int|74%|smaller sets intersected with BitDocSets|
|1-3 x 1-3|Int|80%|docsets over maxDoc/64 are BitDocSets (maxDoc=1M)|
So to sum up, only small sets intersected with big sets are slower, but
given that big sets intersected with big sets take the majority of the time, we
get a nice net win. It gets more dramatic when intersecting a small set with a
BitDocSet... these effects are probably due to nicer treatment of the memory
subsystem when accessing memory in order. I think these intersections tend to
be bound by memory bandwidth.
The improvements will also allow us to bump up the max size of the small set
implementation. From a memory consumption point of view, the break-even point
is maxDoc/32. When I tested using SortedIntDocSets with maxDoc/64, there was
always a net speedup over maxDoc/32 and maxDoc/100, so this seems to be a good
balance between performance and saving memory.
Memory savings: SortedIntDocSet is more efficient than HashDocSet at storing
the same amount of data, and it can be used at larger sizes (relative to
maxDoc) before performance decreases (another memory win).
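The maxDoc/32 break-even point above falls out of simple arithmetic: a bit set always costs maxDoc bits, while a sorted int array costs 32 bits per document it holds, so the two are equal in size when the set holds maxDoc/32 documents. A back-of-the-envelope sketch (the class and method names are illustrative, not from the patch):

```java
public class BreakEven {
    // A bit-set representation always allocates one bit per document in the index.
    public static long bitSetBytes(int maxDoc) { return maxDoc / 8; }

    // A sorted int array allocates 4 bytes per document actually in the set.
    public static long sortedIntBytes(int numDocs) { return 4L * numDocs; }

    public static void main(String[] args) {
        int maxDoc = 1_000_000;
        System.out.println(bitSetBytes(maxDoc));         // 125000 bytes
        // break-even: equal memory when the set holds maxDoc/32 docs
        System.out.println(sortedIntBytes(maxDoc / 32)); // 125000 bytes
        // the maxDoc/64 cutoff uses half the memory of the bit set
        System.out.println(sortedIntBytes(maxDoc / 64)); // 62500 bytes
    }
}
```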
Other savings: Faster set creation - Lucene currently delivers docs in order,
so no sorting step is needed after collection.
Other notes: I tried a cool partitioning algorithm that I thought would be
superior - take the middle element of the big set and use it to partition the
small set. Say you have set sizes of 100 and 1... you do a single binary
search on the small set, and now for all 100 elements you have half the big set
size to search. Recurse on the corresponding lower and upper partitions until
they are small enough to use a different method, such as the modified binary
search. This approach worked, but it wasn't able to beat the modified binary
search alone once I put in all the optimizations... *shrugs*
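A sketch of that abandoned partitioning idea, for the curious (hypothetical code, not from the patch; the recursion cutoff of 4 and the plain-binary-search fallback are my own choices): take the middle element of the big set, split the small set around it with one binary search, and recurse on the matching halves.

```java
import java.util.Arrays;

public class PartitionIntersect {
    // Intersection size of small[sLo..sHi) and big[bLo..bHi), both sorted.
    public static int size(int[] small, int sLo, int sHi,
                           int[] big, int bLo, int bHi) {
        if (sLo >= sHi || bLo >= bHi) return 0;
        if (sHi - sLo <= 4) { // small enough: fall back to plain binary searches
            int count = 0;
            for (int i = sLo; i < sHi; i++)
                if (Arrays.binarySearch(big, bLo, bHi, small[i]) >= 0) count++;
            return count;
        }
        int bMid = (bLo + bHi) >>> 1;
        int pivot = big[bMid];
        // One binary search partitions the small set around the pivot.
        int sMid = Arrays.binarySearch(small, sLo, sHi, pivot);
        int hit = sMid >= 0 ? 1 : 0;
        int sSplit = sMid >= 0 ? sMid : -sMid - 1;
        // Elements below the pivot can only match big[bLo..bMid);
        // elements above it can only match big[bMid+1..bHi).
        return hit
             + size(small, sLo, sSplit, big, bLo, bMid)
             + size(small, sMid >= 0 ? sMid + 1 : sSplit, sHi, big, bMid + 1, bHi);
    }

    public static void main(String[] args) {
        int[] a = {3, 7, 8, 15, 21, 40};
        int[] b = {1, 2, 3, 4, 5, 6, 7, 9, 10, 15, 20, 21, 22, 30, 40, 50};
        System.out.println(size(a, 0, a.length, b, 0, b.length)); // prints 5
    }
}
```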
SortedIntDocSet
---
Key: SOLR-1169
URL: https://issues.apache.org/jira/browse/SOLR-1169
Project: Solr
Issue Type: Improvement
Reporter: Yonik Seeley
Assignee: Yonik Seeley
Fix For: 1.4
Attachments: SOLR-1169.patch
A DocSet type that can skip to support SOLR-1165