[jira] Commented: (SOLR-1169) SortedIntDocSet

2009-05-14 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12709605#action_12709605
 ] 

Yonik Seeley commented on SOLR-1169:


We need skipping in our DocSets in order to be able to pass them as filters and 
improve search performance.
One could Sort a HashDocSet every time it's used but that's not desirable.

The way Lucene now uses filters, random access performance is no longer 
important, but being able to efficiently skip is.
Intersection performance is very important to Solr of course, so if we can get 
SortedIntDocSet performance fast enough then it would make sense for it to 
replace HashDocSet.

I've already started working on this, and the results look promising.  I'll 
post a draft soon.

 SortedIntDocSet
 ---

 Key: SOLR-1169
 URL: https://issues.apache.org/jira/browse/SOLR-1169
 Project: Solr
  Issue Type: Improvement
Reporter: Yonik Seeley
Assignee: Yonik Seeley
 Fix For: 1.4


 A DocSet type that can skip to support SOLR-1165

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1169) SortedIntDocSet

2009-05-14 Thread Mike Klaas (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12709645#action_12709645
 ] 

Mike Klaas commented on SOLR-1169:
--

sweet.  intersecting sorted int dicts should be faster in the general case.  
HashSet will of course win when one set is very small, but I expect this to 
still be pretty fast anyway.

 SortedIntDocSet
 ---

 Key: SOLR-1169
 URL: https://issues.apache.org/jira/browse/SOLR-1169
 Project: Solr
  Issue Type: Improvement
Reporter: Yonik Seeley
Assignee: Yonik Seeley
 Fix For: 1.4


 A DocSet type that can skip to support SOLR-1165

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1169) SortedIntDocSet

2009-05-14 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12709713#action_12709713
 ] 

Yonik Seeley commented on SOLR-1169:


The first results are in, and it's looking good.

Algorithm:
For intersectionSize (and intersection), if the sets are near in size, we do a 
linear scan.  A little micro-optimization there got us an extra 13% speedup... 
just for rearranging the order of comparisons base on the fact that one was 
more likely to be true than the other (based on the set sizes).  When the set 
sizes differ by a factor of 8 or more, we use a modified binary search.  The 
first modification keeps track of the lower bound rather than resetting to 0... 
a big win (an obvious optimization... I didn't measure the benefit).  Then 
second modification probes closer to the lower bound (rather than using the 
midpoint) based on the relative size difference of the sets, and leads to a 40% 
performance increase.

Performance results of intersectionSize:

|DocSet sizes|Advantage|Percent|comments
|1-200 x 1-200|Int |53%| random sized DocSets from 
1-200 in size intersected with other DocSets of size 1-200|
|1-3000 x 1-3000   |Int |33% |3000 is the current 
default upper bound of HashDocSet
|1-3000 x 1-3000   |   Int |130% |-client instead of -server
|2000-3000 x 2000-3000 |  Int |60% | only big sets
|2-3 x 2-3|  Int |54% |  only bigger sets
|100-200 x 1000-2000  |  Hash |87% | small sets with big sets
|1-1 x 2-3 |  Int  |74% | smaller sets intersected with 
BitDocSets
|1-3 x 1-3  |  Int |80% | docsets over maxDoc/64 are 
BitDocSets (maxDoc=1M)

So to sum up, only small sets intersected with big sets are slower but 
given that big sets intersected with big sets take a majority of the time, we 
get a nice net win.  It gets more dramatic when intersecting a small set with a 
BitDocSet... these affects are probably due to nicer treatment of the memory 
subsystem when accessing memory in order.  I think these intersections tend to 
be bound by memory bandwidth.

The improvements will also allow us to bump up the max size of the small set 
implementation.  From a memory consumption point of view, the break-even point 
is maxDoc/32.  When I tested using SortedIntDocSets with maxDoc/64, there was 
always a net speedup over maxDoc/32 and maxDoc/100, so this seems to be a good 
balance between performance and saving memory.

Memory savings: SortedIntDocSet is more efficient than HashDocSet at storing 
the same amount of data, and it can be used at larger sizes (relative to 
maxDoc) before performance decreases (another memory win).

Other savings: Faster set creation - Lucene currently delivers docs in order, 
hence so sorting step is needed after collection.

Other notes: I tried a cool partitioning algorithm that I thought would be 
superior - take the middle element of the big set and use it to partition the 
small set.  Say you have set sizes of 100 and 1... you do a single binary 
search on the small set, and now for all 100 elements you have half the big set 
size to search.  Recurse on the corresponding lower and upper partitions until 
they are small enough to use a different method such as a modified binary 
search.  This approach worked, but it wasn't able to beat the modified binary 
search alone once I put in all the optimizations... *shrugs*


 SortedIntDocSet
 ---

 Key: SOLR-1169
 URL: https://issues.apache.org/jira/browse/SOLR-1169
 Project: Solr
  Issue Type: Improvement
Reporter: Yonik Seeley
Assignee: Yonik Seeley
 Fix For: 1.4

 Attachments: SOLR-1169.patch


 A DocSet type that can skip to support SOLR-1165

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.