[
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael McCandless updated LUCENE-1483:
---------------------------------------
Attachment: sortCollate.py
sortBench.py
OK, I ran a bunch of sort perf tests on trunk & with the patch.
(Attached the two Python sources for doing this... though they require
some small local mods to run properly.)
Each alg is run with "java -Xms1024M -Xmx1024M -Xbatch -server" on OS
X 10.5.5, java 1.6.0_07-b06-153.
I use two indexes, each with 2M docs. One is docs from Wikipedia
(labeled "wiki"); the other is SortableSimpleDocMaker docs augmented
to include a random country field (labeled "simple"). For each I
created 1-segment, 10-segment and 100-segment indexes. I sort by
score, doc, int and string (meth column for string sorts: ord = true
ord+subord, ordval = ord + value fallback). Queue size is 10.
I ran various queries: query "147" hits ~5K docs, query "text" hits
~97K docs, query "1" hits 386K docs, and the alldocs query hits 2M
docs. qps is queries per sec and warm is the time for the first warmup
query, on trunk; qpsnew & warmnew are the same with the patch. pctg
shows the % gain in qps performance:
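As an aside, the pctg column is just the relative change in qps between trunk and the patched build. A minimal sketch of the computation (the helper name is mine, not from sortCollate.py):

```python
def pct_gain(qps_trunk: float, qps_patch: float) -> float:
    """Percent change in queries/sec from trunk to the patched build."""
    return (qps_patch - qps_trunk) / qps_trunk * 100.0

# First wiki/score row below: 5717.6 qps on trunk vs 5627.5 with the patch
print(f"{pct_gain(5717.6, 5627.5):.1f}%")  # -> -1.6%
```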
||numSeg||index||sortBy||query||topN||meth||hits||warm||qps||warmnew||qpsnew||pctg||
|1|wiki|score|147|10| | 4984| 0.2|5717.6| 0.2|5627.5| -1.6%|
|1|wiki|score|text|10| | 97191| 0.3| 340.9| 0.3| 348.8| 2.3%|
|1|wiki|score|1|10| | 386435| 0.3| 86.7| 0.3| 89.3| 3.0%|
|1|wiki|doc|147|10| | 4984| 0.3|4071.7| 0.3|4649.0| 14.2%|
|1|wiki|doc|text|10| | 97191| 0.3| 225.4| 0.3| 253.7| 12.6%|
|1|wiki|doc|1|10| | 386435| 0.3| 56.9| 0.3| 65.8| 15.6%|
|1|wiki|doc|<all>|10| |2000000| 0.1| 23.0| 0.1| 38.6| 67.8%|
|1|simple|int|text|10| |2000000| 0.7| 10.7| 0.7| 13.5| 26.2%|
|1|simple|int|<all>|10| |2000000| 0.6| 21.1| 0.6| 34.7| 64.5%|
|1|simple|country|text|10|ord|2000000| 0.6| 10.7| 0.6| 13.2| 23.4%|
|1|simple|country|text|10|ordval|2000000| 0.6| 10.7| 0.6| 13.3| 24.3%|
|1|simple|country|<all>|10|ord|2000000| 0.5| 20.7| 0.6| 32.5| 57.0%|
|1|simple|country|<all>|10|ordval|2000000| 0.5| 20.7| 0.6| 34.6| 67.1%|
|1|wiki|title|147|10|ord| 4984| 2.1|3743.8| 2.0|4210.5| 12.5%|
|1|wiki|title|147|10|ordval| 4984| 2.1|3743.8| 2.0|4288.2| 14.5%|
|1|wiki|title|text|10|ord| 97191| 2.1| 144.2| 2.1| 160.3| 11.2%|
|1|wiki|title|text|10|ordval| 97191| 2.1| 144.2| 2.1| 163.5| 13.4%|
|1|wiki|title|1|10|ord| 386435| 2.1| 51.2| 2.1| 63.2| 23.4%|
|1|wiki|title|1|10|ordval| 386435| 2.1| 51.2| 2.1| 64.6| 26.2%|
|1|wiki|title|<all>|10|ord|2000000| 2.1| 21.1| 2.1| 33.2| 57.3%|
|1|wiki|title|<all>|10|ordval|2000000| 2.1| 21.1| 2.1| 35.4| 67.8%|
||numSeg||index||sortBy||query||topN||meth||hits||warm||qps||warmnew||qpsnew||pctg||
|10|wiki|score|147|10| | 4984| 0.3|4228.3| 0.3|4510.6| 6.7%|
|10|wiki|score|text|10| | 97191| 0.3| 294.7| 0.3| 341.5| 15.9%|
|10|wiki|score|1|10| | 386435| 0.4| 75.0| 0.4| 87.0| 16.0%|
|10|wiki|doc|147|10| | 4984| 0.3|3332.2| 0.3|4033.9| 21.1%|
|10|wiki|doc|text|10| | 97191| 0.4| 217.0| 0.4| 277.0| 27.6%|
|10|wiki|doc|1|10| | 386435| 0.4| 54.6| 0.4| 70.5| 29.1%|
|10|wiki|doc|<all>|10| |2000000| 0.1| 12.7| 0.1| 38.6|203.9%|
|10|simple|int|text|10| |2000000| 1.2| 10.3| 0.6| 13.5| 31.1%|
|10|simple|int|<all>|10| |2000000| 1.1| 11.8| 0.8| 34.6|193.2%|
|10|simple|country|text|10|ord|2000000| 0.7| 10.4| 0.5| 13.2| 26.9%|
|10|simple|country|text|10|ordval|2000000| 0.7| 10.4| 0.5| 13.3| 27.9%|
|10|simple|country|<all>|10|ord|2000000| 0.7| 11.5| 0.5| 32.5|182.6%|
|10|simple|country|<all>|10|ordval|2000000| 0.7| 11.5| 0.5| 34.1|196.5%|
|10|wiki|title|147|10|ord| 4984| 12.5|3004.5| 2.1|3124.0| 4.0%|
|10|wiki|title|147|10|ordval| 4984| 12.5|3004.5| 2.1|3353.5| 11.6%|
|10|wiki|title|text|10|ord| 97191| 12.7| 139.4| 2.1| 156.7| 12.4%|
|10|wiki|title|text|10|ordval| 97191| 12.7| 139.4| 2.1| 160.9| 15.4%|
|10|wiki|title|1|10|ord| 386435| 12.7| 50.3| 2.1| 62.3| 23.9%|
|10|wiki|title|1|10|ordval| 386435| 12.7| 50.3| 2.1| 64.1| 27.4%|
|10|wiki|title|<all>|10|ord|2000000| 12.7| 11.4| 2.1| 33.1|190.4%|
|10|wiki|title|<all>|10|ordval|2000000| 12.7| 11.4| 2.1| 35.3|209.6%|
||numSeg||index||sortBy||query||topN||meth||hits||warm||qps||warmnew||qpsnew||pctg||
|100|wiki|score|147|10| | 4984| 0.3|1282.2| 1.7|1162.3| -9.4%|
|100|wiki|score|text|10| | 97191| 0.4| 232.4| 1.3| 275.6| 18.6%|
|100|wiki|score|1|10| | 386435| 0.4| 65.1| 1.4| 80.4| 23.5%|
|100|wiki|doc|147|10| | 4984| 0.4|1170.0| 0.4|1132.0| -3.2%|
|100|wiki|doc|text|10| | 97191| 0.4| 171.7| 0.4| 230.1| 34.0%|
|100|wiki|doc|1|10| | 386435| 0.4| 46.7| 0.4| 67.9| 45.4%|
|100|wiki|doc|<all>|10| |2000000| 0.2| 7.8| 0.1| 41.6|433.3%|
|100|simple|int|text|10| |2000000| 3.3| 8.9| 4.0| 11.0| 23.6%|
|100|simple|int|<all>|10| |2000000| 3.4| 7.7| 1.1| 36.5|374.0%|
|100|simple|country|text|10|ord|2000000| 1.0| 8.8| 0.6| 10.8| 22.7%|
|100|simple|country|text|10|ordval|2000000| 1.0| 8.8| 0.6| 11.0| 25.0%|
|100|simple|country|<all>|10|ord|2000000| 1.0| 7.6| 0.5| 35.0|360.5%|
|100|simple|country|<all>|10|ordval|2000000| 1.0| 7.6| 0.5| 36.3|377.6%|
|100|wiki|title|147|10|ord| 4984| 94.6|1066.9| 2.1| 583.7|-45.3%|
|100|wiki|title|147|10|ordval| 4984| 94.6|1066.9| 2.1| 750.1|-29.7%|
|100|wiki|title|text|10|ord| 97191| 94.9| 110.2| 2.1| 122.7| 11.3%|
|100|wiki|title|text|10|ordval| 97191| 94.9| 110.2| 2.1| 128.4| 16.5%|
|100|wiki|title|1|10|ord| 386435| 94.3| 47.9| 2.1| 58.2| 21.5%|
|100|wiki|title|1|10|ordval| 386435| 94.3| 47.9| 2.1| 60.1| 25.5%|
|100|wiki|title|<all>|10|ord|2000000| 94.6| 7.8| 2.5| 35.6|356.4%|
|100|wiki|title|<all>|10|ordval|2000000| 94.6| 7.8| 2.4| 37.0|374.4%|
It's a ridiculous amount of data to digest... but here are some
initial thoughts:
* These are only single-term queries; I'd expect the gain to be
smaller for multi-term queries, since a smaller percentage of the
time is spent collecting.
* Ord + val fallback (ordval) is generally faster than pure
ord/subord. I think for now we should run with ord + val
fallback? (We can leave ord + subord commented out?).
* It's great that we see decent speedups for "sort by score" which
is presumably the most common sort used.
* We do get slower in certain cases (negative pctg in the rightmost
column): all not-in-the-noise slowdowns were with query "147" on
the 100-segment index. This query hits relatively few docs (~5K),
so this is expected: the new approach spends some time updating
its queue for each subreader, and if the time spent searching is
relatively tiny then this queue update time becomes relatively
big. I think with a larger queue size the slowdown would be worse.
However, I think this is an acceptable tradeoff.
* The gains for field sorting on a single-segment (optimized) index
are impressive. Generally, the more hits encountered, the better
the gains. It's amazing that we see a ~67.8% gain sorting by
docID, country, and title for the alldocs query. My only guess is
a better cache hit rate (because we gain locality by copying
values to local compact arrays).
* Across the board the alldocs query shows very sizable improvements
(5X faster for the 100-segment index; 3X faster for the 10-segment
index).
* I didn't break out the % difference, but warming time with the
patch is way faster than trunk when the index has > 1 segment.
Reopen time should also be fantastically faster (we are
sidestepping something silly happening w/ FieldCache on a
Multi*Reader). Warming on trunk takes ~95 seconds with the
100-segment index!
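To make the per-subreader queue-update tradeoff above concrete, here is a toy Python sketch of collecting the top N hits segment by segment. Names and structure are mine, not Lucene's; real Lucene uses HitQueue/FieldValueHitQueue, and for string sorts the segment boundary is where queued ords get re-based against the next segment's term dictionary:

```python
import heapq

def collect_top_n(segments, n=10):
    """Toy per-segment top-N collection, highest score wins.

    segments: list of (doc_base, scores); scores[local_doc] is that
    doc's score.  The min-heap keeps the n best hits seen so far,
    with the current worst at the root.
    """
    pq = []  # entries are (score, global_doc_id); root = worst of the top n
    for doc_base, scores in segments:
        # This per-segment boundary is where the new approach pays its
        # fixed queue-update cost (e.g. converting queued ords to values).
        # With few total hits, this cost dominates -- hence the slowdown
        # on query "147" with 100 segments.
        for local_doc, score in enumerate(scores):
            global_doc = doc_base + local_doc
            if len(pq) < n:
                heapq.heappush(pq, (score, global_doc))
            elif score > pq[0][0]:
                heapq.heapreplace(pq, (score, global_doc))
    return sorted(pq, key=lambda e: (-e[0], e[1]))

# Two tiny "segments" of 3 docs each; top 2 hits overall:
hits = collect_top_n([(0, [0.1, 0.9, 0.3]), (3, [0.8, 0.2, 0.5])], n=2)
# -> [(0.9, 1), (0.8, 3)]
```

Note the hit-by-hit comparison against pq[0] stays cheap within a segment; only the boundary work grows with segment count, which matches the pattern in the tables (big wins on high-hit-count queries, small losses on "147" at 100 segments).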
> Change IndexSearcher to use MultiSearcher semantics for multiple subreaders
> ---------------------------------------------------------------------------
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
> Issue Type: Improvement
> Affects Versions: 2.9
> Reporter: Mark Miller
> Priority: Minor
> Attachments: LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch,
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch,
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch,
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch,
> sortBench.py, sortCollate.py
>
>
> FieldCache and Filters are forced down to a single segment reader, allowing
> for individual segment reloading on reopen.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.