[
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13288675#comment-13288675
]
Michael McCandless commented on LUCENE-3892:
--------------------------------------------
Excellent! All tests pass for me w/ the PFor postings format as
well... this is a great starting point :) One Solr test failed
(ContentStreamTest)... but I think it was a false failure...
I did notice the tests seem to run slower, especially certain ones eg
TestJoinUtil.
Still missing a couple license headers (TestMin, TestCompress)...
I ran a quick perf test using
http://code.google.com/a/apache-extras.org/p/luceneutil on a 10M doc
Wikipedia index.
Indexing time is ~18% slower than Lucene40PostingsFormat (1261 sec for
PFor vs 1071 sec for Lucene40).
But more important is the slower search times:
{noformat}
Task            QPS base  StdDev base  QPS pfor  StdDev pfor     Pct diff
Phrase              8.52         0.50      4.43         0.40  -55% - -39%
SloppyPhrase       12.52         0.39      7.87         0.51  -43% - -30%
AndHighMed         67.69         2.82     44.22         1.47  -39% - -29%
SpanNear            5.19         0.12      3.90         0.28  -31% - -17%
PKLookup          112.16         1.71     95.61         1.30  -17% - -12%
AndHighHigh        13.22         0.34     11.86         0.72  -17% -  -2%
Wildcard           46.04         0.37     41.68         4.45  -19% -   1%
Fuzzy1             50.11         2.03     48.06         1.91  -11% -   3%
OrHighMed           9.26         0.48      8.90         0.37  -12% -   5%
OrHighHigh         12.28         0.56     11.83         0.49  -11% -   5%
TermBGroup1M1P     40.47         1.94     39.88         2.51  -11% -  10%
Fuzzy2             53.71         2.66     53.01         2.08   -9% -   7%
TermGroup1M        36.46         1.21     35.99         1.58   -8% -   6%
TermBGroup1M       55.53         1.99     55.26         2.68   -8% -   8%
Respell            69.71         4.49     69.73         2.07   -8% -  10%
Term               94.38         7.62     94.96        12.19  -18% -  23%
Prefix3            41.63         0.34     42.21         5.82  -13% -  16%
IntNRQ              7.08         0.15      7.28         1.29  -17% -  23%
{noformat}
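For anyone reading the Pct diff column: the ranges in the rows I spot-checked are consistent with comparing the two runs one StdDev apart in each direction and truncating to a whole percent. The sketch below is only that inference written out; PctDiffRange and its methods are hypothetical names, and the formula is an assumption derived from the numbers in the table, not taken from luceneutil.

```java
// Hypothetical helper: the formula is inferred from the table, not from luceneutil.
final class PctDiffRange {
    // Pessimistic end: pfor one StdDev low, base one StdDev high.
    static int low(double qpsBase, double sdBase, double qpsCmp, double sdCmp) {
        return (int) (100.0 * ((qpsCmp - sdCmp) / (qpsBase + sdBase) - 1.0));
    }

    // Optimistic end: pfor one StdDev high, base one StdDev low.
    static int high(double qpsBase, double sdBase, double qpsCmp, double sdCmp) {
        return (int) (100.0 * ((qpsCmp + sdCmp) / (qpsBase - sdBase) - 1.0));
    }

    public static void main(String[] args) {
        // Phrase row: QPS base 8.52 (StdDev 0.50), QPS pfor 4.43 (StdDev 0.40).
        System.out.println(low(8.52, 0.50, 4.43, 0.40) + "% - "
                + high(8.52, 0.50, 4.43, 0.40) + "%");
    }
}
```

Read this way, a range that straddles 0% (Term, Prefix3, IntNRQ) means the difference is within the noise of the run.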
The queries that do skipping are quite a bit slower; this makes sense,
since on skip we do a full block decode. A smaller block size (we use
128 now right?) should help I think.
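To make the skipping cost concrete, here is a toy model (not the patch's code) of a block-encoded doc ID list. It assumes the skip data can rule out whole blocks, but that any block that might hold the target has to be decoded in full. BlockSkipSketch, advance and decodedCount are hypothetical names, and the "decode" is just an array copy standing in for the real bit unpacking.

```java
import java.util.Arrays;

// Toy model only: the array copy stands in for decoding a packed block.
final class BlockSkipSketch {
    final int[] docs;      // sorted doc IDs, stand-in for the encoded postings
    final int blockSize;
    int decodedCount = 0;  // how many values have been decoded so far

    BlockSkipSketch(int[] docs, int blockSize) {
        this.docs = docs;
        this.blockSize = blockSize;
    }

    // Returns the first doc >= target, or Integer.MAX_VALUE if there is none.
    int advance(int target) {
        int numBlocks = (docs.length + blockSize - 1) / blockSize;
        for (int b = 0; b < numBlocks; b++) {
            int start = b * blockSize;
            int end = Math.min(start + blockSize, docs.length);
            // "Skip data": the last doc of a block says whether it can hold target.
            if (docs[end - 1] < target) {
                continue; // whole block skipped without decoding
            }
            // The whole block is decoded even though one value is wanted.
            int[] block = Arrays.copyOfRange(docs, start, end);
            decodedCount += block.length;
            for (int d : block) {
                if (d >= target) {
                    return d;
                }
            }
        }
        return Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        int[] docs = new int[1024];
        for (int i = 0; i < docs.length; i++) {
            docs[i] = 2 * i;
        }
        for (int size : new int[] {128, 32}) {
            BlockSkipSketch s = new BlockSkipSketch(docs, size);
            int doc = s.advance(1000);
            System.out.println("blockSize=" + size + " doc=" + doc
                    + " decoded=" + s.decodedCount);
        }
    }
}
```

In this model each successful skip pays for one full block, so the decode work per skip scales with the block size; that is the reason a smaller block might help the skip-heavy queries.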
It's strange that the non-skipping queries (Term, OrHighMed,
OrHighHigh) don't show any performance gain ... maybe we need to
optimize the decode... or it could be that the removal of the bulk API
is hurting us here.
I'm also curious whether the results would improve if we tried a pure
FOR (no patching, so we must set numBits according to the max value =
larger index but hopefully faster decode)...
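As a rough illustration of what pure FOR means here, the sketch below picks numBits from the largest value in a block and packs every value at that width, with no exceptions to patch on decode. It is a minimal sketch under my own assumptions (non-negative ints, values packed into a long[]); ForSketch and its methods are hypothetical names and not the code in the patch.

```java
// Minimal pure-FOR sketch: one bit width per block, no patching.
final class ForSketch {
    // Bits needed for the largest value in the block (at least 1).
    // Assumes all values are non-negative.
    static int numBitsFor(int[] block) {
        int max = 0;
        for (int v : block) {
            max = Math.max(max, v);
        }
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(max));
    }

    // Packs each value into numBits bits, low bits first.
    static long[] pack(int[] block, int numBits) {
        long[] out = new long[(block.length * numBits + 63) / 64];
        for (int i = 0; i < block.length; i++) {
            int bitPos = i * numBits;
            int word = bitPos >>> 6;
            int shift = bitPos & 63;
            out[word] |= ((long) block[i]) << shift;
            if (shift + numBits > 64) {
                // Value straddles two words: spill the high bits into the next.
                out[word + 1] |= ((long) block[i]) >>> (64 - shift);
            }
        }
        return out;
    }

    // Decode is a fixed-width loop with no exception handling.
    static int[] unpack(long[] packed, int numBits, int count) {
        long mask = (1L << numBits) - 1;
        int[] out = new int[count];
        for (int i = 0; i < count; i++) {
            int bitPos = i * numBits;
            int word = bitPos >>> 6;
            int shift = bitPos & 63;
            long v = packed[word] >>> shift;
            if (shift + numBits > 64) {
                v |= packed[word + 1] << (64 - shift);
            }
            out[i] = (int) (v & mask);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] block = new int[128];
        java.util.Arrays.fill(block, 3);
        block[77] = 100000; // a single outlier sets the width for the whole block
        int numBits = numBitsFor(block);
        long[] packed = pack(block, numBits);
        System.out.println("numBits=" + numBits + " bytes=" + packed.length * 8);
    }
}
```

The outlier in the example shows the trade-off: one large value widens all 128 slots, which is the larger index, while the decode loop stays branch-free apart from the word-boundary check.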
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta,
> Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
> Key: LUCENE-3892
> URL: https://issues.apache.org/jira/browse/LUCENE-3892
> Project: Lucene - Java
> Issue Type: Improvement
> Reporter: Michael McCandless
> Labels: gsoc2012, lucene-gsoc-12
> Fix For: 4.1
>
> Attachments: LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch,
> LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.