[
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael McCandless updated LUCENE-3892:
---------------------------------------
Attachment: LUCENE-3892-direct-IntBuffer.patch
The For index is 5.2 GB vs 4.9 GB for vInt: not bad to have only about a 5%
increase in index size when using the For postings format (10M Wikipedia index).
{quote}
Get more direct access to the file as an int[]; eg MMapDir could
expose an IntBuffer from its ByteBuffer (saving the initial copy
into byte[] that we now do).
{quote}
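As a minimal sketch of the idea in the quote above, the JDK lets you view a ByteBuffer as an IntBuffer without copying any bytes. This is not code from the patch: the class and method names (IntViewSketch, intView) are hypothetical, and a direct buffer stands in for the mapped file that MMapDirectory would provide.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.IntBuffer;

// Hypothetical illustration only; not part of Lucene or of the attached patch.
final class IntViewSketch {

    /** Returns an int view over the buffer's remaining bytes; no bytes are copied. */
    static IntBuffer intView(ByteBuffer bytes) {
        // duplicate() so the caller's position/limit are left untouched;
        // big-endian is an assumption about how the ints were written.
        return bytes.duplicate().order(ByteOrder.BIG_ENDIAN).asIntBuffer();
    }

    public static void main(String[] args) {
        // Stand-in for a memory-mapped region.
        ByteBuffer bytes = ByteBuffer.allocateDirect(16);
        bytes.putInt(0, 7).putInt(4, 42);

        IntBuffer ints = intView(bytes);
        System.out.println(ints.get(0) + " " + ints.get(1));
    }
}
```

The view shares storage with the underlying buffer, so the saving is the intermediate byte[] copy; each IntBuffer.get call still goes through the buffer's own bounds checks.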
I tested this by making hacked-up changes to Billy's For patch:
it now requires MMapDirectory and pulls an IntBuffer directly from its
ByteBuffer, saving the initial copy of the bytes into a byte[]. But,
curiously, it didn't seem to improve things much:
{noformat}
Task              QPS base  StdDev base  QPS for  StdDev for  Pct diff
AndHighMed           24.32         0.60    14.24        0.41  -44% - -38%
PKLookup            131.98         3.09   108.35        1.47  -20% - -14%
AndHighHigh           5.36         0.18     4.66        0.02  -16% -  -9%
Phrase                1.48         0.02     1.33        0.10  -18% -  -2%
SloppyPhrase          1.40         0.04     1.26        0.03  -13% -  -5%
SpanNear              1.14         0.01     1.04        0.02  -10% -  -6%
IntNRQ               12.13         0.70    11.27        0.46  -15% -   2%
Prefix3              34.51         1.17    34.11        1.28   -8% -   6%
Fuzzy1               90.63         1.74    89.68        1.46   -4% -   2%
Respell              77.22         2.62    76.99        1.62   -5% -   5%
Wildcard             11.84         0.40    12.20        0.37   -3% -   9%
Fuzzy2               34.34         0.82    36.16        1.08    0% -  11%
TermBGroup1M1P        4.71         0.11     5.02        0.18    0% -  12%
OrHighMed             7.87         0.28     8.50        0.55   -2% -  19%
TermBGroup1M          3.47         0.03     3.78        0.03    7% -  11%
TermGroup1M           2.96         0.01     3.25        0.03    8% -  11%
OrHighHigh            3.55         0.12     3.91        0.21    0% -  20%
Term                  9.72         0.28    10.87        0.44    4% -  19%
{noformat}
Maybe, instead, reading into an int[] and decoding from an int array
(hopefully avoiding bounds checks) will be faster than calling
IntBuffer.get for each encoded int...
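To make the two access patterns concrete, here is a rough sketch that decodes the same block both ways. It is an assumption-laden toy, not the For patch's decoder: the class BlockDecodeSketch, its method names, and the packed layout (fixed bit width that divides 32, values stored low bits first) are all hypothetical. It shows only the difference in how the encoded ints are fetched and makes no claim about which is faster.

```java
import java.nio.IntBuffer;
import java.util.Arrays;

// Hypothetical illustration only; the block layout here is assumed, not Lucene's.
final class BlockDecodeSketch {

    /** Variant 1: one IntBuffer.get(int) call per encoded int. */
    static void decodePerGet(IntBuffer in, int encodedInts, int bitsPerValue, int[] out) {
        final int perInt = 32 / bitsPerValue;               // assumes bitsPerValue divides 32
        final int mask = (int) ((1L << bitsPerValue) - 1);
        int o = 0;
        for (int i = 0; i < encodedInts; i++) {
            int packed = in.get(i);                         // buffer call per encoded int
            for (int j = 0; j < perInt; j++) {
                out[o++] = (packed >>> (j * bitsPerValue)) & mask;
            }
        }
    }

    /** Variant 2: one bulk read into an int[], then decode from the array. */
    static void decodeViaArray(IntBuffer in, int encodedInts, int bitsPerValue,
                               int[] scratch, int[] out) {
        IntBuffer dup = in.duplicate();                     // leave the caller's position alone
        dup.position(0);
        dup.get(scratch, 0, encodedInts);                   // single bulk copy
        final int perInt = 32 / bitsPerValue;
        final int mask = (int) ((1L << bitsPerValue) - 1);
        int o = 0;
        for (int i = 0; i < encodedInts; i++) {
            int packed = scratch[i];                        // plain array access
            for (int j = 0; j < perInt; j++) {
                out[o++] = (packed >>> (j * bitsPerValue)) & mask;
            }
        }
    }

    public static void main(String[] args) {
        // Two encoded ints holding eight 8-bit values, low bits first.
        int[] packed = {0x04030201, 0x08070605};
        IntBuffer buf = IntBuffer.wrap(packed);
        int[] a = new int[8];
        int[] b = new int[8];
        decodePerGet(buf, 2, 8, a);
        decodeViaArray(buf, 2, 8, new int[2], b);
        System.out.println(Arrays.equals(a, b));
    }
}
```

Whether the array variant wins would have to be measured: it adds a copy, but the inner loop then works on a plain int[], which is where the JIT is most likely to eliminate bounds checks.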
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta,
> Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
> Key: LUCENE-3892
> URL: https://issues.apache.org/jira/browse/LUCENE-3892
> Project: Lucene - Java
> Issue Type: Improvement
> Reporter: Michael McCandless
> Labels: gsoc2012, lucene-gsoc-12
> Fix For: 4.1
>
> Attachments: LUCENE-3892-direct-IntBuffer.patch,
> LUCENE-3892_for.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch,
> LUCENE-3892_pfor.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of open issues with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.