[
https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael McCandless updated LUCENE-4609:
---------------------------------------
Attachment: LUCENE-4609.patch
Patch, w/ a "custom" (not using our PackedInts APIs) packed ints
encoder/decoder. It only uses as many bytes as are necessary, and packs bpv &
"leftoverBits" into a single byte header.
I tested on the first 1M Wikipedia docs ... and performance is much worse than
the current default in trunk. Admittedly it's not quite fair (trunk has a
specialized vInt/dGap decoder, while the patch keeps dGap decoding separate from
the packed-int decode), and admittedly this decoder will be slower than the
optimized oal.util.PackedInts ... but perf is so far off that I find it hard to
believe PackedInts can match vInt even after optimizing.
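To make the "dGap separate from packed decode" point concrete, here's an
illustrative sketch of the two decode shapes (not the actual trunk or patch
code):
{noformat}
import java.io.DataInput;
import java.io.IOException;

class DecodeShapes {
  // Fused vInt + dGap: each byte's high bit marks a continuation; deltas are
  // re-accumulated into absolute ordinals while reading.
  static void decodeVIntDGap(DataInput in, int[] out, int count) throws IOException {
    int prev = 0;
    for (int i = 0; i < count; i++) {
      int delta = 0, shift = 0;
      byte b;
      do {
        b = in.readByte();
        delta |= (b & 0x7F) << shift;
        shift += 7;
      } while ((b & 0x80) != 0);
      prev += delta;
      out[i] = prev;
    }
  }

  // Separate pass: bulk-decode the packed deltas first (not shown), then undo
  // the dGap with a prefix sum.
  static void undoDGap(int[] deltas, int count) {
    for (int i = 1; i < count; i++) {
      deltas[i] += deltas[i - 1];
    }
  }
}
{noformat}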
Trunk gets these results:
{noformat}
                    Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                PKLookup      203.77      (1.8%)      202.25      (1.8%)   -0.7% (  -4% -    2%)
                HighTerm       20.43      (1.8%)       20.53      (0.8%)    0.5% (  -2% -    3%)
                 MedTerm       33.12      (1.7%)       33.30      (0.9%)    0.5% (  -2% -    3%)
                 LowTerm       87.55      (3.0%)       88.59      (2.5%)    1.2% (  -4% -    6%)
{noformat}
Patch gets this:
{noformat}
                    Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                HighTerm       10.82      (3.6%)       10.69      (4.4%)   -1.2% (  -8% -    7%)
                 MedTerm       19.33      (3.2%)       19.10      (4.0%)   -1.2% (  -8% -    6%)
                 LowTerm       67.75      (2.8%)       67.11      (3.0%)   -0.9% (  -6% -    5%)
                PKLookup      196.49      (1.0%)      196.24      (1.9%)   -0.1% (  -3% -    2%)
{noformat}
(NOTE: base and comp are the same within each run, so ignore the differences
within each run (that's just noise) and compare the absolute numbers across the
two runs, i.e. HighTerm gets ~20.43 QPS with trunk but only ~10.82 with the
patch.)
Also: trunk took ~63 MB for the DV files while the patch took ~84 MB. Net/net I
think postings compress better with PackedInts than facet ords do (at least for
the 9 facet fields I'm using on Wikipedia)...
> Write a PackedIntsEncoder/Decoder for facets
> --------------------------------------------
>
> Key: LUCENE-4609
> URL: https://issues.apache.org/jira/browse/LUCENE-4609
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/facet
> Reporter: Shai Erera
> Priority: Minor
> Attachments: LUCENE-4609.patch, LUCENE-4609.patch
>
>
> Today the facets API lets you write an IntEncoder/Decoder to encode/decode the
> category ordinals. We have several such encoders, including VInt (the default)
> and block encoders.
> It would be interesting to implement and benchmark a
> PackedIntsEncoder/Decoder, with potentially two variants: (1) one that receives
> bitsPerValue up front, for when you know, e.g., that you have a small taxonomy
> and therefore the max ordinal you can see, and (2) one that decides on the
> optimal bitsPerValue for each doc and writes it as a header in the byte[] or
> something.
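As a rough illustration of variant (2) above, here is a hedged sketch of
per-doc bitsPerValue selection (hypothetical standalone code, not the facet
module's IntEncoder/IntDecoder API):
{noformat}
class PerDocPackedSketch {
  // Variant (2) sketch: per doc, pick the smallest bpv that fits the largest
  // ordinal, write it as a one-byte header, then pack the values MSB-first.
  static byte[] encodeDoc(int[] ords, int count) {
    int max = 0;
    for (int i = 0; i < count; i++) {
      max = Math.max(max, ords[i]);
    }
    int bpv = Math.max(1, 32 - Integer.numberOfLeadingZeros(max));
    int numBytes = (bpv * count + 7) / 8;
    byte[] out = new byte[1 + numBytes];
    out[0] = (byte) bpv; // header: just bpv here; the patch also packs leftoverBits
    int bitPos = 0;
    for (int i = 0; i < count; i++) {
      for (int b = bpv - 1; b >= 0; b--) {
        if (((ords[i] >>> b) & 1) != 0) {
          out[1 + (bitPos >>> 3)] |= (byte) (1 << (7 - (bitPos & 7)));
        }
        bitPos++;
      }
    }
    return out;
  }
}
{noformat}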