[
https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13555948#comment-13555948
]
Shai Erera commented on LUCENE-4609:
------------------------------------
How come the order of the lines in the two tables is different? At first I
thought you had misread the table, because the results don't look that bad :).
You could very easily push DGap into the encoder, but judging from LUCENE-4686,
that bought us a ~6% improvement there, so it's not likely to double the
decoding speed here.
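For concreteness, a minimal sketch of what pushing DGap into the encoder's own
loop looks like (illustrative names, not the actual facet-module API; assumes
the values arrive sorted and de-duped, as after Sorting+Unique):
{noformat}
import java.io.IOException;
import org.apache.lucene.store.ByteArrayDataOutput;

class DGapInlineEncoder {
  // Gap-encode inline instead of running a separate DGap pass first:
  // each value is replaced by its delta from the previous one right
  // before it is byte-encoded.
  void encode(int[] values, int length, ByteArrayDataOutput out) throws IOException {
    int prev = 0;
    for (int i = 0; i < length; i++) {
      out.writeVInt(values[i] - prev); // or hand the gap to a packed writer
      prev = values[i];
    }
  }
}
{noformat}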
Also, I see that you have a static MASK table. Given the recent thread about
FixedBitSet and static mask tables (people were against them because a couple
of bitwise ops do better), I wonder whether getting rid of it would improve
perf.
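Something like the following is what I have in mind (a sketch, assuming MASK is
indexed by bpv; I haven't measured it):
{noformat}
// Static table variant: one array load per mask lookup.
private static final long[] MASK = new long[65];
static {
  for (int bpv = 1; bpv <= 64; bpv++) {
    MASK[bpv] = -1L >>> (64 - bpv);
  }
}

// Table-free variant: a shift does the same job, valid for bpv in 1..64.
// (Note: (1L << bpv) - 1 would break at bpv == 64, since Java shifts mod 64.)
static long mask(int bpv) {
  return -1L >>> (64 - bpv);
}
{noformat}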
I think the main problem with these ordinal encoders is that we usually encode
small groups of integers. So just for fun, I ran EncodingSpeed with 4
encoders/decoders, encoding 2430 values in one go (a single call to encode).
The results are below:
{noformat}
Encoder                            Bits/Int  Encode Time [ms]  Encode Time [µs/int]  Decode Time [ms]  Decode Time [µs/int]
---------------------------------------------------------------------------------------------------------------------------
Sorting(Unique(DGap(PackedInts)))   11.0189              8561               85.6105               590                5.9000
Sorting(Unique(DGap(Packed)))       11.0058              8206               82.0605               806                8.0601
Sorting(Unique(DGap(VInt8)))         8.5597              7133               71.3305               263                2.6300
Sorting(Unique(DGapVInt8))           8.5597              7208               72.0805               251                2.5100
{noformat}
* PackedInts is an encoder similar to what Gilad wrote here; it uses
PackedInts.Writer/ReaderNoHeader (and writes a 1-byte header itself).
* Packed is Mike's encoder.
* I wrapped all encoders with DGap, except for the last one, which is
specialized (see the composition sketch below).
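For readers unfamiliar with the naming, the rows above are encoder chains that
compose roughly like this (illustrative; the exact package of these classes
moved around during 4.x):
{noformat}
// Each wrapper transforms the values and delegates to the inner encoder:
// sort, drop duplicates, gap-encode, then byte-encode.
IntEncoder packedChain = new SortingIntEncoder(
    new UniqueValuesIntEncoder(new DGapIntEncoder(new PackedIntsEncoder())));
IntEncoder vintChain = new SortingIntEncoder(
    new UniqueValuesIntEncoder(new DGapIntEncoder(new VInt8IntEncoder())));
IntEncoder fusedChain = new SortingIntEncoder(
    new UniqueValuesIntEncoder(new DGapVInt8IntEncoder())); // DGap folded in
{noformat}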
Notice that:
* Both Packed versions achieve worse compression than VInt (confirming Mike's
DV files results above).
* Both VInt8 versions decode much faster than the Packed versions.
* Surprisingly (though this may be an artifact of micro-benchmarking), Packed
(Mike's specialized encoder) decodes slower than PackedInts. I wonder if
luceneutil would confirm that too.
I think that if we had some statistics about the resulting Wikipedia index,
e.g. min/max ord per document (the final integers, after dgap), we could
better tell whether there's any point in pursuing this further. Since
luceneutil search perf is happier with VInt and, moreover, the resulting index
is ~25% smaller, I tend to think that packed ints won't get us far here.
With postings I think the numbers are smaller? E.g. for dense posting lists,
with docID gaps, you'll get a small bpv?
Mike, is it possible to print the min/max bpv used during indexing? For
instance, I printed that info when running EncodingSpeed and got 11 (there's a
single chunk to encode, so min=max). I wonder what the result would be for the
index. I added these as public static members to PackedIntEncoder, so I think
you can do the same for the index.
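Roughly the instrumentation I mean (hypothetical names; the actual patch just
exposes two public static ints on PackedIntEncoder):
{noformat}
// Track the smallest/largest bits-per-value chosen for any encoded chunk,
// so the numbers can be dumped once indexing finishes.
public static int MIN_BPV = Integer.MAX_VALUE;
public static int MAX_BPV = 0;

static void recordChunk(long maxValueInChunk) {
  // bits needed for the largest value in this chunk (at least 1)
  int bpv = Math.max(1, 64 - Long.numberOfLeadingZeros(maxValueInChunk));
  MIN_BPV = Math.min(MIN_BPV, bpv);
  MAX_BPV = Math.max(MAX_BPV, bpv);
}
{noformat}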
Also, 11 bpv means that every value is encoded in ~1.5 bytes, plus 1 header
byte. So for 25 ords we're looking at roughly 25 x 1.5 + 1 ≈ 39 bytes per
document. If most of the values are small though (and this happens with
ordinals, even with dgap, because the first value is the largest), we pay a
high penalty for each value, because bpv will be large.
Maybe in order to beat VInt we need a smarter encoder. E.g. the facets package
has FourFlags and EightFlags encoders, which are very good for really low
numbers (Four for 1,2,3 and Eight for 1). Gilad had an idea above to use a
static bpv=6 and encode all values in 1-63 as 0-62 in 6 bits, while values of
64+ are encoded as 63 (all 1's) followed by a VInt. I ran a short analysis
using EncodingSpeed (I should say here that these 2430 integers are actual
category ordinals of a real document from one application, so these are not
just made-up numbers):
* maxValue=1928 (requires 11 bits)
* numValuesLessThan64=2169 (out of 2430!)
So if we had encoded using Gilad's idea, we would end up with
* all 2430 values occupying a 6-bit slot = 14580 bits (1822.5 bytes)
* 261 "large" values each needing an extra VInt of at most 2 bytes (1928 fits
in two VInt bytes) = 4176 bits (522 bytes)
* avg bits/int = (14580 + 4176) / 2430 = 18756 / 2430 ≈ 7.72
Which is still better than VInt (8.56 bits/int in the table above). I omitted
the 1-byte header here because of the large number of values encoded at once,
but it would add some overhead for documents with very few categories (e.g.
OrdinalPolicy.NO_PARENTS). Also, this is the "best" compression we could get
for these numbers; if we actually encoded them I suspect it would be worse,
because we cannot reorder the values after sorting+dgap. E.g. with a run of 5
small values and 1 large one, we'd waste a couple of bits on each of the 5
small ones. Two bits here, two bits there, they add up.
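To make the idea concrete, here is a rough sketch of such an encoder/decoder
pair (my own guess at an implementation, with bytewise bit-packing for clarity
rather than speed; class names are made up):
{noformat}
// Values 1..63 take a 6-bit slot (stored as value-1, i.e. 0..62); larger
// values take the escape slot 63 (all 1's) followed by a standard VInt.
class SixBitEscapeEncoder {
  private long acc;      // pending bits; newest bits in the low positions
  private int accBits;   // how many pending bits acc holds
  private final java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream();

  private void writeBits(long value, int numBits) {
    acc = (acc << numBits) | (value & ((1L << numBits) - 1));
    accBits += numBits;
    while (accBits >= 8) {
      out.write((int) ((acc >>> (accBits - 8)) & 0xFF));
      accBits -= 8;
    }
  }

  void encode(int value) {         // value >= 1 (ordinals, after dgap)
    if (value < 64) {
      writeBits(value - 1, 6);     // 1..63 -> 0..62
    } else {
      writeBits(63, 6);            // escape marker
      int v = value;               // then a VInt inside the bit stream
      while ((v & ~0x7F) != 0) {
        writeBits((v & 0x7F) | 0x80, 8);
        v >>>= 7;
      }
      writeBits(v, 8);
    }
  }

  byte[] finish() {                // pad the last partial byte with 1's;
    if (accBits > 0) {             // the decoder stops by value count, so
      int pad = 8 - accBits;       // the padding is never decoded
      writeBits((1 << pad) - 1, pad);
    }
    return out.toByteArray();
  }
}

class SixBitEscapeDecoder {
  private final byte[] buf;
  private int pos;
  private long acc;
  private int accBits;

  SixBitEscapeDecoder(byte[] buf) { this.buf = buf; }

  private int readBits(int numBits) {
    while (accBits < numBits) {
      acc = (acc << 8) | (buf[pos++] & 0xFF);
      accBits += 8;
    }
    int v = (int) ((acc >>> (accBits - numBits)) & ((1L << numBits) - 1));
    accBits -= numBits;
    return v;
  }

  int decode() {
    int slot = readBits(6);
    if (slot < 63) {
      return slot + 1;             // 0..62 -> 1..63
    }
    int value = 0, shift = 0, b;
    do {                           // the VInt that followed the escape
      b = readBits(8);
      value |= (b & 0x7F) << shift;
      shift += 7;
    } while ((b & 0x80) != 0);
    return value;
  }
}
{noformat}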
Still, I wonder what the speed of such a decoder would be, and what the
statistics are for the real Wikipedia index, with its 25 ords per doc.
Encoders like the one above benefit greatly when there is a large number of
values (because there are more chances for small values).
> Write a PackedIntsEncoder/Decoder for facets
> --------------------------------------------
>
> Key: LUCENE-4609
> URL: https://issues.apache.org/jira/browse/LUCENE-4609
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/facet
> Reporter: Shai Erera
> Priority: Minor
> Attachments: LUCENE-4609.patch, LUCENE-4609.patch
>
>
> Today the facets API lets you write IntEncoder/Decoder to encode/decode the
> category ordinals. We have several such encoders, including VInt (default),
> and block encoders.
> It would be interesting to implement and benchmark a
> PackedIntsEncoder/Decoder, with potentially two variants: (1) receives
> bitsPerValue up front, when you e.g. know that you have a small taxonomy and
> the max value you can see and (2) one that decides for each doc on the
> optimal bitsPerValue, writes it as a header in the byte[] or something.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]