[ https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13555948#comment-13555948 ]

Shai Erera commented on LUCENE-4609:
------------------------------------

How come the order of the lines in the two tables is different? At first I 
thought you had misread the table, because the results don't look that bad :).

You can very easily push DGap into the encoder, but judging from LUCENE-4686, 
that bought us a ~6% improvement there, so it's not likely to double the 
decoding speed here.
Also, I see that you have a static MASK table. Given the recent thread about 
FixedBitSet and static mask tables (people were against them because a couple 
of bitwise ops do better), I wonder if getting rid of it would improve perf.
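For reference, the bitwise alternative is a one-liner; this is a sketch, and 
"bpv" is just my name for the bits-per-value variable:

{noformat}
// Computing the mask with two bitwise ops instead of a static table
// lookup (valid for bpv in 1..63):
static long mask(int bpv) {
  return (1L << bpv) - 1;   // e.g. bpv=11 -> 0x7FF, same as MASK[11]
}
{noformat}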

I think that the main problem with these ordinal encodings is that we usually 
encode small groups of integers. So just for fun, I ran EncodingSpeed with 4 
encoders/decoders, encoding 2430 values in one go (a single call to encode). 
The results are below:

{noformat}
Encoder                            Bits/Int   Encode [ms]   Encode [us/int]   Decode [ms]   Decode [us/int]
-----------------------------------------------------------------------------------------------------------
Sorting(Unique(DGap(PackedInts)))   11.0189          8561           85.6105           590           5.9000
Sorting(Unique(DGap(Packed)))       11.0058          8206           82.0605           806           8.0601
Sorting(Unique(DGap(VInt8)))         8.5597          7133           71.3305           263           2.6300
Sorting(Unique(DGapVInt8))           8.5597          7208           72.0805           251           2.5100
{noformat}

* PackedInts is an encoder similar to what Gilad wrote here; it uses 
PackedInts.Writer/ReaderNoHeader (and writes a 1-byte header itself).
* Packed is Mike's encoder.
* I wrapped all encoders with DGap (except for the last one, which is 
specialized); see the sketch below.
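For reference, the four measured chains compose roughly like this (a sketch; 
PackedIntsEncoder and PackedEncoder are placeholder names for the two packed 
encoders above, the rest are the existing classes in 
org.apache.lucene.util.encoding):

{noformat}
IntEncoder packedInts = new SortingIntEncoder(new UniqueValuesIntEncoder(
    new DGapIntEncoder(new PackedIntsEncoder())));   // placeholder name
IntEncoder packed     = new SortingIntEncoder(new UniqueValuesIntEncoder(
    new DGapIntEncoder(new PackedEncoder())));       // placeholder name
IntEncoder vint8      = new SortingIntEncoder(new UniqueValuesIntEncoder(
    new DGapIntEncoder(new VInt8IntEncoder())));
IntEncoder dgapVInt8  = new SortingIntEncoder(new UniqueValuesIntEncoder(
    new DGapVInt8IntEncoder()));                     // DGap folded in
{noformat}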

Notice that:
* Both Packed versions achieve worse compression than VInt (this confirms 
Mike's DV files results above).
* Both VInt8 versions decode much faster than the Packed versions.
* Surprisingly (though this may be an artifact of micro-benchmarking), Packed 
(Mike's specialized encoder) decodes slower than PackedInts. I wonder if 
luceneutil would confirm that too.

I think that if we had some statistics about the resulting Wikipedia index, 
e.g. min/max ord per document (the final integers, after dgap), we could tell 
better whether there's any point in pursuing these investigations further. 
Since luceneutil search perf is happier with VInt, and moreover the resulting 
index is smaller by ~25%, I tend to think that packed ints won't get us far 
here.

With postings, I think the numbers are smaller? E.g. for dense posting lists 
the docID gaps are small, so you'd get a small bpv (a term that appears in 
every other document has gaps of ~2, which need only ~2 bits).
Mike, is it possible to print the min/max bpv used during indexing? For 
instance, I printed that info when running EncodingSpeed and got 11 (there's a 
single chunk to encode, so min=max). I wonder what the result would be for the 
index. I added these as public static members to PackedIntEncoder, so I think 
you can do the same for the index.
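Roughly what I added, for reference (the actual names may differ):

{noformat}
// Sketch of the stat-tracking members mentioned above:
public static int MIN_BPV = Integer.MAX_VALUE;
public static int MAX_BPV = 0;

// ...and inside encode(), once bitsPerValue is computed for a chunk:
MIN_BPV = Math.min(MIN_BPV, bitsPerValue);
MAX_BPV = Math.max(MAX_BPV, bitsPerValue);
{noformat}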

Also, 11 bpv means that every value is encoded in ~1.5 bytes, plus a 1-byte 
header. So for 25 ords, we're looking at ~39 bytes per document. If most of 
the values are small though (and this happens with ordinals, even with dgap, 
because the first value is the largest), we pay a high penalty for each value, 
because bpv will be large.
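Spelling out the per-document arithmetic:

{noformat}
exact packed size:  25 values x 11 bits = 275 bits ~= 34.4 bytes, + 1 header byte
rounded estimate:   25 x 1.5 bytes + 1 = 38.5 ~= 39 bytes/doc
{noformat}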

Maybe in order to beat VInt we need a smarter encoder. E.g. the facets package 
has FourFlags and EightFlags encoders, which are very good for really low 
values (FourFlags for 1-3, EightFlags for 1). Gilad had an idea above: use a 
static bpv=6, encode all values in 1-63 as 0-62 in 6 bits, and encode values 
of 64+ as 63 (all 1's) followed by a VInt. I ran a short analysis using 
EncodingSpeed (I should say here that these 2430 integers are actual category 
ordinals of a real document from one application, so these are not just 
made-up numbers):

* maxValue=1928 (requires 11 bits)
* numValuesLessThan64=2169 (out of 2430!)

So if we had encoded using Gilad's idea, we would end up with:
* 261 "large" values, each a 6-bit escape plus at most a 2-byte VInt = at most 
22 bits/value = 5742 bits (~718 bytes)
* 2169 values encoded with 6 bpv = 13014 bits (~203.4 longs, ~1627 bytes)
* avg bits/int = 18756 / 2430 = 7.72

Which is better than VInt. I omitted the 1-byte header here, because of the 
large number of values encoded at once, but it would add some overhead for 
documents with very few categories (e.g. OrdinalPolicy.NO_PARENTS). Also, this 
is the "best" compression that we could get for these numbers; if we actually 
encoded them, I suspect it would be worse, because we cannot reorder values 
after sorting+dgap. E.g. if a run of 5 small values is followed by 1 large 
one, we would waste 2 bits on each of the 5 small values. 2 bits here, 2 bits 
there, they add up.
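To make the idea concrete, here is a minimal sketch of such an encoder 
(hypothetical code with my own names, not from any patch; it packs the 6-bit 
codes MSB-first and appends the VInt bytes to the same bit stream):

{noformat}
import java.io.ByteArrayOutputStream;

public final class SixBitEscapeEncoder {

  private static final int ESCAPE = 63;   // 6 bits, all ones

  /** Minimal MSB-first bit writer. */
  private static final class BitWriter {
    final ByteArrayOutputStream out = new ByteArrayOutputStream();
    int cur, used;                        // partial-byte accumulator

    void write(int value, int bits) {     // assumes 0 <= value < 2^bits
      for (int i = bits - 1; i >= 0; i--) {
        cur = (cur << 1) | ((value >>> i) & 1);
        if (++used == 8) { out.write(cur); cur = 0; used = 0; }
      }
    }

    byte[] finish() {                     // zero-pads the last byte
      if (used > 0) out.write(cur << (8 - used));
      return out.toByteArray();
    }
  }

  /** Encodes positive values (dgap'd ordinals are always >= 1). */
  public static byte[] encode(int[] values) {
    BitWriter w = new BitWriter();
    for (int v : values) {
      if (v < 64) {
        w.write(v - 1, 6);                // 1..63 stored as 0..62
      } else {
        w.write(ESCAPE, 6);               // escape marker
        int rest = v - 63;
        // VInt: 7 bits per byte, most-significant group first, high bit
        // set on all but the last byte; covers rest < 2^28, plenty here
        if (rest >= (1 << 21)) w.write(0x80 | ((rest >>> 21) & 0x7F), 8);
        if (rest >= (1 << 14)) w.write(0x80 | ((rest >>> 14) & 0x7F), 8);
        if (rest >= (1 <<  7)) w.write(0x80 | ((rest >>>  7) & 0x7F), 8);
        w.write(rest & 0x7F, 8);          // last byte, high bit clear
      }
    }
    return w.finish();
  }
}
{noformat}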

Still, I wonder what the speed of such a decoder would be, and what the 
statistics are for the real Wikipedia index, with its ~25 ords per doc. 
Encoders like the one above benefit greatly when there is a large number of 
values (because there are more chances for small values).
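For completeness, the matching decoder sketch (again hypothetical; the 
bit-at-a-time reader is written for clarity, a real decoder would pull whole 
longs instead, and the escape branch is exactly where the speed question gets 
decided):

{noformat}
public final class SixBitEscapeDecoder {

  /** Decodes 'count' values produced by the encoder sketch above. */
  public static int[] decode(byte[] bytes, int count) {
    int[] values = new int[count];
    int bitPos = 0;                       // absolute bit position
    for (int i = 0; i < count; i++) {
      int v = readBits(bytes, bitPos, 6);
      bitPos += 6;
      if (v != 63) {
        values[i] = v + 1;                // 0..62 -> 1..63
      } else {
        int rest = 0, b;                  // escape: read the VInt bytes
        do {
          b = readBits(bytes, bitPos, 8);
          bitPos += 8;
          rest = (rest << 7) | (b & 0x7F);
        } while ((b & 0x80) != 0);        // high bit = more bytes follow
        values[i] = rest + 63;
      }
    }
    return values;
  }

  /** Reads 'bits' bits MSB-first starting at 'bitPos'. */
  private static int readBits(byte[] bytes, int bitPos, int bits) {
    int result = 0;
    for (int i = 0; i < bits; i++, bitPos++) {
      int bit = (bytes[bitPos >>> 3] >>> (7 - (bitPos & 7))) & 1;
      result = (result << 1) | bit;
    }
    return result;
  }
}
{noformat}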
                
> Write a PackedIntsEncoder/Decoder for facets
> --------------------------------------------
>
>                 Key: LUCENE-4609
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4609
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/facet
>            Reporter: Shai Erera
>            Priority: Minor
>         Attachments: LUCENE-4609.patch, LUCENE-4609.patch
>
>
> Today the facets API lets you write IntEncoder/Decoder to encode/decode the 
> category ordinals. We have several such encoders, including VInt (default), 
> and block encoders.
> It would be interesting to implement and benchmark a 
> PackedIntsEncoder/Decoder, with potentially two variants: (1) receives 
> bitsPerValue up front, when you e.g. know that you have a small taxonomy and 
> the max value you can see and (2) one that decides for each doc on the 
> optimal bitsPerValue, writes it as a header in the byte[] or something.
