[ https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13573633#comment-13573633 ]

Shai Erera commented on LUCENE-4609:
------------------------------------

That will be interesting to test. In order to test it "fairly", we should 
either test the decoder (that's what we usually test) through the abstracted 
code (i.e. via CategoryListIterator), or, Gilad, if you can, copy 
CountingFacetsCollector and inline the new decoder code in place of the 
dgap+vint code. That will be simpler to test, with the least noise.

bq. I'm not sure SimpleIntEncoder was ever used

Mike and I tested it ... at some point :). I don't remember where we posted the 
results though, whether it was in email, GTalk or some issue. But I do remember 
that the results were worse than DGapVInt. We were always surprised by how 
fast DGapVInt is .. all along we assumed VInt would be expensive, but it may 
not be so expensive ... at least not on the Wikipedia collection.
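For context, the DGap+VInt idea can be sketched as follows (a minimal standalone sketch with made-up class and method names, not the actual Lucene encoder): sort the ordinals, encode the gap between consecutive ordinals rather than the absolute values, and write each gap as a VInt -- 7 payload bits per byte, with the high bit set when more bytes follow:

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Sketch of DGap+VInt: deltas between sorted ordinals, each written as a VInt.
public class DGapVIntSketch {

  // Write v as a VInt: 7 bits per byte, high bit = "more bytes follow".
  static void writeVInt(ByteArrayOutputStream out, int v) {
    while ((v & ~0x7F) != 0) {
      out.write((v & 0x7F) | 0x80);
      v >>>= 7;
    }
    out.write(v);
  }

  static byte[] encode(int[] sortedOrdinals) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int prev = 0;
    for (int ord : sortedOrdinals) {
      writeVInt(out, ord - prev); // encode the gap, not the absolute ordinal
      prev = ord;
    }
    return out.toByteArray();
  }

  static List<Integer> decode(byte[] bytes) {
    List<Integer> ords = new ArrayList<>();
    int pos = 0, prev = 0;
    while (pos < bytes.length) {
      int v = 0, shift = 0;
      byte b;
      do { // read one VInt
        b = bytes[pos++];
        v |= (b & 0x7F) << shift;
        shift += 7;
      } while ((b & 0x80) != 0);
      prev += v; // undo the delta
      ords.add(prev);
    }
    return ords;
  }
}
```

Small gaps cost one byte each, which is why the encoding ends up cheap when ordinals are dense.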

bq. the decoding speed is significantly faster

That's good, but Mike and I have already concluded that EncodingSpeed just .. 
lies :). It's a micro-benchmark, and while it showed significant improvements 
after I moved the encoders to the bulk API, in the real-world scenario it 
performed worse. I had to inline and specialize things even more for it to 
beat how things worked previously.

I will be glad if SemiPacked is faster .. but judging from past experience, I 
won't get my hopes up too high :).

As for this encoding algorithm, it all depends on how many values actually fall 
into the 256 range. That's another problem w/ EncodingSpeed -- it uses a 
real-world scenario from a crazy application which encoded 2430 ordinals for a 
single document! You can see that the encoded values are small, e.g. by 
looking at the NOnes bits/int. I suspect that in real life there won't be 
many values that fall into that range, at least after some documents have been 
indexed, because when you have a single category per dimension in a document, 
the chances that their values will be "close" are slim.
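To illustrate why such a scheme's win hinges on how many gaps are small (this is purely a hypothetical escape-byte encoder for illustration -- it is not the algorithm in the SemiPacked patch, and all names are made up): if small gaps get one byte but large gaps pay an extra escape byte, the encoding only wins when most gaps fall under the threshold:

```java
import java.io.ByteArrayOutputStream;

// Hypothetical byte-or-escape gap encoder, for illustration only.
// Gaps in [0, 255) take one byte; larger gaps cost an escape marker (0xFF)
// plus four raw bytes. Whether this beats plain VInt depends entirely on
// the fraction of gaps that fall under the escape threshold.
public class SmallGapSketch {
  static byte[] encode(int[] sortedOrdinals) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int prev = 0;
    for (int ord : sortedOrdinals) {
      int gap = ord - prev;
      prev = ord;
      if (gap < 0xFF) {
        out.write(gap);        // hoped-for common case: one byte per gap
      } else {
        out.write(0xFF);       // escape marker
        out.write(gap >>> 24); // four raw bytes, big-endian
        out.write(gap >>> 16);
        out.write(gap >>> 8);
        out.write(gap);
      }
    }
    return out.toByteArray();
  }
}
```

With one category per dimension per document, gaps tend to be large, so most values would hit the expensive escape path.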

But .. we should let luceneutil be the judge of that :). So Gilad, can you make 
a patch with a SemiPackedCountingCollector? Also modify the default that 
FacetCollector.create returns, so that it's easy to compare base (CountingFC) 
to comp (SemiPackedCFC). If you want to test the collector, run 
TestDemoFacets (as-is) and CountingFCTest (modifying the collector it uses) to 
make sure the collector works.
                
> Write a PackedIntsEncoder/Decoder for facets
> --------------------------------------------
>
>                 Key: LUCENE-4609
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4609
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/facet
>            Reporter: Shai Erera
>            Priority: Minor
>         Attachments: LUCENE-4609.patch, LUCENE-4609.patch, LUCENE-4609.patch, 
> LUCENE-4609.patch, LUCENE-4609.patch, SemiPackedEncoder.patch
>
>
> Today the facets API lets you write IntEncoder/Decoder to encode/decode the 
> category ordinals. We have several such encoders, including VInt (default), 
> and block encoders.
> It would be interesting to implement and benchmark a 
> PackedIntsEncoder/Decoder, with potentially two variants: (1) receives 
> bitsPerValue up front, when you e.g. know that you have a small taxonomy and 
> the max value you can see and (2) one that decides for each doc on the 
> optimal bitsPerValue, writes it as a header in the byte[] or something.
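The second variant described in the issue -- a per-document bitsPerValue derived from that doc's max ordinal and written as a header -- could look roughly like this (a standalone sketch with invented names, not Lucene's PackedInts API; the real encoder would also need to record the value count somewhere, which is assumed to be known here):

```java
// Sketch of per-document packed encoding: one header byte holding the
// bitsPerValue for this doc, followed by the ordinals packed MSB-first.
public class PerDocPackedSketch {
  static byte[] encode(int[] ords) {
    int max = 0;
    for (int o : ords) max = Math.max(max, o);
    // Minimum bits needed to represent the largest ordinal (at least 1).
    int bpv = Math.max(1, 32 - Integer.numberOfLeadingZeros(max));
    long totalBits = (long) ords.length * bpv;
    byte[] out = new byte[1 + (int) ((totalBits + 7) / 8)];
    out[0] = (byte) bpv; // header: bitsPerValue for this doc
    long bitPos = 8;
    for (int o : ords) {
      for (int i = bpv - 1; i >= 0; i--, bitPos++) {
        if (((o >>> i) & 1) != 0) {
          out[(int) (bitPos >>> 3)] |= 1 << (7 - (bitPos & 7));
        }
      }
    }
    return out;
  }

  static int[] decode(byte[] bytes, int count) {
    int bpv = bytes[0];
    int[] ords = new int[count];
    long bitPos = 8; // skip the header byte
    for (int n = 0; n < count; n++) {
      int v = 0;
      for (int i = 0; i < bpv; i++, bitPos++) {
        v = (v << 1) | ((bytes[(int) (bitPos >>> 3)] >>> (7 - (bitPos & 7))) & 1);
      }
      ords[n] = v;
    }
    return ords;
  }
}
```

The header costs one byte per document, but lets each doc use exactly as many bits per value as its own max ordinal requires.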

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
